David:

This is not a citation, but can help you see why “researchers can regularly get p less than .05 even in the presence of pure noise”:

Go to

http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

and look at the items labeled:

Jerry Dallal’s Simulation of Multiple Testing

Jelly Beans (A Folly of Multiple Testing and Data Snooping)

More Jerry Dallal Simulations: More Jelly Beans Cellphones and Cancer Coffee and …
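For anyone who would rather run the multiple-testing point than click through, here is a minimal sketch in Python. It is not taken from the linked pages: the function names, the 20 tests per study, the 30 observations per test, and the known-variance z-test are all assumptions made for illustration. Every data set is pure noise, yet most simulated studies contain at least one p below .05.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mu = 0, assuming sigma is known."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_with_a_hit(n_studies=2000, n_tests=20, n_obs=30, alpha=0.05, seed=1):
    """Fraction of all-noise studies in which at least one test gives p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each "study" runs n_tests independent tests on pure noise.
        pvals = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_tests)
        ]
        if min(pvals) < alpha:
            hits += 1
    return hits / n_studies


print(share_with_a_hit())  # simulated share of studies with a "finding"
print(1 - 0.95 ** 20)      # analytic value for 20 independent tests, about 0.64
```

The simulated share should land near the analytic value; raising n_tests (more outcomes, more subgroups) pushes it toward 1.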

Another empirical option from another standpoint (this time about choice of standard error estimates in difference-in-difference estimates) from the applied microeconomics literature:

http://zmjones.com/static/causal-inference/bertrand-qje-2004.pdf

There are actually lots of ways for researchers to end up massively over-rejecting “true” null hypotheses. Andrew’s (well, Simmons & co.’s) paper points to behaviors that cause trouble even when the probability calculations are at least theoretically close to right. The paper I link to focuses on a different problem: using methods that seem like they should produce reasonable probability calculations (coverage rates of confidence intervals) but in application fail to produce anything close to reasonably sized confidence intervals. And neither of these touches issues like the Garden of Forking Paths, where the coefficients reported really might be “statistically significant” in terms of their magnitude relative to their precision, but are not actually meaningful or consistent features of the world. You can find improbable draws in any dataset, and given the flexibility of theory in many social sciences, you can always find a way to reasonably interpret those “effects” in terms of the underlying theory. I’d count that as a way for researchers to routinely find p<0.05 in the presence of pure noise, but I don’t have an empirical demonstration, simulation, or placebo test reference for that (it is really just the result of probability calculations, multiple hypothesis testing over many outcomes/subgroups, and the flexibility of theory to explain anything).

David:

The classic reference on this one is Simmons, Nelson, and Simonsohn (2011).

Thank you for your elucidating article, Andrew.

I am intrigued, but have searched in vain (not terribly extensively, though I did do a cursory scan of your published articles) for a citation for your “researchers can regularly get p less than .05 even in the presence of pure noise”, which I would like to use in an article I’m writing.

Any suggestions?

Thank you again,

David

I saw the course has been posted: Common Mistakes in Using Statistics: Spotting Them and Avoiding Them.

Many thanks for the link, a lot of good information.

May 2015 SSI Course (or 2016 when I get it posted in late May): Common Mistakes in Using Statistics: Spotting Them and Avoiding Them.

Biostatistics Guest Lecture

M358K Instructor Materials

Blog: Musings on Using and Misusing Statistics

An observation on the problem of teaching P-values. I come to this from oceanography/environmental science. Scientists in my community have been so indoctrinated into the P-value concept that they often stop thinking. A colleague just took a college stats course, taught by an ecologist, in which P-values and null hypothesis testing were treated as essential.

P-values are too common, even when irrelevant. But the worst part is that many of our observational data sets are not samples; they are actually the population. So a P-test is applied to “How much is rainfall increasing in NY over the last 30 years?” or “What is the relationship between satellite biomass and temperature over Long Island Sound in the last 15 years?” This use of P-values to include or exclude complete data sets is common and accepted. (Good commentary by Nicholls, 2001, Bulletin of the American Meteorological Society.)

I see papers accepting ecologically insignificant trends because p<0.05, and papers accepting trends that are false (they should not have used least squares regression, for example when an El Niño produces an outlier at one end of the data set), again because p<0.05. And then there are the data mining papers.

I’ve tried to work on this one review at a time, asking people to report effect sizes, misfit, or uncertainties, to deemphasize or delete p-values, and to use appropriate statistics.

(Unfortunately, I’m also self-taught, and pretty much a hack.) The teaching needs to change, and the editorial practices at the journals I know of also need to change. If the ASA makes a bigger deal about the elementary stuff, that might help.

Bert:

Freedman was a good writer, even if much of what he wrote made no sense.

http://www.math.rochester.edu/people/faculty/cmlr/Advice-Files/Freedman-Shoe-Leather.pdf

(see also some of his references)

I will not comment further.

Cheers,

Bert

Shravan,

Thanks for the link to the R-Blogger piece. I would not call it a defense of p-values so much as a caution not to interpret the ASA report as saying don’t use p-values — in particular, pointing out “responsible” ways of using p-values, and that the cautions about p-values also apply to many other statistical techniques.

“I have a hypothesis that selective sustained attention is mediated by verbal representations. So I set up an experiment in which individuals complete several trials on which they are instructed to track a moving target novel shape in the midst of distractors and indicate where it was before it disappeared from the screen. An experimental group is taught names for the target shapes, while the control group is familiarized with the shapes but receives no label training. I predict the experimental group will be better on average at tracking the target shapes than the control group.”

http://statmodeling.stat.columbia.edu/2016/03/07/29212/#comment-265856

First, let’s get rid of the statistical aspect of this problem. Forget p-values, Bayes factors, and all of that. Assume we know for a fact, with 100% confidence, that the experimental group always performs better than the control group under these conditions. The problem is that there are other explanations you need to address. Further, this list is going to be, for all practical purposes, endless because your prediction is too vague. Here are a few:

1) Having names for the target shapes makes them more memorable and thus easier to track.

2) The “instructor” verbally assigning the labels spends more time on those that are on the screen most often (or otherwise leaks information somehow).

3) The process of familiarizing the control group actually confuses them, or tires them out, etc.

4) The names given by the instructor are shorter or more memorable than the labels each subject would “self-assign” on average

5) The experimental group gets more “training” with the shapes because it takes extra time to verbally assign the labels

Some of these alternatives may be interesting in their own right, but others would just be boring experimental artifacts. So p under .05 does not mean anything interesting is going on. On the other hand, if the difference between the two groups was exactly zero on average, that would also be an interesting result. How is it that people perform so consistently on this task?

Here is a defence of p-values:

Sabine, you wrote:

“I predict the experimental group will be better on average at tracking the target shapes than the control group. Assuming a difference in the predicted direction, I test whether the difference is surprisingly large under the null hypothesis. Assume my sample is a healthy size, capable of the effect if there is one. And also assume there were attempts to measure the dv with precision (e.g., using multiple observations, etc.) And lo and behold, p is less than .05.

Can you tell me why this is not informative?”

Let’s say you expected mu to be positive.

Two possible scenarios with p<0.05:

1. Your sample mean is positive in sign and you get the p<0.05. You can rule out (with alpha probability of being wrong under hypothetical repeated runs of the experiment) that there is no effect (i.e., mu=0).

This is now a publishable result.

2. Your sample mean is negative in sign and you get the p<0.05. You can rule out (with alpha probability of being wrong under hypothetical repeated runs of the experiment) that there is no effect (i.e., mu=0).

This is no longer publishable as your hypothesis was not supported by the sample means.

The p-value in both cases gave you the same information (mu!=0, possibly), but the decision as to whether you have support for your particular hypothesis doesn't come from the p-value at all. In (1) we would be happy, and in (2) we would be sad. For the same p-value.

Maybe read Gelman et al on Type S and M errors and run some simulations to understand what this really means for your studies. This stuff is not just theoretical: as an exercise, for your kind of research question, try running the same experiment five times with real subjects (literally the same setup, different subjects from the same subject pool) and watch the means flip-flop. That is what is happening to me. I take a published result in a major journal, replicate it as exactly as I can, and get the opposite pattern or a mean close to 0. Even my own experiments' results flip flop all over the place. Harvard professors might suspect that maybe I just don't know how to do experiments (I don't work in a prestigious university, i.e., not in the US). It's possible. So try it out yourself.
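The “run some simulations” suggestion can be sketched in a few lines of Python. This is only an illustration in the spirit of the Type S and M calculations, not code from any of the papers mentioned: the function name, the true effect of 0.1, and the standard error of 0.2 are made-up assumptions, and the estimate is treated as normally distributed around the true effect.

```python
import random
from statistics import mean


def retro_design(true_effect, se, n_sims=100_000, z_crit=1.96, seed=1):
    """Simulate replications of one study and summarize the significant ones.

    Returns (power, type_s, type_m):
      power  - share of replications with |estimate| > z_crit * se
      type_s - share of significant estimates with the wrong sign
      type_m - average |significant estimate| divided by |true effect|
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > z_crit * se]
    power = len(significant) / n_sims
    type_s = mean(1.0 if e * true_effect < 0 else 0.0 for e in significant)
    type_m = mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m


# Hypothetical small effect measured noisily (assumed numbers, not real data).
power, type_s, type_m = retro_design(true_effect=0.1, se=0.2)
print(f"power={power:.3f}  wrong-sign share={type_s:.3f}  exaggeration={type_m:.1f}x")
```

With numbers like these, power is low, a noticeable share of the significant results have the wrong sign, and the significant estimates overstate the true effect several times over, which is the flip-flopping described above.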

Sabine,

(I got this in the wrong place in the thread, so here is another better-placed try.)

In your March 11, 10:30 pm comment, you say, “I test whether the difference is surprisingly large under the null hypothesis.”

How do you propose to test for this?

E:

That reminds me of a story that I’ll have to tell here sometime . . . Anyway, short answer to your question is that it’s not necessary to add water to the tubes randomly, you should just include tube number as a regression predictor in your analysis. We discuss this sort of thing in chapter 8 of BDA3 and chapter 9 of ARM.

Molecular biology consists largely of moving little drops of water (or 99.9% water with buffer or protein or DNA or whatever) from tube to tube, often in sets of ten or twelve: tube one is condition one, tube two is a replicate, tube three is a variation on one, etc.

To keep sane, and to make sure you do not make a mistake, people do tube one, then tube two, and so on.

A stats person would say to add water to the tubes randomly, but then you would never do the experiment right, since these are all done totally by hand.

Any help here?

Thanks for your elaboration of this point. I think the problem I am having is that the discussion is divorced from specific examples of what I currently think are defensible uses of null hypothesis testing that yield information worth having, and thus it’s difficult for me to evaluate if the argument strongly implies that p-values are worthless.

So here is a less controversy-laden hypothetical. I have a hypothesis that selective sustained attention is mediated by verbal representations. So I set up an experiment in which individuals complete several trials on which they are instructed to track a moving target novel shape in the midst of distractors and indicate where it was before it disappeared from the screen. An experimental group is taught names for the target shapes, while the control group is familiarized with the shapes but receives no label training. I predict the experimental group will be better on average at tracking the target shapes than the control group. Assuming a difference in the predicted direction, I test whether the difference is surprisingly large under the null hypothesis. Assume my sample is a healthy size, capable of the effect if there is one. And also assume there were attempts to measure the dv with precision (e.g., using multiple observations, etc.) And lo and behold, p is less than .05.

Can you tell me why this is not informative? Sure, one could say that there may be other reasons than the labels that could explain why participants do better in the labeling condition, but careful design/matching of conditions on all aspects other than the manipulation could minimize this possibility. One could also say this isn’t a test of a well-developed theory, or a comparison of theories, and therefore not an interesting contribution, but I think this is debatable. I may be really interested in whether language plays a role in such cognitive processes, and this seems to be a suitable framework that allows me to test the question I’m interested in.

Anoneuoid,

Nice quote and link. The Meehl idea for an appropriate null hypothesis is a gem.

Martha,

Thanks for alerting me to Bookstein. There is an interesting teaser here:

“In place of p-values there is an unusual concentration on crucial details of measurement—where suites of variables come from, how calibration of machines can maximize their reproducibility—that are almost always overlooked in textbooks of statistical method. The exceptions to this generalization, such as the 1989 book by Harald Martens and Tormod Næs on multivariate analysis of mass spectrograms or (I note modestly) my 1991 book on the biomathematical foundations of morphometrics, are rare but, when successful, prove to be citation classics partly by reason of that rarity.

But they all share one central rhetorical concern: consilience (Dogma 6), the convergence of evidence from multiple sources. Now consilience requires a relatively deep understanding of the way that such multiple sources relate to a common hypothesis. To have a reasonable chance of making sense in these domains we must take real (physical, biophysical) models of system behavior (the organism on its own, and the organism in interaction with our instruments) as seriously as we typically take abstract (statistical) models of noise or empirical covariance structure. Serious frustrations and paradoxes can easily arise in this connection. Over in the psychological sciences, Paul Meehl once wisecracked that most pairs of variables are correlated at the so-called “crud factor” level of ±0.25 or so. It is this correlation, not a correlation of zero, that represents an appropriate null hypothesis in these sciences. Closer to home, in my own application domain of morphometrics, landmark shape distributions are never spherical in shape space. The broken symmetries are properly taken not as algebraic defects in our formulas, but as biological aspects of the real world; they are signal rather than lack of fit. A few years ago Kanti Mardia, John Kent and I published a quite different model for a-priori ignorance in morphometrics, an intrinsic random field model in which noise is self-similar at every scale. I am still awaiting news that somebody has tried to fit their data to that.”

https://www1.maths.leeds.ac.uk/statistics/workshop/lasr2010/proceedings/L2010-05.pdf

Sabine,

Anoneuoid’s last two paragraphs make some good points. To see some instances of them, you might try reading Measuring and Reasoning, by Fred Bookstein.

In response to Sabine’s 10:44 am comment, where she said:

“I’ve learned a bit about Bayesian statistics (hoping to learn more) and I can see how this approach may be of more value than null hypothesis testing, but I believe Andrew suggests that using Bayes factors is no better than using p-values.”

Yes, Andrew does not support using Bayes factors. I believe this is at least in part because they still have the dichotomous nature of p-values, but I believe there are other problems with them as well (but don’t recall at the moment what they are). My understanding is that his preferred approach is to model each problem individually, and use the posterior to help understand what is going on and make decisions.

Also, your comment “what I’m really trying to understand is why someone would argue that they have no good use even under ideal circumstances” seems to neglect the reality that ideal circumstances just do not seem to exist in real life problems.

Sabine replied to Shravan,

“My interpretation of Andrew’s comments is that even if all experiments involved your situation c, he would still reject null hypothesis testing”

As Shravan said, we don’t know whether a given experiment fits situation a, b, or c, so your question is not one about the real world.

But would Andrew still reject null hypothesis testing even if the world were a fantasy one where all experiments were in situation c? I don’t really know; I can’t read his mind. He might, because one of his objections to null hypothesis testing in the real world is that it posits a “yes, no” situation. But he might not, if indeed the real world were a simpler one. But the question is really moot, since we live in the real world, not in a fantasy one.

Sabine, plug a number like 2.49841 into the inverse equation solver here: http://mrob.com/pub/ries/

After a few seconds, you should see about two dozen relatively simple algebraic equations that approximate that number to within 10%. If that value were an experimental result, there could be multiple theories that allow deduction of each of those algebraic equations. Now imagine if we allowed all algebraic functions that were consistent with a value in the same direction (i.e., positive or negative). How many different theories would be consistent with the result then?

Here is my point: To distinguish between different explanations for an observed pattern/result, we need both theories making precise predictions and precise measurements. The “alternative hypothesis” that usually maps to the research hypothesis is too vague to be of any use in distinguishing between different explanations for the observations. Default-nil-NHST is of no use to the scientist who wants to distinguish between different plausible explanations. This is innate to that procedure, making it fatally flawed.

While it is not an innate property of NHST, the worse problem is that in practice this procedure allows proliferation of vague pseudosciency theories and discourages collection of precise observations. Researchers never collect precise observations, because they only feel the need to rule out the “null” hypothesis. Theorists can’t distinguish between their theories, because the observations are too imprecise (and biased because only results where a large difference between two conditions was observed are published).

Also, if you have high power this doesn’t mean you can assume the null is false even before you do the study, right? (Otherwise why do the null hypothesis test?) So once you do the study, you are still surrounded by fog: are you in the world where the true mu=0, or are you in the world where the true mu has the sign and magnitude you expect it to have? The p-value might tell you mu!=0, but what you really want to know is whether the true mu has the sign and magnitude you expect it to have. The p-value answers the wrong question, one you didn’t really want an answer to.

Even if you think you have high power, you would still want to replicate your result to confirm that you can get a robust result. And there the informative thing is not the p-value but the replicability of the result.

Hi Sabine, you write

“in my hypothetical example I’m using the p-value to test whether a difference in a predicted direction would be highly surprising/improbable under the null. I’m assuming I have good power to detect the effect.”

Presumably you’d have to have looked at previous work to estimate power. Due to the existence of Type S and M error, your estimate could be a wild overestimate (people often don’t publish a result if it comes out significant but in the wrong direction; this is how people manage to have 30-40 year long careers with consistent results that nobody else can replicate). So your assumption that you have high power is not a certainty but just a hope. That’s why you’re still left face to face with a p-value and the three situations (a), (b), (c), and you don’t know which possible world you are in. The single p-value and the associated null hypothesis test will tell you nothing about the actual hypothesis for this reason. Replications will (but they don’t need p-values, just consistent outcomes).

Thanks! I’ve learned a bit about Bayesian statistics (hoping to learn more) and I can see how this approach may be of more value than null hypothesis testing, but I believe Andrew suggests that using Bayes factors is no better than using p-values.

I understand that much of the dissatisfaction with using p-values arises from how they are misused, but my point is that when used properly they do seem to be informative, and I think the logic behind null hypothesis testing is elegant (if often misconstrued). Again, what I’m really trying to understand is why someone would argue that they have no good use even under ideal circumstances (the straw man null hypothesis comment).

Okay, but in my hypothetical example I’m using the p-value to test whether a difference in a predicted direction would be highly surprising/improbable under the null. I’m assuming I have good power to detect the effect. True, this could still be a false positive, but that’s beside the point. I’m asking why, even under these ideal circumstances, would someone argue that null hypothesis testing is useless. My interpretation of Andrew’s comments is that even if all experiments involved your situation c, he would still reject null hypothesis testing, and I want to know why and whether there is sound justification for that or whether it is going too far.

Ailce:

What you describe as “Bayesian” is not the way I like to do Bayesian statistics. See chapter 7 of BDA3 or my 1995 paper with Rubin for more on this point.

So, in a frequentist framework, you’re still comparing models by setting one as the “null hypothesis” and then computing a p-value. But there are lots of other conceptual frameworks for this.

Bayesian: assign a prior probability to each model, combine that with the model’s likelihood to get a probability of each model being the true model. You can use Bayes factors as p-value analogues.

Information-theoretic: compute AIC or BIC for each model (likelihood with a complexity penalty) to get an estimate of which model is “best” (either for prediction (AIC) or as an approximation to the Bayesian approach (BIC)). Use *IC weights to get weights for each model.

Machine learning: use cross-validation or a held out data set and check each model’s out of sample predictive accuracy.

L1-regularization and sparse priors: design your analysis so that exactly 0 coefficients are favored, do model selection automatically. Use cross-validation to determine an appropriate level of regularization.

The unifying theme is that you need some way to determine the appropriate level of complexity by balancing number of free parameters and goodness of fit. Frequentist tests usually try to get at this by a) really discouraging exploratory and post-hoc analyses and b) setting the decision rule such that the simpler model is the null and proving a more complex hypothesis is hard.
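To make the information-theoretic option concrete, here is a minimal Python sketch of AIC and Akaike weights. The three models, their log-likelihoods, and their parameter counts are invented for illustration; only the formulas (AIC = 2k - 2 log L, and weights proportional to exp(-delta/2)) are standard.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik


def akaike_weights(aics):
    """Turn AIC values into relative weights that sum to 1."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2) for a in aics]  # relative likelihoods
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical fits: (maximized log-likelihood, number of free parameters).
models = {
    "intercept only": (-120.0, 1),
    "one predictor": (-115.0, 2),
    "five predictors": (-114.0, 6),
}

aics = [aic(ll, k) for ll, k in models.values()]
for name, a, w in zip(models, aics, akaike_weights(aics)):
    print(f"{name:16s} AIC={a:6.1f}  weight={w:.3f}")
# weights come out at roughly 0.02, 0.94, 0.05
```

The five-predictor model fits slightly better than the one-predictor model but pays for its extra parameters, which is the balance between goodness of fit and complexity described above.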

Sabine: You are not completely alone; for instance, see the 9_Greenland_Senn_Rothman_Carlin_Poole_Goodman_Altman.pdf in the ASA supplement http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

Now, their explanation for not abandoning them seems to be the lack of a currently agreed-upon better alternative (i.e., growing old is not great, but better than the alternative).

For instance, this comment “Bayesian statistics offers methods that attempt to incorporate the needed information directly into the statistical model; they have not however achieved the popularity of P-values and confidence intervals, in part because of philosophical objections and in part because no conventions have become established for their use.”

There really need to be conventions [understanding] established for their use.

I do believe if any alternative is not well sorted out and explained widely to statisticians and others in this regard, in X years we will just have another as long or longer guide to misinterpretations. (The issue is getting at what CS Peirce called the pragmatic grade of understanding of the concept, beyond just the proper definition and identification of it.)

Also, if you run many replications, and if you repeatedly get p<0.05, then you can start to be sure that your effect might be real. But this information is not coming from the p-value; it's coming from the replications. So, in the interesting case where you can replicate your effect, the p-value is giving you nothing, it's the consistent sign and magnitude of the sample mean that is giving you something. (This is what Andrew calls the "secret weapon".)

Well, once you’ve rejected the null that mu=0, there is an infinity of possible values implied by mu!=0. If you did a power pose study and rejected the null that mu=0, would you be willing to write that we have evidence that the effect of power posing is to increase or decrease mu? No, of course not. You look at the sign of your sample mean and if it is positive, you conclude that power posing increases mu. However, the p-value didn’t give you that. The p-value only helped you reject the hypothesis that mu=0. The next step, to argue for your favored alternative hypothesis, that mu is positive, is not based on the null hypothesis test itself. Also, when you do a single statistical test, and get p=0.0001, you don’t know which of three possible worlds you are in: (a) the null is actually true and you just got “lucky”, (b) the null is false but you are in a low power situation, in which case you are in danger of suffering from Type S and M errors (wrong sign, exaggerated effect), or (c) you are in a high power situation (in which case you are golden). If all our experiments involved situation (c) life would be good. But they don’t. It’s more likely with Bem, Cuddy, the red color and sex studies that mu is very close to zero (just because it’s implausible to think that people have ESP, and that life is just a matter of waving your arms in the air, or that there are simple outward indicators of biological events), so we usually end up in situation (a) or (b), which means that most published studies are just publishing noise.

Can you clarify: don’t model comparison approaches still involve null hypothesis testing (e.g., that the fit of one model is surprisingly better than the fit of another, under the null)? How else are the models evaluated with respect to one another?

I understand your points that these things are rarely done, but what I’m asking is whether it is the case that when these things ARE done, does someone taking Andrew’s position still find p-values worthless. I use power posing as a hypothetical example (i.e., assuming a large sample, a prespecified hypothesis and appropriate model to test it, no changing the data analysis upon seeing the data, etc.). I’m just seeking to understand why someone would not find value in p-values when these conditions are met. Specifically, if it comes down to the null being a straw man, I don’t see why that is so in a scenario like this. Is contrasting a hypothesis against the null really so weak? I know it’s argued that this is weak because the null is never really true, and if that’s what he’s getting at, I’d be interested in knowing. I’m not sure I find that a compelling argument to abandon p-values but I’m willing to entertain the thought.

PS

Daniel Lakeland just gave a good one-line summary of the problems with p-values on another thread (http://statmodeling.stat.columbia.edu/2016/03/10/good-advice-can-do-you-bad/#comment-265719):

“Garden of Forking Paths” means little more than “p < 0.05 is easy to find”

Sabine,

Nobody’s blaming the p-values.

But saying, “They allow you to indicate how improbable your results are under the null, and take this as support for your theory” misses many important points. To list just two:

1. The p-value depends on the model. So to rationally justify using a p-value as support for a theory, you need to provide a good reason why the model adequately fits the theory and the question being asked about it. This is rarely done.

2. The p-value depends on the sample size. So at the very least, you need to consider what sample size gives you a good chance of detecting a difference of practical importance. This is also rarely done.

In particular, I don’t think either of these points was addressed in the power pose case that you are using as an example.
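Point 2 is easy to see with a small calculation. The Python sketch below is an illustration under stated assumptions, not an analysis of the power pose data: it uses a one-sample z-test with the standard deviation taken as known and equal to 1, and the effect of 0.1 standard deviations and the function names are made up.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value_for(observed_mean, n, sigma=1.0):
    """Two-sided z-test p-value for H0: mu = 0, given the sample mean and n."""
    z = observed_mean / (sigma / math.sqrt(n))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def n_for_power(effect, power=0.80, alpha=0.05, sigma=1.0):
    """Sample size at which a two-sided z-test has roughly the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)


# The same observed mean of 0.1 SD gives very different p-values as n grows.
for n in (20, 200, 2000):
    print(n, round(p_value_for(0.1, n), 4))

print(n_for_power(0.1))  # 785 observations for about 80% power
```

The observed difference is identical in all three rows; only the sample size changes, and the p-value moves from clearly non-significant to far below .05. That is why the sample size has to be chosen with a practically important difference in mind.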

Okay, I understand the view that p-values are widely abused, but that’s not the p-values’ fault. They still seem to have utility when used correctly. They allow you to indicate how improbable your results are under the null, and take this as support for your theory. What I am really wanting to know is why Andrew thinks this is uninformative, and really my question is specifically asking about his view that looking at the improbability of one’s results under the null is always or even often so deficient as to be completely uninformative.

Model comparison approaches are your best bet. Comparing multiple plausible models allows for more in-depth analysis of strengths and weaknesses and avoids the straw man problem. There are Bayesian, information theoretic, machine learning and even frequentist tools for this.

I’m not aware of (and doubt that there is) any “objective” way that we can infer anything probabilistic from averages. However, any individual is free to use results about averages to come up with a subjective probability or otherwise use a result about averages in their own personal decision making.

So you are saying we can’t infer anything probabilistic from averages? This seems extreme. Of course power posing could be terrible for me and my self-concept, but insofar as it seems to be generally beneficial for a sample that adequately represents me, I think it would be reasonable for me to give it a try (if I care to improve my feelings of power).

More simply put: we don’t have solutions, but rather just (hopefully sensible) ways to struggle through the observations we somehow get.

Wouldn’t that make a great motto for a statistical society?

> just fit a Bayesian model and report the posterior and let your detractors deal with it

Now, if we only had a clear understanding of what to make of posterior probabilities that could be widely communicated…

To the general scientific community!

I like the analogy for science that Peirce offered: standing in a bog where the ground seems secure, ready to move when we realize it’s giving way. Statistics is just the science of inductive inference, informed/enabled by math, and I believe the same analogy is apt. There are no sure methods/solutions available, rather aspects we think for now we sort of get not too wrong. Unfortunately, working in math continually seems to mislead many of us into thinking that somewhere/someday there will be sure methods/solutions (e.g. Dennis Lindley’s interview with Tony O’Hagan, where he talked about finally making statistics a rigorous axiom-based subject like all other areas of math).

“Am I naive to think this would count as evidence that brief power poses can help make me feel better about myself and that I should try them?”

How about just getting and plotting estimates of the mean and the uncertainty intervals with very large sample sizes and a repeated measures design, from many replications? What does the p-value add beyond the information you would gain from those summaries?

The p-value is just a ritual incantation we use to justify our journal article’s existence. I have recently been reading some papers from an author (a winner of millions of Euros through grants) who literally made up the p-values, as in the published p-values are not even remotely related to the published t-scores (I’m not talking about rounding errors, but things like t=0.1, p<0.001).
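A mismatch like t=0.1 with p<0.001 can be checked in a couple of lines. The Python sketch below uses a normal approximation to the t distribution, which is an assumption made to keep it dependency-free; the function names are hypothetical. Because the t distribution has heavier tails than the normal, the exact p-value for a given t can only be larger than the one computed here.

```python
from statistics import NormalDist


def approx_two_sided_p(t_stat):
    """Two-sided p-value for a test statistic, using the normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


def looks_inconsistent(t_stat, reported_p_upper_bound):
    """True if even the normal-approximation p exceeds the reported 'p < bound'."""
    return approx_two_sided_p(t_stat) > reported_p_upper_bound


print(round(approx_two_sided_p(0.1), 2))  # 0.92
print(looks_inconsistent(0.1, 0.001))     # True
```

A statistic of 0.1 corresponds to a p-value near 0.92, nowhere near 0.001, so no rounding story can reconcile the two.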

Also, I've recently been reading several papers published in top journals where *none* of the effects in the planned comparisons were statistically significant but the authors p-hacked their way to significance, using various tricks. Basically, all you have to do is ensure that there is *some* p-value somewhere that is below 0.05. I have done this myself, in an earlier phase of my life.
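How easy is it to find *some* p-value below 0.05? A rough simulation sketch, under assumptions that are mine and not taken from any of the papers mentioned: two groups of pure N(0, 1) noise, a z-test with known unit variance, and independent outcome measures. The function names and sample sizes are made up for illustration.

```python
import math
import random

def p_value_two_groups(rng, n):
    """Two-sided z-test p-value comparing two groups of pure N(0, 1) noise."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)  # assumption: known unit variance in each group
    return math.erfc(abs(diff / se) / math.sqrt(2))

def share_with_some_significant(n_studies, n_outcomes, n_per_group, seed=0):
    """Fraction of simulated null studies with at least one p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        ps = [p_value_two_groups(rng, n_per_group) for _ in range(n_outcomes)]
        if min(ps) < 0.05:
            hits += 1
    return hits / n_studies

# One outcome: about 5% of null studies "work". Twenty outcomes: most of them do.
print(share_with_some_significant(1000, 1, 20))
print(share_with_some_significant(1000, 20, 20))
print(1 - 0.95 ** 20)  # theoretical rate for 20 independent tests, about 0.64
```

Real outcome measures are usually correlated, which lowers the rate somewhat, but flexible exclusions, covariates, and subgroups push it back up.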

The p-value just lends scientific credibility to the statement, "hey, I'm right after all". If the p value falls above 0.05, but only just, there are some 500 euphemisms available in psychological sciences for arguing that yes, I was right (when I am in that situation, I write, "did not reach the conventional level of significance", but I only write it as an insider joke now). There is no possible universe in which Gilbert or Cuddy or anyone else would publish a paper saying, guys, I can't find evidence for my theory. Try naming some people in your field who published evidence against their own theory. In a 30-40 year career it is statistically impossible that a researcher will never find evidence inconsistent with their theory (even if it's a mistake).

So, if there is only a unidirectional outcome possible with the p-value, why bother? Just publish your means and SEs, as a sort of poor man's Bayesian approach, and move on. Or just fit a Bayesian model and report the posterior and let your detractors deal with it.

]]>Sabine,

Sorry to say, but yes, I do think you would be “naive to think this would count as evidence that brief power poses can help make [you] feel better about [your]self and that [you] should try them”.

One reason: The type of “findings” you discuss are about averages (or some similar overall measure), not about individuals. So even if power poses on average help people feel better about themselves, we can make no conclusions about their effect on individuals; some individuals might even feel worse.
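A back-of-the-envelope sketch of how a positive average can coexist with many individuals who do worse. It assumes, purely for illustration, that individual effects are normally distributed around the average; the spread `sd_effect` is hypothetical, and a typical between-subjects study cannot estimate it at all.

```python
import math

def share_with_negative_effect(mean_effect, sd_effect):
    """P(effect < 0) when individual effects ~ Normal(mean_effect, sd_effect)."""
    return 0.5 * math.erfc(mean_effect / (sd_effect * math.sqrt(2)))

# Same positive average effect (0.2), very different stories for individuals.
print(share_with_negative_effect(0.2, 1.0))  # wide spread: about 42% are worse off
print(share_with_negative_effect(0.2, 0.1))  # narrow spread: about 2% are worse off
```

The average alone cannot distinguish these two cases, which is why the conclusion for any one person depends on how much the effect varies across people.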

]]>Just to clarify, you said above:

“I think the problem comes because researchers are typically not using p-values or hypothesis testing to test a model they care about. That is, they’re not doing stringent testing or severe testing or Popperian reasoning or whatever. Rather, they’re rejecting straw-man null hypothesis A as support for their preferred alternative B.”

I haven’t read enough of your blog to know how you define a straw-man null hypothesis (feel free to point me to a relevant previous blog post), so I’m trying to understand why you think p-values and the like have to go. It seems to me that p-values are okay when used appropriately by scientists who actually care about testing a hypothesis/theory versus telling a splashy story that will advance their career/make them a celebrity. But you are suggesting that even if one follows the rules and takes into account analyses they would have done had the data been different, that’s still not enough to make p-values useful in science. Are you saying this because you think they are always likely to be abused or because you think they really can’t ever tell us anything informative?

Let’s say Amy Cuddy’s finding was actually a true effect and that the steps taken to test the hypothesis/analyze the data were completely appropriate, and the p-value was significant. And let’s say the finding was replicated multiple times. Would you still find this completely uninformative? Sure, the null hypothesis is no effect of power poses on psychological and behavioral states, but is that really that lame/deficient? What would be a better alternative? Am I naive to think this would count as evidence that brief power poses can help make me feel better about myself and that I should try them?

]]>I think a counterexample might be the PACE trial that Andrew wrote a lot about earlier this year (e.g. http://statmodeling.stat.columbia.edu/2016/01/13/pro-pace/). You could have a lot of subjects, you can randomize them and blind everyone to conditions, etc. But if you have multiple potential measures of interest (and particularly if you might be willing to change your mind about which ones are “important”), then you have the opportunity for forking paths. On the other hand, if you take Andrew’s advice to model everything, then perhaps you would see that even the effects selected for p-value significance aren’t that impressive.

]]>Agree, but think people need to read through all the submissions in the supplement – which I have started on but not completed ;-)

These two comments seemed to be pulling in concerns that I would like to try to summarize at some time, but for now –

re: the first comment, very informed scientific reasoning is needed to arbitrate the assumptions required so that the p-value (or any other approach) is in any way sensible.

And for the second, inference (making sense of others’ empirical research even if one was involved) is not just challenging but something likely beyond most statisticians – given it is seldom if ever being addressed in today’s MSc/PhD stats training (see Don Berry quote above).

I think the challenge in making the report was to avoid being too clear about what is beyond statistics and what else is beyond most of today’s statisticians – in what it is expected/desired that p-values and other statistical math techniques should/can provide (in the hands of most of today’s statisticians) to help make sense of empirical research.

]]>Andrew, how much impact did Red State Blue State have? My feeling is that anyone who implements Andrew’s recommendation here (to embrace uncertainty and variation) will get a big fat rejection from a journal article submission, unless someone enlightened about the issues is reviewing it and editing it (almost never the case). Even when embracing variation, one has to somehow engineer a story, which inevitably takes us into speculation dressed up as a certainty.

]]>I think that first-in-human Phase 1 safety trials approach this. The subjects are generally young males in good health. They are living in a lab because they get so many tests and can all eat the same food. Also the studies are often set up as crossover designs so blood levels are compared within the same subject. This data is used to show the pharmacokinetics of the test drug and also to flag any unexpected lab results.

Once the drug is given to sick people, confounding and unexpected results do become a problem. But drug studies have extensive protocols and analysis plans that give direction on how to handle data.

]]>