Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit

People point me to things on the internet that they’re sure I’ll hate. I read one of these a while ago (unfortunately I can’t remember who wrote it or where it appeared), and it raised a criticism, not specifically of me, I believe, but more generally of skeptics such as Uri Simonsohn and myself who keep bringing up p-hacking and the garden of forking paths.

The criticism that I read is wrong, I think, but it has a superficial appeal, so I thought it would be worth addressing it here.

The criticism went like this: People slam classical null-hypothesis-significance-testing (NHST) reasoning (the stuff I hate, all those “p less than .05” papers in Psychological Science, PPNAS, etc., on ESP, himmicanes, power pose, air rage, . . .) on two grounds: first, that NHST makes no sense, and second, that published p-values are wrong because of selection, p-hacking, forking paths, etc. But, the criticism continues, this anti-NHST attitude is itself self-contradictory: if you don’t like p-values anyway, why care that they’re being done wrong?

The author of this post (that I now can’t find) characterized anti-NHST critics (like me!) as being like the diner in that famous Borscht Belt routine, who complains about the food being so bad. And such small portions!

And, indeed, if we think NHST is such a bad idea, why do we then turn around and say that p-values are being computed wrong? Are we in the position of the atheist who goes into church in order to criticize the minister on his theology?

No, and here’s why. Suppose I come across some piece of published junk science, like the ovulation-and-clothing study. I can (and do) criticize it on a number of grounds. Suppose I lay off the p-values, saying that I wouldn’t compute a p-value here in any case so who cares. Then a defender of the study could easily stand up and say, Hey, who cares about these criticisms? The study has p less than .05, this will only happen 5% of the time if the null hypothesis is true, thus this is good evidence that the null hypothesis is false and there’s something going on! Or, a paper reports 9 different studies, each of which is statistically significant at the 5% level. Under the null hypothesis, the probability of this happening is (1/20)^9, thus the null hypothesis can be clearly rejected. In these cases, I think it’s very helpful to be able to go back and say, No, because of p-hacking and forking paths, the probability you’ll find something statistically significant in each of these experiments is quite high.
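To make that arithmetic concrete, here is a minimal sketch. It rests on a crude simplifying assumption that is not in any of the papers mentioned: each study offers k independent analysis choices, each with a 5% chance of reaching p less than .05 under the null. Real forking paths are not independent tests that all get run, so this is only a stand-in, and the function names are hypothetical.

```python
def prob_any_significant(k, alpha=0.05):
    """Chance that at least one of k independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** k


def prob_all_studies_significant(n_studies, k, alpha=0.05):
    """Chance that every one of n_studies yields at least one significant result."""
    return prob_any_significant(k, alpha) ** n_studies


# One prespecified test per study: this is the (1/20)^9 calculation.
naive = prob_all_studies_significant(9, k=1)

# Twenty available analyses per study (an assumed number, for illustration only).
forked = prob_all_studies_significant(9, k=20)

print(f"one prespecified test per study: {naive:.2e}")
print(f"20 possible analyses per study:  {forked:.3f}")
```

Under these assumptions the second probability is many orders of magnitude larger than the first, which is the sense in which nine significant results in a row are much weaker evidence than (1/20)^9 suggests.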

The reason I’m pointing out the problems with published p-values is not because I think researchers should be doing p-values “right,” whatever that means. I’m doing it in order to reduce the cognitive dissonance. So, when a paper such as power pose fails to replicate, my reaction is not, What a surprise: this statistically significant finding did not replicate!, but rather, No surprise: this noisy result looks different in a replication.

It’s similar to my attitude toward preregistration. It’s not that I think that studies should be preregistered; it’s that when a study is not preregistered, it’s hard to take p-values, and the reasoning that usually flows from them, at face value.

133 thoughts on “Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit”

  1. The usual NHST (hereafter referring to that using a nil null hypothesis not derived from any theory) makes no sense, that is why the p values generated by that procedure are used and interpreted incorrectly by so many people. There is no right way to do NHST. People have convinced each other they need to perform these “meaningless, pedantic calculations”, so there simply *must* be some use for it. That is the initial error that leads to all others.

  2. The original Jackie Mason comedy hour concerned the common fallacy of taking a statistically significant result at level p as more impressive when it results from a larger than a smaller sample size. If you tell me the food covers k% of your dish, I need to know how big the dish is to infer how large the helping is.

    I’ve no time just now to do more than list some relevant problems with some of today’s p-bashing:
    The p-value critics are happy to trust p-values when they show results too good to be true and a variety of other QRPs, even when the upshot is the loss of someone’s career.

    “P-values can’t be trusted except when used to argue that p-values can’t be trusted”.


    Then there’s the replication paradox: It’s claimed to be too easy to get low p-values, but it’s found to be hard to get them with preregistered results (showing the problem isn’t with p-values, but failing to adjust for biasing selection effects). Then critics, many of them, turn around and advance methods that are insensitive to biasing selection effects!
    “The paradox of replication and the vindication of the p-value”


    That’s “throwing out the error control baby with the bad statistics bathwater”:

    Besides, if you don’t have statistical tests, you don’t have falsification:

    “If you think it’s a scandal to be without statistical falsification you will need statistical tests”

    Nor do you have most of the methods of fraudbusting and statistical forensics:
    “Who ya’ gonna call for statistical fraudbusting”

    Finally, critics imply (without specific argument) their own account & rival accounts avoid the problems to which significance tests are open: a “Dale Carnegie Fallacy” (at the end of this post):


    • >’Besides, if you don’t have statistical tests, you don’t have falsification:
      “If you think it’s a scandal to be without statistical falsification you will need statistical tests”’

      You seem to be continually talking past the problems discussed on this blog. As explained many times, the #1 problem is that the “hypothesis” being falsified in NHST is worthless. The only use for its falsification is to subsequently make affirming the consequent errors, which most people intuitively know are wrong. They actually need to have it pounded out of their heads in stats class and during dissertation proposal reviews. NHST is so dumb that people make up myths about the meaning of p-values rather than believe all those highly trained folks have been wasting each other’s time with such nonsensical rituals. That is the truth about what has been going on.

      Now, change the hypothesis to something derived from a theory/model and it is a totally different story.

      • Yes, Mayo often derides the affirming the consequent errors, but fails to acknowledge how widespread and important that is in actual practice.

        I think we can all agree that *if a real scientific hypothesis predicts a frequency distribution with which certain things should occur, and then you test that frequency distribution using a statistical hypothesis test and find that it doesn’t occur, then you have done something actually scientific* namely you’ve statistically falsified a prediction of a theory. But, the theory must predict a specific frequency distribution, for example “lottery drawing machine A produces uniform draws from all the balls in the machine” or “decay of a Foo particle produces Bar particles precisely F percent of the time” or “genetic mutation X produces biological outcome Y in 13 +- 3 percent of the population, biological outcome Z in 38+- 4 percent of the population and biological outcome Q in 49 percent of the population”

        The fraction of the time that these specific frequency predictions are part of a theory and are the things being tested in NHST in actual academic papers is pretty negligible.

        • “Yes, Mayo often derides the affirming the consequent errors, but fails to acknowledge how widespread and important that is in actual practice.”

          Exactly, as linked from and discussed here* I have pointed out to her multiple times that at least two major founders of NHST (Student and Neyman) make the same affirming the consequent error when they attempt to apply it to real world problems. The reason is that the *only* use of NHST is to subsequently commit that logical error.


      • “P-values can’t be trusted except when used to argue that p-values can’t be trusted.”

        As I understand this statement (and the blog to which you link below it), it’s a little misleading.

        The problem with p-values lies not specifically in the act of measuring statistical significance. It lies, rather, in the cloaking of the question. Those who believe in p-values, or who haven’t questioned their use, assume the null hypothesis is rejected if the p-value is low enough. They glance at the p-value, think, “Oh, ok, good,” and move on.

        To criticize the use and misuse of p-values is not to criticize all measures of statistical significance. It’s quite possible to employ and analyze measures of statistical significance without using p-values (and the typical gullibility surrounding them).

        I am speaking as an amateur here, but I think Daniel Lakeland just made a similar point from a much more informed perspective. I am saying this anyway because it’s important to me to get a handle on this stuff. The misunderstanding and exploitation of p-values is so widespread it isn’t funny.

        • Mayo’s comment (“P-values can’t be trusted except when used to argue that p-values can’t be trusted”) is a good deal more silly than that.

          It’s like me selling you a brand new blood test machine (call it the Thoranos if you like). You decide to test the reliability of this machine. So you take a pinprick of your blood and the test comes back positive for cancer. One second later you use the method again on a second pinprick and the test comes back negative for cancer. Then we have the following conversation:

          You: “I don’t have cancer, your machine is unreliable”.

          Me: “Ooooh, so my machine can’t be trusted unless it’s used to prove my machine can’t be trusted”

          • I may be over interpreting Mayo’s post, but my understanding of what she is saying in that post is that there are two potential problems with p-values:

            (1) p-values are not used properly by all (or even most?) researchers
            (2) p-values are in no way useful, even when used properly

            I don’t think there’s a statistician out there that will argue with (1). However, Mayo and I both think that a lot of people who bash p-values provide evidence for (1) and then turn around and claim they have shown (2).

            The distinction is important! If the problem with statistics, as applied to confirming scientific hypotheses, is (2), then we should work to replace the use of p-values with something more rigorous. On the other hand, if (1) is the cause of most of the problems in practice, then perhaps it’s more important to move toward simplicity. One could argue these are opposite directions!

            • (2) is false. p values are quite useful in testing whether a given long sequence of data values does or does not come out of a given random distribution. When you have a hypothesis “this data set came out of a particular random process” you can use testing to see if that is true or not. This is true essentially by definition of a random sequence (one which passes very strong statistical tests for randomness, see Per Martin-Lof’s paper I discuss here http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/ ) however, outside of that well defined purpose, it doesn’t seem there is much else we can do with tests. This is because of how tests are defined.

              I give examples of how you would use p values legitimately based on the above idea at this blog post:


              The p value is quite simply defined as the probability of seeing a test statistic more extreme than the one actually observed if the data were a random sample from a given distribution. It’s very literally about testing whether your data looks strange compared to what a random number generator would produce!!! If you’re randomly sampling some values from nature (ie. you’re selecting at random a bunch of 1 second long seismometer recordings from a large number of continuous recording seismometers) then it’s legit to ask whether a particular subset of the recordings is “strange” compared to the others. Filtering data (not filtering hypotheses). That’s a good use of p values.

              • And what about the case of clinical trials? Consider that despite lots of research and investigation before the actual phase III trial, a very large number of trials fail to conclude efficacy. This is based on a p-value.

                But what if we had done a Bayesian analysis instead to get the posterior probability of a practically significant effect? In order for a company to decide to sponsor a drug, they must have a lot of evidence that it will work. Thus, there is no way that they do not start with a very strong informative prior, in support of efficacy. Likewise, the FDA will not approve a trial unless they believe it has a good chance of being successful. So the FDA must have a strongly informative prior supporting efficacy as well (probably because they listened to the drug company).

                But let’s revisit the fact that clinical trials fail at very high rates! This is strong evidence that, quite frankly, the prior knowledge is consistently very wrong (my thought: because people doing the pre-phase III work are not doing analyses as rigorous, and they understand that good-looking results = raises)! But surely, had we included those priors in these trials, we would have approved many more drugs.

                So it seems reasonable to me that there are some situations in which you want to see how strongly the data (and the data alone) measure the evidence against a null hypothesis.

                With that in mind, I find the whole “p-values and confidence intervals are only meaningful in a world where an experiment is repeated a large number of times” to be a bit of a straw man argument. If you have the ability to abstract probability to be your personal belief, then surely you have the ability to abstract confidence to mean “I ran a procedure that works correctly 95 percent of the time, so I’m 95 percent confident that it worked”.
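A rough simulation sketch of what “a procedure that works correctly 95 percent of the time” means. Everything specific here is an assumption made for illustration: normal data with known standard deviation, a sample size of 25, 2000 repetitions, and the hypothetical function name `coverage`.

```python
import random
from statistics import NormalDist, mean


def coverage(true_mean=0.0, sigma=1.0, n=25, reps=2000, seed=1):
    """Fraction of simulated 95% z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)      # two-sided 95% critical value
    half_width = z * sigma / n ** 0.5    # known-sigma interval half-width
    hits = 0
    for _ in range(reps):
        xbar = mean(rng.gauss(true_mean, sigma) for _ in range(n))
        # The interval xbar +/- half_width covers true_mean exactly when this holds.
        hits += abs(xbar - true_mean) <= half_width
    return hits / reps


print(f"simulated coverage: {coverage():.3f}")
```

The simulated fraction should land near 0.95. Note that this describes the procedure over repetitions; whether that licenses “95 percent confident” for one particular interval is the point under dispute in this thread.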

                Finally, don’t get me wrong, I love priors! There are plenty of analyses that need them (for example, see pre-Bayesian optimal design, and how hacked together and non-optimal it is in comparison with Bayesian optimal design). But my problem with p-bashing is that I find it often to be too much of a straw man argument without proposing any real solutions to the real difficult problem: how do you make statistics accessible, usable and transparent to people with very deep knowledge of some topic other than statistics?

              • Cliff,

                From a Bayesian perspective, we should do a trial if we believe in a positive expected value for the outcome. That need not mean that we have a strong prior about the outcome definitely being positive (that is, a concentrated prior around a single value), only that the range extends far enough into the positive that we think positive is part of the high probability region, and positive is where the overall average works.

                For example, on some scale where 1 is a very good practical effect, if we have a normal(0.1,3) prior, we should do our trial, but our prior suggests a 50% chance that the effect size is less than 0.1 and includes negative outcomes of size -1, -2 etc in the high probability region.
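The numbers in that example can be checked directly. This sketch assumes the second argument of normal(0.1,3) is the standard deviation (as in Stan’s parameterization); the variable names are just for illustration.

```python
from statistics import NormalDist

# Prior on the effect size; assumes 3 is the standard deviation, not the variance.
prior = NormalDist(mu=0.1, sigma=3)

p_below_mean = prior.cdf(0.1)       # prior mass below the prior mean
p_negative = prior.cdf(0.0)         # prior mass on negative (harmful) effects
p_very_good = 1 - prior.cdf(1.0)    # prior mass above a "very good" effect of 1

print(f"P(effect < 0.1) = {p_below_mean:.3f}")
print(f"P(effect < 0)   = {p_negative:.3f}")
print(f"P(effect > 1)   = {p_very_good:.3f}")
```

So a prior with a positive mean can still put nearly half its mass on negative effects, which is the sense in which running the trial does not imply a strong belief in efficacy.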

                Furthermore, the decision is made not on how large the expected effect is, but on how large the expected *monetary profit* is. If a tiny edge over some other drug means you get 100% of the Rxes you could be making billions a year for not much effect. Further, the decision is made based on the actual regulatory methodology in place, which is p values not Bayesian estimates, and p values have a “lottery” aspect to them. So you might well get lucky and make billions even if nothing much is really going on (just ask some of the big cancer drug manufacturers).

                So, although your argument sounds plausible at first, there’s really no reason to believe that A: we’re required to use the drug company’s skewed and artificially concentrated priors, or B: that the fact that the company is doing the trials means that the drug company really has any kind of strong priors on the size of the effect their drug will have (where here I mean “strong” as “concentrated about a single value”).

                Also, clinical trials fail at very high rates under current methodology. That current methodology is very politically skewed, and does not involve making decisions using real-world tradeoffs between costs and benefits, it involves making decisions using thresholds for significance.

            • One thing that puzzles me is – what constitutes “using p-values properly”? As Fisher originally intended? In a Neyman-Pearson decision framework? (Neyman hated that!) NHST? something else?

              • Simon:

                Neyman may have said he hated the use of p-values in a decision framework, but on at least one occasion it seems that he did, without careful reflection, do that NHST thing. See the quote here, from 1976:

                The application of the chosen test left little doubt that the lymphocytes from household contact persons are, by and large, better breast cancer cell killers than those from the controls.

                Back then, NHST was such a dominant mode of thinking that even people such as Neyman who hated it, seemed to be using it without thinking twice. In this case Neyman was following the logic of demonstrating the truth or plausibility of favored hypothesis B by rejecting straw-man null hypothesis A. That’s bread-and-butter NHST, it’s the thing that Meehl was criticizing decades ago and which I and others are criticizing right now.

      • Non anon, I meant genuine falsification of scientific hypotheses. Because of limits and uncertainties in predictions, there’s very often a statistical prediction (which meets with a statistical induction). Then there’s the testing at the other end of the spectrum: testing assumptions of statistical models.

    • Mayo, not having read in detail all of those posts, I will say that I think the statement “if you don’t have statistical tests, you don’t have falsification” is just flat out wrong if you insist that “statistical tests” = NHST.

      A Bayesian can falsify a statistical hypothesis by putting that and all the competing hypotheses into a calculation in which each hypothesis has some prior probability and then looking at the posterior probability associated to the hypothesis. Although there might come out of it something that looks like a p value “p(Hypothesis_A)” that quantity is a posterior probability which is very different from an NHST p value.

      Posterior Probability of Hypothesis_A: “The probability under the assumptions of the model that the data generating mechanism A is at work rather than the various B,C,D etc which were also considered in the calculation.”

      NHST P value for Hypothesis_A: “The frequency with which repeated application of the test to data actually coming from the data generating mechanism A would give answers more extreme in the test statistic than that actually observed in the data”

      So, it’s perfectly possible to show that we should view A as implausible under a Bayesian only calculation without ever resorting to a repeated application frequency based statistical test.
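A toy sketch of the two quantities side by side. All specifics are made up for illustration: a coin with 15 heads in 20 flips, hypothesis A (fair coin, probability 0.5) against a single competitor B (probability 0.7), and equal prior probabilities.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, heads = 20, 15          # made-up data
prior_a = prior_b = 0.5    # assumed equal prior probabilities

like_a = binom_pmf(heads, n, 0.5)   # likelihood under A (fair coin)
like_b = binom_pmf(heads, n, 0.7)   # likelihood under B (biased coin)

# Posterior probability of A, given that only A and B were considered.
post_a = prior_a * like_a / (prior_a * like_a + prior_b * like_b)

# One-sided p-value under A: chance of at least this many heads from a fair coin.
p_value_a = sum(binom_pmf(k, n, 0.5) for k in range(heads, n + 1))

print(f"posterior P(A | data): {post_a:.3f}")
print(f"p-value under A:       {p_value_a:.3f}")
```

The two numbers answer different questions and need not agree: the posterior depends on which competitors were included and on their priors, while the p-value refers only to A.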

      • Daniel, I really enjoy reading your contributions to Andrew’s blog and I have also read some posts on your own blog. Could you please elaborate a little more on your position about using a ‘model indicator’ variable for posterior inference? I mean it in an epistemological way, considering the following:

        Andrew has stated clearly his position against most hypothesis testing paradigms – no matter if it is NHST, BF or whatever. Here in the blog and in BDA, he defends a continuous model expansion method: design the model based on some substantive theory and useful approximations, check where the model doesn’t fit the data in meaningful ways and modify it as needed. The usual worry about overfitting in model search is somewhat lessened by the way Bayesian inference regularizes the estimates.

        I like Andrew’s position because it sounds pretty Popperian to me: formulating conjectures (model building) can follow many strategies and it is not really part of the logic of scientific discovery; what matters for corroborating a model is how well it survives by making predictions and comparing them to data.

        Posterior inference on a set of models, on the other hand, seems to be the kind of reasoning that bayesian epistemologists use to justify the bayesian paradigm as a complete description of the scientific method. But it clashes with Popper’s view in many ways: first, Popper clearly states that theories with high explanatory power should, by definition, have low probabilities, so we should seek theories that explain more in contrast to theories that have high probability; attribution of probabilities to theories is clearly difficult, not because frequentists say we shouldn’t attribute probabilities to fixed truth values, but because the model space is too big for any interesting model to have non-negligible probability. There’s also the issue of ‘enumerating’ all possible models, which is impossible, but ignored when we do posterior inference on model indices.

        Now, I’m not saying that Popper has the last word in the definition of the scientific method, but I really like his views and I’m trying to understand better how to make an agreement between bayesian methods and Popper’s position.

        • Erikson, thanks for your kind words. I’d be happy to elaborate.

          First off, I consider continuous model expansion and discrete model expansion as two sides of the same coin, namely considering many alternatives and filtering them down to find which ones can withstand comparison to data.

          I fully agree with the things you mention about Andrew’s position in your second paragraph, and I don’t think they are actually at odds with anything I suggest. On the other hand, in science it seems to be frequently the case that there are competing classes of theories. For example one theory might say that there’s an explanation for a certain kind of cancer involving exposure to certain chemicals in the environment, and another theory might suggest a certain behavioral or life choice issue (diet, or age at first pregnancy or whatever).

          In these contexts where there is substantive disagreement about different classes of mechanisms, you should consider both classes. Suppose we have HE the hypothesis of the environment and HB the hypothesis of a behavioral choice. Each one of these might within it have a whole range of continuous parameters that describe how the hypothesis predicts what data will occur. Bayes makes it possible to do the math consistently in the continuous or the discrete case, or the continuous within the discrete case… any case really. In particular, you can do something like this, where I’m being pedantically explicit:

          P(Data | K(HB,HE)) = p(Data | HE, HEparams,K(HB,HE)) p(HEparams|HE, K(HB,HE)) p(HE | K(HB,HE)) + p(Data | HB, HBparams,K(HB,HE)) p(HBparams|HB,K(HB,HE)) p(HB | K(HB,HE))

          where K(HB,HE) is the knowledge we have about whether HB or HE might be true and what that implies for the mathematical form of the likelihoods and priors etc

          Now, suppose p(HB | K(HB,HE)) is taken to be q and p(HE | K(HB,HE)) = 1-q and q is given a hyper-prior of beta(1,1) (uniform on 0,1).

          We put all this into Stan and pull samples. If the posterior samples for q cluster around 0.97, we take this as evidence that HB predicts the observed outcomes much better than HE does. On the other hand, we could give q a beta(100,1) prior, in which case we already believed q was close to 1; if q then winds up close to zero (so that HE seems plausible), we know that the data strongly favor the environmental explanation even though we initially thought it was hokum.

          Of course, it’s very likely in many fields of study that we don’t get decisive tests of theories. But thinking about the Bayesian method, you can at least see how it explicitly instructs you to look for experiments where one theory predicts a certain kind of outcome where all other theories would predict something else. And that helps you define the decisive experiment or observation. See Laplace’s comment on how that works.

        • So, in some sense, my example with q a continuous parameter allows you to take two example explanations and treat them in a continuous model expansion way. As for the impossibility of enumerating all possible models. I think this is more a logical issue than a practical issue. In science we have some popular models of how something works. Rarely do we have more than 10, much less say 10^6 different models. So put them all in and may the best model win.

          However, from a logical perspective, there are all these strange models “out there”. For example, how about the set of all Stan models that correctly compile and run on the given data whose code size is less than 10^12 bytes, modulo models that are identical except for name changes of the parameters…. This is still a ridiculously large number of models. There are 256^(10^12) different byte strings of that length; of course a huge number of them won’t compile, but the base-10 logarithm of the number of raw possibilities is 10^12*log10(256), so the number is around 2.4 trillion digits long.
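The size of that count is easy to check without ever forming the number itself. A small sketch, using the 10^12-byte limit from the comment and counting only strings of exactly that length (shorter programs add comparatively little):

```python
import math

code_bytes = 10**12   # maximum source size mentioned above

# Work with logarithms, since 256**(10**12) is far too large to compute directly.
log10_count = code_bytes * math.log10(256)   # base-10 log of 256^(10^12)
ln_count = code_bytes * math.log(256)        # natural log of the same number

decimal_digits = math.floor(log10_count) + 1  # digits when written in base 10

print(f"natural log of the count: {ln_count:.3e}")
print(f"decimal digits in count:  {decimal_digits:.3e}")
```

The natural logarithm comes out near 5.5 trillion, while the number of decimal digits is near 2.4 trillion; either way the count is astronomically larger than the handful of models anyone actually proposes.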

          So, there’s a logical concern, but in practice there are only ever a few models being discussed in the scientific literature. Is this a philosophical problem that needs an answer? Not any more than the question of why when I take a bite of chicken sandwich does it not transform in my mouth to a piece of banana? Logically it could, the relative number of oxygen, carbon, hydrogen, and trace mineral components could well be the same so that a bite of chicken sandwich and a bite of a banana when say fully burned in a bomb calorimeter could be reduced down to the same final set of simple molecules, but I don’t worry too much that it doesn’t happen on a regular basis and I don’t feel a need to put the chicken-banana transform into each of my high school chemical reaction lectures or whatever.

          Practically speaking, if we’re looking at a small set of competing models within a scientific community, we can work with Bayesian methods to rule out those that don’t correctly predict the data. If in the end we still have competing hypotheses with nontrivial probabilities, it’s an indication that we might need to look harder for alternative explanations or decisive experiments.

          • I don’t think every model really needs be considered, only those with “high enough” p(H)*p(O|H). Think about Bayes’ rule:

            p(H[0]|O) = p(H[0])*p(O|H[0]) / sum( p(H[0:n])*p(O|H[0:n]) )

            H[0] = some specific hypothesis of interest
            H[1:n] = the “other” 1 to n hypotheses
            O = the observations

            Imagine we have sorted all the H[i] according to the value of p(H[i])*p(O|H[i]). It is clear that at some point the effect of H[i] on p(H[0]|O) becomes negligible and can be dropped from the denominator. For most use cases it is really only the first few terms that need to be considered down there.
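A small numerical sketch of that truncation. The priors and likelihoods below are invented purely for illustration, and `posterior_h0` is a hypothetical helper, not anything from a library.

```python
def posterior_h0(priors, likelihoods, keep=None):
    """p(H[0]|O), keeping only the `keep` largest competing terms in the denominator.

    With keep=None every competing hypothesis is included.
    """
    numerator = priors[0] * likelihoods[0]
    # Products p(H[i]) * p(O|H[i]) for the competitors, largest first.
    others = sorted(
        (p * l for p, l in zip(priors[1:], likelihoods[1:])), reverse=True
    )
    if keep is not None:
        others = others[:keep]
    return numerator / (numerator + sum(others))


# Made-up numbers: H[0] plus six competitors, most of which fit the data badly.
priors = [0.30, 0.30, 0.20, 0.10, 0.05, 0.03, 0.02]
likelihoods = [0.20, 0.10, 0.05, 1e-4, 1e-5, 1e-6, 1e-7]

full = posterior_h0(priors, likelihoods)
truncated = posterior_h0(priors, likelihoods, keep=2)

print(f"all competitors:    {full:.6f}")
print(f"top two competitors: {truncated:.6f}")
```

With these numbers, dropping the four smallest terms barely moves the answer. The approximation is only as good as the claim that the dropped terms are small, so a newly proposed hypothesis with a large product means recomputing.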

            • Absolutely. This is more or less the asymptotic approximation. Whenever you add a new model in you need to recalculate because you truncated it out earlier. Not a problem so long as you understand that a+b+c+ eps ~ a+b+c but only approximately.

              • Thanks, Daniel, for the exposition!
                I think you and Anoneuoid have given a really interesting mathematical solution to a philosophical problem.

              • >”I think you and Anoneuoid have given a really interesting mathematical solution to a philosophical problem.”

                The insights that drop neatly and easily out of Bayes’ rule are amazing. I continually marvel at the fact people at the highest levels of education are taught absolute nonsense like NHST with zero mention of Bayes’ rule at all. That is one of the biggest scandals of our age. I suppose maybe it impedes nuclear proliferation, etc.

          • RE asymptotic/perturbation approximations. Only works for regular perturbation approximations, not singular. Unfortunately singular phenomena and bifurcations are everywhere when applying mathematical models to the world (or just in general). Philosophically you can make the analogy to normal science vs paradigm shifts.

            Andrew has effectively made the same points numerous times in numerous ways, but people still subscribe to ‘Naive Bayes Solves Everything’ (NBSE). It seems no more ‘philosophically justified’ than the notorious NHST.

            Of course most of this is orthogonal to the main issues at hand, but I think it’s worth repeating that these hand waving explanations are not philosophically or mathematically well justified.

            Bayes/probability theory is a useful tool but is not magic.

            • We’re not talking about asymptotic approximations *within* the models, we’re talking about asymptotically approximating p(Some_Wacky_Model) as 0 in a straightforward sum in which we know there are already terms that are orders of magnitude larger than it, so it is in fact a regular approximation, at least at first.

              When p(Data | Some_Wacky_Model) is large and p(Data | AllTheOtherModels) is small, then even if p(Some_Wacky_Model) started out small, the posterior p(Some_Wacky_Model|Data) can be large. Then the approximation does become singular, which is exactly the case where we’ve already mentioned that you’d need to “go back to the drawing board” and include the extra model you’d previously ignored. In fact, the point of bringing up asymptotic approximations in this context is precisely to justify both why we don’t write down models that include 700 billion alternatives, and why it’s perfectly logical and justified to “go back to the drawing board” and pull in more models after your initial try didn’t work. So, yes, the failure of a perturbation approximation in a singular case is exactly the justification for “back to the drawing board”.

              Bayes solves everything that it’s intended to solve. It doesn’t solve *everything* but then calculus doesn’t solve the question of whether P=NP either…. on the other hand, it’s not supposed to.

              • OK, I think we can roughly agree on most things here, give or take.

                I just think that it then appears obvious that, as Andrew has said, “all the standard philosophy of Bayes is wrong” (and I think this includes Jaynes).

                Which is not to say it is not useful, just that it is boringly limited. Just as basic calculus is compared to analysis and the rest of mathematics.

                So, to continue the analogy, getting more people to use calculus instead of arithmetic is useful but doesn’t mean they have therefore ‘ mastered’ mathematics. The same is true of Bayes/probability theory and ‘scientific inference’.

                Just as we often fail to communicate to the public that calculus is only the very tip of mathematics and mathematical thinking, perpetuating Bayes as the pinnacle of scientific/statistical inference leads to a limited view of the full extent and complexity of what is involved.

              • Well, I’m not sure “all the standard philosophy of Bayes is wrong” or not. I don’t think most of what Jaynes says is wrong. It’s important to remember what Bayes does and doesn’t do. It does let you assign numbers that represent knowledge about whether something is or is not true by degrees. It doesn’t make it so that you can easily decide on what should or shouldn’t be considered in your models… that’s up to *science*.

              • Like everything it lets you do something on the basis of other quite strong assumptions and has little to say about whether it’s something you want to do, whether those are good assumptions or what happens if those assumptions are varied slightly etc.

                Eg ‘reasoning perfectly given strong assumptions’ is not necessarily better than ‘reasoning imperfectly given weaker assumptions’. Asking the ‘right’ questions – eg choosing what to include in a model, what sort of model you use, checking a model etc – is very often more important than how you (say) estimate parameters given a model.

              • One huge huge advantage to teaching people Bayes is that it gives you a straightforward computational tool to use pretty much *any model you think makes sense*

                The big advantage to Bayes isn’t that the math of Bayes is some kind of magic, the big advantage is this flexibility to do real science (defined as coming up with explanations for the mechanisms of how the world works, and then testing them against data). Two simple mathematical rules (sum rule, and product rule) give you all the tools you need for inference given a model. Bayes doesn’t tell you anything about how to build models of the world, but it will slot into the job of inference regardless of what models you choose.

      • The way this plays out in the Bayesian equations is exactly the way scientists used to make progress before p-values and the current crises. To summarize:

        P(observed | H_0) being low is suggestive that there’s a problem with H_0 because it makes it very easy for some other H_i to beat it, in the sense that P(H_i|observed) is greater than P(H_0|observed).

        But to actually get a very high P(H_i|observed) close to 1 you need observations which *simultaneously* make P(observed|H_i) high while making all other P(observed |H_j) for j!=i very low. You could colloquially call such observations a “severe test” of H_i if you like because of the way they truly separate H_i from the pack of possible explanations.

        To see this play out in real life consider when astronomers get a signal consistent with alien life. So P(signal|alien life) is relatively high. They then begin an exhaustive search (sometimes lasting for years) through all the other down-to-earth explanations H_1, H_2, …. which might also explain the signal.

        Often times, it’s possible to identify one of these (say H_1) as having much higher P(H_1|signal) than all the rest put together. The astronomers will then announce that there isn’t an alien signal; it was just H_1.

        Other times, P(H_i|signal) is still relatively high for several hypotheses. “Signal” just isn’t enough to clearly separate out the true cause. The astronomers will then collect new data which is a *severe* test in the way mentioned before. It separates out one hypothesis from the others by making P(new_data, signal|H_i) very high for one hypothesis and very low for all the rest. In that way they home in on the true cause and get a P(H_i|signal, new_data) close to 1. They then announce they’ve found the cause.
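The updating described above can be sketched in a few lines of Python. This is only an illustration: the hypothesis names, priors, and likelihoods are made-up numbers, and the two-step update assumes the follow-up data are conditionally independent of the first signal given each hypothesis.

```python
def update(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical numbers, chosen only for illustration.
prior = {"aliens": 0.001, "satellite": 0.5, "pulsar": 0.499}

# The first signal is about equally compatible with both mundane hypotheses,
# so it cannot separate them.
p_signal = {"aliens": 0.9, "satellite": 0.3, "pulsar": 0.3}
after_signal = update(prior, p_signal)

# Follow-up data that is likely under one hypothesis and unlikely under the rest.
p_new_data = {"aliens": 0.01, "satellite": 0.95, "pulsar": 0.01}
after_new_data = update(after_signal, p_new_data)

for label, posterior in (("signal only", after_signal),
                         ("signal + new data", after_new_data)):
    print(label, {h: round(p, 3) for h, p in posterior.items()})
```

With these invented inputs, no hypothesis dominates after the first signal, while the follow-up data push one posterior close to 1. A high P(signal|aliens) on its own does very little.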

        Frequentism-Fisher-Neyman-Pearson-P-values et al short-circuited this process completely because for purely philosophical reasons they wanted to only look at P(observed|H_0) and not consider P(H_0|observed). The latter can be low even when the former is high and vice versa! Their fundamental philosophical errors made this disaster happen. It wasn’t bad teaching or anything else. They got it wrong at a very fundamental level and refuse now to admit it.

        This is yet more reason why no one should ever listen to philosophers who couldn’t math their way out of a paper bag.

        • The amusing thing is Mayo seems to have intuited a bit of this. So she created her “Severity” concept and tried it out on literally the simplest examples possible where it happens “Severity” is identical to the Bayesian posterior and so works as I described above.

          Any competent mathematician would have then immediately looked for examples where “Severity” does something different from Bayes to see which is right. If you do so you’ll run almost immediately into examples where Bayes is obviously still doing the right thing while the Frequentist inspired “Severity” generates nonsense.

          Not doing that simple check though has the advantage that one can parlay this little bit of effort into a successful/rewording academic career. God, how I despise academia.

          • > successful/rewording academic career
            Typo of the month – or did you mean rewording as being more apt than rewarding?

            On the other hand, I do think Box could philosophize his way out of mathematical distractions and he did think one did need to step out of Bayes to check priors and data models. So did Rubin, Gelman, Evans and others…

            • My subconscious chokes whenever contemplating how much money was wasted on dumb statistical ideas.

              Yes and no with regards to Box. If by “Bayesian” you mean using Bayes Theorem specifically then you need more than that. If by “Bayesian” you mean using the equations of probability theory in full generality (including but not limited to Bayes Theorem) then no you don’t need more.

      • First, error-statistical tests aren’t equal to NHST as that’s understood. Second, you won’t want to falsify based on low posterior probability–that’s unworkable. You can try and define a test rule leading from a low posterior to a falsification, but I can’t imagine scientists wishing to follow it. We never have an exhaustive set of theories, and the “catchall hypothesis” is to be given a small prior. Of course, once you add a brand new theory, the Bayesian’s previous computations also need to be changed. By the time it gets big enough to convince you that an entirely new theory needs to be developed, the error statistician would have pin-pointed the basic lines of a new theory to try using simple significance tests alone. I don’t think Gelman would disagree with me here.

        • Ok, so as I understand it you’ll concede that NHST is not the be all end all of testing, so on that we agree.

          However, on basically everything else I think we’re going to disagree, which isn’t too surprising. Falsifying = showing that a hypothesis is false. For the statistician this will have to mean “probably false”. Cox’s theorem shows us that if we want our calculations over which statements we are testing to agree with true/false discrete boolean logic, we need to use Bayes. So “You won’t want to falsify based on low posterior probability-that’s unworkable” is literally mathematically disproven by Cox’s theorem. Or, you can reject the axioms on which Cox’s theorem rests… But I don’t. We had this discussion recently on my blog:


          The notion that “By the time it [posterior probability of low prior probability model?] gets big enough to convince you that an entirely new theory needs to be developed, the error statistician would have pin-pointed the basic lines of a new theory to try using simple significance tests alone. I don’t think Gelman would disagree with me here.”

          is quite frankly an empirical claim we should test against situations in which people started out with some understanding that was wrong and then developed alternative theories. What reasoning did they rely on? Unfortunately I don’t have a comprehensive knowledge of the history of science, but I wouldn’t bet on error statistical tests being the method by which people “pin-pointed the basic lines of a new theory”. It might be the case that error statistical tests have been used in the past to show that a detailed theory didn’t predict the right thing, but rarely would they be used to develop “the basic lines of a new theory” because error statistics are about the properties of random numbers, not the mechanisms of how nature works.

          Per Martin-Lof literally defined random sequences in terms of whether they pass or do not pass a very powerful statistical test of randomness. By this definition we arrive at statistical tests being the defining quality of infinite sequences from random number generators. I don’t really understand what the properties of abstract random infinite sequences actually have to do with anything in science (defined as the study of natural processes occurring in the world). I think this is the big misguided problem. It isn’t that statistical testing is wrong, it’s absolutely the right way to determine if a random number generator has the properties it claims to have…. the problem is that science isn’t the study of random number generators in the first place.

    • Wonks:

      Preregistration is fine. My applied work is typically more exploratory, but a good argument can be made that preregistration is valuable for exploratory work as well. Just to see what happens, maybe I’ll try preregistration for new papers I’m working on. One difficulty is that for most of my work I don’t really know how the paper is going to look until I do the research. Indeed, often I don’t start the paper until much of the research is done. Still, I could try and see.

      • For starters people should clearly declare whether their work is exploratory, like you did. But very few do.

        In any case, I think for exploratory work it’s fine not to preregister. But I think all other studies must pre-register.

  3. Since p-values occupy such a prominent role in scientific practice/publication, then _any_ criticism of NHST seems fair and consistent with the objective of improving our scientific work.

    A reviewer of a recent paper complained “There are no p-values in this paper! Not one!” as if that alone were a valid critique.

  4. Apparently observers in the U.K. had some fun last October when power poses became a talking point of Conservative members of the cabinet, some of whom adopted awkward unnatural poses.


    From the first reading, the PP claim never seemed plausible, in part because the first rule of self-help may be to find what works for you as an individual. Still, who could have guessed that proper execution of the pose is a factor. Where does a person turn to build confidence if they fail at a proper pose? Seriously, though, if a researcher knows the conclusion they want — to find the cure for lack of self-confidence and they focus on one technique to the exclusion of other mechanisms, that is a problem. And if lack of confidence is like a disease, wouldn’t serious research divide groups into
    1) Confident with competence – justified belief
    2) Confident, but lacking in competence – unjustified belief
    3) Lacking confidence, but capable – not justified belief
    4) Lacking confidence, lacking competence – justified belief
    But that would have been more exploratory.

    • +1 for considering competence as well as confidence.

      Presumably unjustified confidence led Amy Cuddy to her fiasco, while possibly lack of confidence led Dana Carney (slowly) to increased competence.

        • Thanks for the link. I note especially her comment:
          “I have confidence in the effects of expansive postures on people’s feelings of power — and that feeling powerful is a critical psychological variable.”

          I am willing to believe that she has the confidence she claims. But I am still skeptical of what she has confidence in. In particular, to me the (psychological as opposed to physics) concept of “power” is something like a religious concept. In other words, what she is trying to measure is to me such a fuzzy concept that I can’t see how it can be measured, except in ways that are so fuzzy that applying statistics to them is meaningless.

          So Andrew’s phrase of “chasing noise” seems to apply. Thus (at least in my view), she is an example of confidence without competence (because a competent researcher wouldn’t expound so enthusiastically about something that is best described as chasing noise.)

          • I agree; she seems to be chasing noise. This quote is telling:

            “The key finding, the one that I would call ‘the power posing effect,’ is simple: adopting expansive postures causes people to feel more powerful.”

            That’s cheating. It wasn’t presented as the “key finding” (or anything close) in the original paper, so it isn’t fair to call it the key finding now. She may mean that it’s the key finding across various studies; even so, she’s distorting the issue.

        • Yes, Cuddy’s statement is very sad. Unsurprising—she’s taking a page out of the Bargh/Baumeister playbook—but it’s still sad. Even when it comes to her particular study that got all the attention, she refuses to understand the meaninglessness of her p-values in the context of multiple hypothesis testing.

        • Well, I just quickly analyzed her testosterone result as that was the plot she used in her Ted talk.

          I can reproduce Fosse’s analysis exactly; compare this with p 7 of 03-ccy-replication-results.pdf:

                           Estimate  Std. Error  t value  Pr(>|t|)
          (Intercept)       53.5807      2.8236   18.976  < 2e-16  ***
          ctestm1            0.4546      0.1228    3.703  0.000775 ***
          chptreat           5.8217      2.7100    2.148  0.039133 *   <=== HERE
          cortm1            -4.9894     27.2854   -0.183  0.856028
          cortm2           142.0457     42.9056    3.311  0.002261 **
          female           -11.1731      3.5383   -3.158  0.003389 **

          However, if I add an interaction of the power pose assignment with gender, the key effect disappears:

                           Estimate  Std. Error  t value  Pr(>|t|)
          (Intercept)       54.1331      2.9062   18.627  < 2e-16  ***
          ctestm1            0.3862      0.1466    2.634  0.01288  *
          chptreat           4.8527      2.9437    1.648  0.10904      <=== HERE
          cortm1            -1.4782     27.6935   -0.053  0.95776
          cortm2           147.9227     43.6099    3.392  0.00186  **
          female           -12.5397      3.8899   -3.224  0.00291  **
          chptreat:female    2.8056      3.2548    0.862  0.39511

          So, as someone already noted on this blog, gender matters here. I may have misanalyzed this as I did it quickly, but others can check if I messed up. Here is my code:

          ## original data (not used here)
          ## cleaned data

          ## sanity check: one subject, one row

          #drop ineligible and something else as in stata code:
          datc<-subset(datc,inelig!="Ineligible (drop)" & anyoutv1!="Selected")

          ## approximately matches published summary up to rounding error

          ## correct number of subjects

          ## center all predictors

          ## Fosse result:

          ## with interaction:

          ## sorry, Andrew, for ignoring your advice

          I think that Fosse should be congratulated for doing a good job with the reanalysis, although I am very skeptical of this “data cleaning” business of using three SDs as a cutoff for removing extreme values (these should be modeled). I lack the time to delve into this, but maybe someone else wants to do this using Stan. It’s a simple multiple regression, it’s not rocket science.

          Also, I must say that most of my time spent in analyzing this data (about 60-70%) was spent just trying to load the data-set. It’s 2016 and Yihui and Hadley have come and gone so to speak; how hard is it to provide a loadable data-set in this day and age? I had to wade through a lot of junk files to find out what I needed.

          Also, Cuddy says she gave the data to a statistician. Fosse is a psychologist. Now, I know many psychologists who are basically indistinguishable from statisticians (some of them are even best friends forever), but I don’t think they would call themselves statisticians. I have an MSc in Statistics but I don’t really consider myself a statistician, because my main area of specialization is linguistics. I am sure Fosse knows what he is doing; I’m not questioning his competence. I am only questioning Cuddy calling him a statistician. Andrew is a statistician.

          My suggestion to Fosse is to (i) write an R package Hadley style and release that, (ii) make a vignette that weaves in the code and output neatly, with explanations, (iii) give more meaningful column names than anyoutv1 etc.

          • Also, assuming that the data and the analysis above are correct, I should have done the likelihood ratio test: model with all predictors vs model without the main effect of posing:

            Model 1: testm2 ~ ctestm1 + chptreat + cortm1 + cortm2 + female + chptreat:female
            Model 2: testm2 ~ ctestm1 + cortm1 + cortm2 + female + chptreat:female
              Res.Df     RSS  Df  Sum of Sq       F  Pr(>F)
            1     32  8271.2
            2     33  8973.6  -1     -702.4  2.7175   0.109

            The t-test is technically not the right way to go here to evaluate effect of hptreat.
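For readers who want to see where the F value in that table comes from, here is a small arithmetic check in Python. It uses only the numbers printed above (the two residual sums of squares, their degrees of freedom, and the t value 1.648 for chptreat from the interaction model); the variable names are mine. When a single coefficient is dropped from a linear model, the F statistic equals the square of that coefficient's t statistic, so the two tests should agree here up to rounding, which is consistent with both reporting p of about 0.109.

```python
# Values copied from the model-comparison table and the interaction-model summary.
rss_full, df_full = 8271.2, 32        # model with the chptreat main effect
rss_reduced, df_reduced = 8973.6, 33  # model without it
t_chptreat = 1.648                    # t value for chptreat in the full model

# F statistic for comparing nested linear models.
extra_ss = rss_reduced - rss_full
f_stat = (extra_ss / (df_reduced - df_full)) / (rss_full / df_full)

print(f"extra sum of squares: {extra_ss:.1f}")
print(f"F statistic:          {f_stat:.4f}")
print(f"t squared:            {t_chptreat ** 2:.4f}")
```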

            • What the heck? The testosterone is *lower* for men when they adopt the high power pose? So the effect found without the interaction with gender included is driven by *women*?

              > round(with(datc,tapply(testm2,IND=list(female,hptreat),mean)))
              High Low
              Female 45 33
              Male 65 82

              So, if I ignore gender, I get a slight increase in testosterone overall, but it’s driven by women?

              > round(with(datc,tapply(testm2,IND=list(hptreat),mean)))
              High Low
              52 48

              Oh, now I see my mistake. In their fig 3, Carney et al plotted differences of testosterone from before and after posing, not the absolute value of the testosterone.

              That gives us the big bump for men that’s expected and practically nothing for women.

              > round(with(datc,tapply(testm2-testm1,IND=list(female,hptreat),mean)))
              High Low
              Female 1 -2
              Male 12 -10

              The plot that Cuddy showed in her Ted talk had an 8 point increase overall for High power pose (they ignored gender in the plot), and about 4 point decrease for Low. In the cleaned data that Fosse created, it seems to be +4 and -4:

              > round(with(datc,tapply(testm2-testm1,IND=list(hptreat),mean)))
              High Low
              4 -4

              If that is the relevant comparison, shouldn’t the dependent variable in the linear model also be the change (the testosterone measurements after minus before treatment of low and high power) in testosterone as a function of low and high power pose? When I fit that model, this time, without the interaction with gender, in the cleaned data, I see no effect of power posing:

              > summary(m1|t|)
              (Intercept) 0.5055 3.3976 0.149 0.8826
              chptreat 6.1655 3.3737 1.828 0.0764 . <== Does not reach magical turning point
              cortm1 -32.6931 33.0813 -0.988 0.3300
              cortm2 100.6842 52.1607 1.930 0.0619 .
              female -1.4990 3.4730 -0.432 0.6688

              But now, using change in testosterone as the dependent variable, I can get the power pose main effect to come out after I take the interaction with gender into account!

              > summary(m2|t|)
              (Intercept) 0.2665 3.3600 0.079 0.9373
              chptreat 7.6745 3.5106 2.186 0.0360 * <=== BINGO!!!!! \o/
              cortm1 -32.7514 32.6705 -1.002 0.3234
              cortm2 99.5624 51.5195 1.933 0.0619 .
              female -1.2495 3.4347 -0.364 0.7183
              chptreat:female -4.5727 3.3525 -1.364 0.1818

              So, since I am done with my p-value hacking, I think I am at a point when we can go to press. Note that this is not what Fosse or Carney et al did, these are my shenanigans alone. What Carney et al seem to have done is the first model I fit above, using the second time period’s testosterone value as the dependent variable, and the first period’s testosterone value as a predictor (if I understand this correctly).

              This is indeed how we do our analyses in psycholinguistics too. We analyze the data one way, if it doesn’t work, we switch to another approach, until we hit significance. A common trick is to first try to fit a linear mixed model, fail to get a p-value low enough to publish, and then back off to repeated measures ANOVA (where you can wipe out one source of variance by aggregation) and report those values, hoping nobody will notice or complain.
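To make the "switch analyses until it works" point concrete, here is a rough simulation sketch in Python. Everything in it is an assumption for illustration: there is no true effect, pre and post scores are independent standard normal draws, a z-test with known standard deviation stands in for the regressions above, and the only flexibility is a choice between two dependent variables (post score or change score). It is not a reconstruction of any analysis of the power pose data.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(high, low, sd):
    """Two-sided p-value for a difference in means with known sd, equal group sizes."""
    n = len(high)
    z = (fmean(high) - fmean(low)) / (sd * (2 / n) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=5000, n_per_group=20, alpha=0.05, seed=1):
    rng = random.Random(seed)
    hits_single = hits_flexible = 0
    for _ in range(n_sims):
        # No true effect: both groups come from the same distributions.
        pre_h = [rng.gauss(0, 1) for _ in range(n_per_group)]
        post_h = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pre_l = [rng.gauss(0, 1) for _ in range(n_per_group)]
        post_l = [rng.gauss(0, 1) for _ in range(n_per_group)]
        # Analysis A: compare post scores (sd = 1).
        p_a = z_test_p(post_h, post_l, sd=1.0)
        # Analysis B: compare change scores, post minus pre (sd = sqrt(2)).
        change_h = [b - a for a, b in zip(pre_h, post_h)]
        change_l = [b - a for a, b in zip(pre_l, post_l)]
        p_b = z_test_p(change_h, change_l, sd=2 ** 0.5)
        hits_single += p_a < alpha
        hits_flexible += min(p_a, p_b) < alpha  # report whichever analysis "worked"
    return hits_single / n_sims, hits_flexible / n_sims


single_rate, flexible_rate = simulate()
print(f"false positive rate, one prespecified analysis: {single_rate:.3f}")
print(f"false positive rate, best of two analyses:      {flexible_rate:.3f}")
```

Even with only two correlated analyses to choose from, the rate of "significant" results under the null should come out above the nominal 5%. More forks would push it higher.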

              • My R output got messed up due to a stray less than sign. Here they are again:

                Without gender interaction, effect not significant:

                                 Estimate  Std. Error  t value  Pr(>|t|)
                (Intercept)        0.5055      3.3976    0.149  0.8826
                chptreat           6.1655      3.3737    1.828  0.0764   .   <=== FAIL /0\
                cortm1           -32.6931     33.0813   -0.988  0.3300
                cortm2           100.6842     52.1607    1.930  0.0619   .
                female            -1.4990      3.4730   -0.432  0.6688

                With interaction: (effect significant)

                                 Estimate  Std. Error  t value  Pr(>|t|)
                (Intercept)        0.2665      3.3600    0.079  0.9373
                chptreat           7.6745      3.5106    2.186  0.0360   *   <=== BINGO!!!!! \o/
                cortm1           -32.7514     32.6705   -1.002  0.3234
                cortm2            99.5624     51.5195    1.933  0.0619   .
                female            -1.2495      3.4347   -0.364  0.7183
                chptreat:female   -4.5727      3.3525   -1.364  0.1818

              • Using the change in testosterone (or cortisol) as the DV is a different analysis to using time 2 testosterone as the DV with time 1 testosterone as the DV. The difference is sometimes referred to as “Lord’s paradox”. In this case the inferences are influenced by which of these two models is chosen. You are also right that the figure in the paper does not match the regressions results reported earlier.

              • Hi Mark,

                Should the third “DV” there be “IV”? If so, it sounds as though you are remarking on the difference between difference-score change and residual-score change. You are quite right that these are not the same, either mathematically or conceptually. An error that I see frequently is to use residual-score change but interpret it as difference-score change.

              • Carol, you wrote:

                ” An error that I see frequently is to use residual-score change but interpret it as difference-score change.”

                As far as I understand the Ted talk and Fig 3 in the paper, what’s needed as DV is a difference-score change.

              • Hi Shravan,

                I was just clarifying what Mark wrote, not making a judgment on whether difference-score change or residual-score change was being used or should have been used. I have read the article and seen the Ted talk, but both were long ago. I will have to look at them again before I can respond.

              • Hi Shravan,

                I just watched the Ted talk and glanced briefly at the article. CCY’s presentation is confusing. CCY write that they did “One-way analyses of variance examined the effect of power pose on post-manipulation hormones (Time 2), controlling for baseline hormones (T1).” This is not a one-way ANOVA but either an ANCOVA (analysis of covariance) or a regression with the time 2 hormone predicted from the power pose group (coded 0 or 1) and time 1 hormone. This formulation means that change is residual-score change. But it does seem that they are interpreting their results as difference-score change. One could do a factorial 2-between (power pose coded 0,1) by 2-within (time: time 1, time 2) anova to get difference-score change. This could be performed with either classic anova or regression. I must run now; I will come back to this later.

              • Carol, yes that third DV should have been IV in my post. Sorry for the confusion. It is weird that the Carney paper presents a figure for change scores but uses the covariate approach for their analysis. Two different beasts as I said. Not surprisingly the two approaches give you different estimates of the size of the supposed power pose treatment effect. It’s all a big mess of course.
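The gap between difference-score change and residual-score change discussed in this sub-thread can be shown with a toy calculation in Python. The eight data points are invented for illustration and are not the power pose data. The sketch relies on one standard identity: in a regression of the time 2 score on group and the time 1 score, the group coefficient equals the difference in time 2 means minus the pooled within-group slope times the difference in time 1 means. The difference-score estimate is the same expression with the slope fixed at 1.

```python
from statistics import fmean


def change_score_effect(pre_h, post_h, pre_l, post_l):
    """Between-group difference in mean (post - pre)."""
    return (fmean(post_h) - fmean(pre_h)) - (fmean(post_l) - fmean(pre_l))


def ancova_effect(pre_h, post_h, pre_l, post_l):
    """Group coefficient from regressing post on group and pre (common slope)."""
    sxy = sxx = 0.0
    for pre, post in ((pre_h, post_h), (pre_l, post_l)):
        mean_pre, mean_post = fmean(pre), fmean(post)
        sxy += sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
        sxx += sum((x - mean_pre) ** 2 for x in pre)
    slope = sxy / sxx  # pooled within-group slope of post on pre
    return (fmean(post_h) - fmean(post_l)) - slope * (fmean(pre_h) - fmean(pre_l))


# Invented data: the "high" group happens to start with a higher baseline.
pre_high, post_high = [50, 60, 70, 80], [58, 64, 70, 76]
pre_low, post_low = [40, 50, 60, 70], [46, 52, 58, 64]

diff_estimate = change_score_effect(pre_high, post_high, pre_low, post_low)
ancova_estimate = ancova_effect(pre_high, post_high, pre_low, post_low)
print("difference-score estimate:", round(diff_estimate, 3))
print("ANCOVA (residual-score) estimate:", round(ancova_estimate, 3))
```

On these made-up numbers the two estimates differ by a factor of about three. They would coincide only if the baseline means were equal or the within-group slope were exactly 1.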

          • Hi, Sharavan —

            Thanks for your analysis, and your kind words. My goal was to be as transparent as possible, and to encourage robustness tests to gain insight into the data.

            I want to clarify a few things:

            (1) Read the end note 2 (p. 1368) of the manuscript: the authors claim to drop outliers; as I describe in my executive summary, there are layers of errors here, too much to describe. I included these variables because it’s what CCY seemed to have used (at times!) to stumble upon their results. Accordingly, the *v1 and *v2 suffixes for the variables refer to my identification of hormones (according to CCY) vis-a-vis the factor variables they included.

            If you use Stata, type “notes” in the console and you’ll get a list of the annotations I’ve included, clarified and carried over (or good luck with the SPSS source file!).

            (2) I’m a sociologist, not a psychologist. While I’ve been mistaken as a statistician, the training I’ve gotten has been great. I don’t know why I was called that in the press release, and I had nothing to do with that.

            (3) And I agree, many more analyses SHOULD be conducted; that was my goal in disseminating the data; to (perhaps) some confusion, I described it as a replication analysis, but not the way psychologists put it (as in running another experiment): rather, my goal was to show as transparently as possible how CCY got to where they did with their data (something they couldn’t achieve).

            (4) I encourage other re-analyses and extensions: that was my goal of creating a tidy data set, with short labels, and so on. My goal was to do what CCY could not, and to document the errors.

            If there’s a lesson to psychologists, it’s that I hope they adopt standards of computationally reproducible research.

            One of my favorite citations
            Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten Simple Rules for Reproducible Computational Research. PLOS Comput Biol, 9(10), e1003285. https://doi.org/10.1371/journal.pcbi.1003285

            • In regards to Brian Nosek’s replication project, I remember that Dan Gilbert called Gary King a statistician, although his (King’s) PhD is in political science. Perhaps this is a Harvard thing? I note that both King and Fosse work at Harvard’s Institute for Quantitative Social Sciences.

              • Hi, Carol —

                Quite humbly, it’s not a Harvard thing. I’m a sociologist. So what’s going on? I think it has more to do with how (many) psychologists perceive (some) sociologists, economists, etc. (I was trained as an economist at the Kennedy School in an inter-disciplinary program).


                – Nate

            • Yes, I stupidly attempted to open the .tab file first, but the no. of rows were unequal. Eventually I found the readme! It’s great you made everything available and the codebook is great but not what I am used to as an R user.

            • Nathan, I created a github archive and wrote up my analysis in knitr and provide a pdf with the results. It’s not perfect (I’ll add to this code and documentation over the next days) but this is roughly how I release data and code today, not how you did it. You released some 14 files. Just understanding what the structure of the files is takes time. As an outsider I just want to load your data and re-run your analysis. And all this point and click and download interface of dataverse is so 20th century.

              With my code, you just have to do the following on your command line

              git clone https://github.com/vasishth/powerpose.git

              and you have all the stuff I put up. Now just start RStudio and play and modify as you please. On twitter people used the github code and data to run a Bayes Factor analysis on the key claim.

              I couldn’t have done it without your initiative, but I just want to demonstrate a better way to do this.

              • Great, thanks, Shravan!


                Getting to the cleaned data was the most difficult part: as Gary King writes one cannot re-analyze a dataset without verifying this. That was my goal.

                The point of citing on Dataverse is to ensure transparency: like a published article, the data has a “digital object identifier” that uniquely identifies the data. Published re-analyses in academia should cite the data accordingly. In that spirit, please tell people to cite the Dataverse doi at least (again, to ensure transparency).

                You own your contributions, Shavran, but I’d love to ensure people going to Dataverse see your work. May I link to your github analyses and markdown files in the Dataverse meta-data (and readme file)? I won’t post your work directly because that’s all you.

                Glad people are finally using the data! Thanks, Shavran!

                – Nate

              • Hi Nathan,

                I believe github also allows doi’s. Besides, since neither dataverse nor github stuff is officially peer-reviewed, why do we care if it has a doi? Each has a unique online web address; I don’t really see what a doi gives me. An air of authenticity, perhaps. Maybe you can explain the added value of a doi to me.

                Feel free to link to or use it as you wish. I only did it to make the discussion more concrete; I was getting irritated that we were all criticizing Cuddy but we had no contact with the data. Then you released the data, and so it became possible to see what’s actually in this famous Ted-talk paper. It turns out, there’s nothing there, even by the usual criteria of alpha=0.05. Cuddy’s Ted talk was based on a fictional effect. She made a lot of money and got a lot of attention out of this non-effect. Someone needs to do something about that. At the very least her talk needs to be pulled from Ted’s home page.

              • Shravan:

                I too think Github is great. Also I agree that Ted is irresponsible to be promoting Cuddy’s talk; that said, I thought this back in January when I wrote that article with Fung. I also thought David Brooks was irresponsible to never retract the various false numbers he’s promoted in his writings. I don’t know what it takes for people to admit their mistakes. I guess it’s a basic cost-benefit calculation: you compute:

                A: The embarrassment from admitting that you made a mistake,

                B: The embarrassment from being publicly associated with a claim that is generally recognized to be wrong,

                And then you decide at what point the embarrassment from the error becomes so large that you bite the bullet and admit it. I suppose the Ted people have their slide rules out right now, trying to make this decision.

                Gotta know when to fold ’em.

              • Hi Nathan (you seem to be finding more and more novel ways to write my name! :),

                Sure, I’ll link to dataverse, but I don’t really buy your argument about it being permanent. Is it more permanent than github?

                I think that your use of the word “replication” is pretty non-standard. I suggest calling it a re-analysis or something like that.

              • I agree that Amy Cuddy deserves a lot of credit for making sure this data got released. She deserves no credit at all for still insisting that that particular paper’s claims hold up under your scrutiny, Nathan. She said in the NYMag article:

                “As a result of this independent audit and additional peer feedback, all the paper’s coauthors did agree that a correction was in order and together we were collaboratively and currently preparing a corrigendum (a published correction) of our 2010 Psychological Science article. (For those interested in this level of detail, the corrections concerned a small mistake in the identification of statistical outliers on testosterone and cortisol, which does not change the findings, and the p-value for one of the two reported risk-taking measures, which is .049 for the likelihood ratio test versus .052 for the chi-square test.)”

                She’s not even hiding her desperate attempt to get below the magical 0.05 criterion!!! 0.049 indeed. 0.052 indeed. One of my students remarked recently that data is just used as decoration for publicizing your beliefs. The decorative aspect of data is on full display in Cuddy’s quote above. The published correction better conclude that the 2010 paper didn’t actually have any evidence for the claim about testosterone.

                Back in 1999, I quit doing syntax using intuitive judgements because I was struck by the subjectivity and bias that this method resulted in—whoever had more tenure got to decide if a sentence was grammatical or not. You could not challenge the judgement of a full professor. I thought that objective data would give a clearer picture; that’s how I ended up in psycholinguistics. Now I’m realizing that psycholinguistics is no different from that old methodology, but we can cultivate a sense of legitimacy and superiority because we have numbers to back up our claims. It’s the same in all these other areas.

              • Shravan:

                1. It’s good to have many different sites for data and analysis. Right now, it’s my impression that Github is best, but some redundancy in the supply is a good thing. So if Nathan wants to post things on his Harvard site, or I want to post things on my Columbia site, or whatever, that’s fine too. All these options are much better than the status quo, which is no data access at all!

                2. I agree that the term “replication” is confusing here. I’ve used “duplication” to refer to cleaning and repeating an analysis on the same data and “replication” to refer to performing the analysis on new data; see here, for example. But I recognize that different terms are used in different contexts, so it may be too late to get standardization and clarity on this one.


                I haven’t checked your site, but I recommend you copy onto your Readme document the list of specifics from Dana Carney: that is, the items listed under “Here are some facts” and “Confounds in the original paper” in this document. This information supplied by Carney seems pretty much necessary for any researcher who is interested in the relation between your data and analysis, and the Carney, Cuddy, and Yap 2010 paper.

              • Andrew, I will also link to Carney’s stuff (along with Nathan’s archive on dataverse). I plan to extend the analysis further, but busy the next few days. On the weekend perhaps.

              • To provide a bit more detail on doi’s vs url’s. In practice, urls are pretty unstable. Github could go out of business or be bought, and all github urls could change overnight. Or they could change their url scheme at any time, with the same effect. This kind of thing happens all the time. DOI’s, on the other hand, can only be issued by organizations that have paid a membership fee to the DOI Foundation (I think it’s called), and have agreed to formal policies of persistence, specified here:


                Based on the social and technical infrastructure aimed at ensuring persistence, a specific doi should be *much* more stable over time than a url. That’s the hope, anyway. I’ve hit several dead doi’s, however, so not sure how well it’s working in practice…

          • Thanks for your advice as well, Shavran, but I’ll leave it up to you to expand on this work. I’ll give some advice when you don’t know a dataframe: look for the codebook.pdf! (After looking at the readme file).

            • P.S. I meant a bit more punctuation to clarify: if you don’t know the .csv files take a look at the codebook I created as pdf files. Not the best, but better than the SPSS files. Also, Dataverse lets you explore the tabular dataframes interactively.

            • Nathan, the codebook was not really well set out. The first mention for the dependent variable explains it thus:

              1. correct replication of score by CCY
              2. slight differences in thousandths decimal point

              This isn’t useful to someone looking at the data for the first time.

              • @Shravan

                I’m responding to your question about DOI’s.

                Dataverse is a permanent data archive. The term “replication” is used in the literature confusingly, so let me clarify: I had posted the data that I was able to verify as the original data, something the original authors could not do (see my executive summary I have added for clarity, which also includes the list of the main errors I identified in the analysis, based only on the assumption that the authors were correct in their model identification [the point here is only to verify the data set so I can clean it, post it for public re-analysis and scrutiny]). King accurately calls this a “replication data set” because it verifies the original data. As King points out, this is very hard, especially when the original analyses are poorly documented.

                Why cite Dataverse when you re-analyze? Transparency! ESPECIALLY if you’re re-analyzing the data in novel ways and provide insight. Also Dataverse is a data repository that encourages data citation, “maybe” a doi doesn’t cut it. On Dataverse, you KNOW the data are there and permanently available. I encourage you (and others) to read Gary King’s nice articles on data replication (in the computational sense), because it clarifies much of the logic of data citation (and clarifies the point of replication): http://gking.harvard.edu/files/abs/replication-abs.shtml

                Here’s a nice link from Dataverse on data citation:


                The key point is, whenever you re-analyze an existing dataset, it’s part of our open, transparent research culture to direct readers to verified source data so they may examine it themselves or provide additional contributions.

                If you have trouble reading the pdf documentation I recommend using “Ctrl+F” and a bit more patience. Or open the source data in SPSS if you’d like (but I don’t think you’ll find that terribly informative).

                Anyway, I’m glad my hard work to cite the data is finally getting attention. The errors I identified resulted in some harsh truths for CCY, but my main goal was to provide a cleaned data set that is available to the wider scientific community.

                Anyway, it’s a testament to Cuddy that she commissioned me to post the replication data set publicly despite the fact that the data set resulted in numerous errors that (at times) make no logical sense.

                Hope this clarifies things.

                In any case, I hope other disciplines follow the lead in political science and economics and publish their replication data.


                – Nate

              • Nathan

                > testament to Cuddy she commissioned me to post the replication data set publicly despite the fact that the data set
                > resulted in numerous errors that (at times) make no logical sense

                OK, but this is how they relate that work and its importance – “For those interested in this level of detail, the corrections concerned a small mistake in the identification of statistical outliers on testosterone and cortisol, which does not change the findings” http://nymag.com/scienceofus/2016/09/read-amy-cuddys-response-to-power-posing-critiques.html

                A small inconsequential mistake…

          • We have a paper in review that looks at how the findings reported in Carney et al change for different plausible analyses (e.g., different outlier strategies, different control variables, different models, separate analyses for men and women etc.). The variability in effect size estimates that results is pretty astonishing.
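
            (A minimal sketch of the multiverse idea described above, using simulated data rather than the Carney et al. data. The outlier rules, the two summaries, and all numbers are hypothetical choices made for illustration; the point is only that one data set yields a different estimate under each plausible-looking analysis.)

```python
import random
import statistics

random.seed(1)

# Simulated (hypothetical) change scores for two conditions. The bulk of both
# groups comes from the same distribution; each group has one extreme value.
high = [random.gauss(0, 10) for _ in range(20)] + [45.0]
low = [random.gauss(0, 10) for _ in range(20)] + [-40.0]

def trim_sd(xs, k):
    """Drop values more than k standard deviations from the mean."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [x for x in xs if abs(x - m) <= k * s]

# Each entry is one defensible-looking way to handle outliers.
outlier_rules = {
    "keep all": lambda xs: xs,
    "drop > 3 SD": lambda xs: trim_sd(xs, 3),
    "drop > 2.5 SD": lambda xs: trim_sd(xs, 2.5),
    "drop > 2 SD": lambda xs: trim_sd(xs, 2),
}

# Each entry is one defensible-looking summary of a group.
summaries = {"mean": statistics.mean, "median": statistics.median}

# Run every combination of choices and record the estimated group difference.
estimates = {}
for rule_name, rule in outlier_rules.items():
    for stat_name, stat in summaries.items():
        estimates[(rule_name, stat_name)] = stat(rule(high)) - stat(rule(low))

for (rule_name, stat_name), est in estimates.items():
    print(f"{rule_name:>14} / {stat_name:<6}: difference = {est:6.2f}")

spread = max(estimates.values()) - min(estimates.values())
print("range of estimates across analyses:", round(spread, 2))
```

            Reporting the whole table of estimates, instead of the single row that happens to cross a threshold, is what separates this from forking paths.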

              • Thanks for this! Fascinating! I haven’t read the paper yet, but scanned the R code on OSF (I think it was). I’m glad they include version controls finally.

                Often we’re confronted with model selection issues. Chris Winship writes well about this: do you see the multiverse analysis as a potential solution to the problem of model selection on the right-hand side? If so, are there advantages to a multiverse analysis over other methods of machine learning such as lasso regression?

                – Nate

          • @Shravan

            +1 for this great analysis. I really love how you invest the time to go into all the details and get your hands dirty even for a blog comment discussion.

            Too bad I’ve no creds as a linguist else I’d have applied to do a post doc with you or something! :)

        • I think that it’s astonishing that Cuddy says:

          “I also cannot contest the first author’s recollections of how the data were collected and analyzed, as she led both.”

          So she’s actually blaming Carney for however the data were gathered in Carney’s lab! It has nothing to do with Cuddy! It is amazing that Cuddy is making 60k a talk based on a study that she doesn’t take any responsibility for methodologically! (Wow, this is the first time in my life that I used three exclamation marks back to back! Xi’an, are you reading this?)

          I would love to see this systematic review Cuddy mentions, with the numbers. I want to see the funnel plot and I want to see a random-effects meta analysis. BTW, is it OK for Fiske that Cuddy mentions an unpublished systematic review to defend her position? After all, it hasn’t been reviewed yet through peer-review.

          It’s pathetic that researchers don’t maintain any uncertainty about their beliefs. At the very least one should carry around in one’s head a probability distribution representing the probability that different theories are true. How is it that every researcher is always right?

            • Thank you for the link. I appreciate Uri Simonsohn’s point that Cuddy’s TED talk has more of a motivational than a scientific quality. The problem, of course, lies in using flawed research to support one’s motivationalism (bad word used on purpose). Without the authority of such “science,” Cuddy’s TED talk would not have become so famous.

              There’s another side of the problem: Through such talks, the public gets a distorted idea of science and seeks more of the same. It becomes an industry. But I don’t need to tell anyone here this.

              I thought Simonsohn was extraordinarily gracious. Maybe his purpose was to convey a few key points and to emphasize the possibility of learning from mistakes and improving the methodology.

              • Diana:

                Also, it goes both ways.

                From one direction, the published-in-Psychological-Science stamp of approval is necessary for the Ted talk. Cuddy’s pitch is based on her credentials as a scientist.

                From the other direction, the Ted talk and the fame are big draws for researchers. The Association for Psychological Science has been aggressively promoting media connections for awhile.

              • Since I had already done the quick analysis of the Cuddy et al data (I prefer to call it that even though Carney was the first author, since Cuddy still believes in it), I thought it would be a fun after-dinner exercise to show my son (10) and wife some statistics they would understand.

                So we watched the Ted talk (I had seen it a long time ago), and afterwards we did a statistical analysis to check the claims in the talk. It was great fun and it was cool to watch them developing hypotheses to check the story. One idea that came up was to check if people with low initial testosterone values have bigger increases in testosterone; that was sort of the population Cuddy wants to help. There was a significant effect, but it was driven by one extreme value. Once you remove it, the effect is gone. It was cool to plot the correlation visually and watch it disappear once I removed a single extreme value. My son understood the point. Also, it was the subjects exposed to the *low* power condition that showed greater changes when they had low initial testosterone, suggesting that not doing the power pose may be empowering. I will put this up on github. I think my wife and our child both got into the analysis big time and had a lot of fun. I will do this again with the red-means-I-want-sex data too one day, as I have it.
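
                (A small illustration of how a single extreme value can create a correlation. The numbers below are made up for the sketch and are not the power-pose data: ten subjects whose baseline and change are uncorrelated by construction, plus one hypothetical subject with a very low baseline and a very large increase.)

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: baseline level and change score for ten subjects,
# constructed so that the two are uncorrelated.
baseline = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
change = [2, -1, 1, -2, 0, 0, -2, 1, -1, 2]

# One hypothetical extreme subject: very low baseline, very large increase.
baseline_all = baseline + [10]
change_all = change + [40]

r_with = pearson_r(baseline_all, change_all)
r_without = pearson_r(baseline, change)

print(f"with the extreme value:    r = {r_with:.2f}")
print(f"without the extreme value: r = {r_without:.2f}")
```

                With the extreme subject included, the correlation is strongly negative (low baseline goes with a big increase); without that subject there is nothing left.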

                I agree that as a motivational speaker Cuddy is great; she wove in a personal story of recovery (two actually), and got teary-eyed towards the end, and was generally quite impressive. She got a standing ovation, and the talk was indeed great. I think that if the message is that people in low-social-power situations need to be more confident about themselves, I can get behind that. It is good not to slouch, not to withdraw within oneself. But it’s also not good to be what Cuddy wants you to be: take up more space, because that leads to intellectual manspreading.

                It’s a widespread disease that especially men suffer from. She wants unempowered people to gain confidence—this is a laudable goal and maybe Cuddy can in her post-Harvard years still do something positive. But it’s not just about presence; presence or self-confidence should be a consequence of ability acquired through hard work, or something you cultivate in addition to these things, not something you cultivate *first*. And even if you have acquired presence, humility and self-criticism should be more highly valued always.

                The type of populations Cuddy is talking about are women who are in socially oppressed situations. Their problem of low self-esteem and low self-confidence is a symptom of a larger problem (how they were conditioned when they were growing up, the cultural environment, the biases we hand down to women in education, etc.), it’s not a cause. It’s like if you have a blood flow problem in the brain or the like giving you a headache, the treatment for it is to take a Tylenol to get rid of the outward symptom. This will alleviate the underlying problem. No it won’t! If you want to elevate the position of women, it is irresponsible to ask them to power-pose for two minutes. Contribute to the hard work of providing them the same conditions and advantages that men get. But that’s a much harder project to do, and impossible to implement easily.

                Cuddy said a lot of things that she shouldn’t have. Apart from the unsupported “science” in the paper we are talking about that she apparently based her talk on, she encourages something that seems to have originated in America and is now spreading like a disease in Europe. She encourages people to believe that “presence” is the main thing; not competence or hard work. She said two disturbing things:

                1. “tiny tweaks lead to big changes.”
                2. “it’s not about the content of the speech, it’s about the presence”

                Why would you believe (1)? There’s no evidence in life that anything comes easily. Why would a two-min power pose lead to life-changing improvement? One has to sweat the details over a long period to make life-changing improvements.

                I see (2) a lot in Europe. Cultivate the look, forget the details and forget about doing the hard grunt work and the slow scientific hard work that one has to do to get somewhere. This is how the Human Brain Project got funded in Europe (look it up—it’s a joke and an embarrassment to European science).

                So, when I give my Ted talk, which I guess is imminent, I will deliver the following life-hacks:

                1. If you want big changes in your life, you have to work really, really hard to make them happen, and remember you may fail so always have a plan B.
                2. It’s all about the content, and it’s all about the preparation. Presence and charisma are nice to have, and by all means cultivate them, but remember that without content and real knowledge and understanding, these are just empty envelopes that may fool some people but won’t make you any better than you are now.

                There was a reason that Zed Shaw wrote Learn Python the Hard Way and Learn C the Hard Way books. There is no easy way.

              • So is the message with testosterone boosting through power posing that women are biologically at a disadvantage because they have less testosterone? And since they would never be able to achieve the high t levels that men can, does this imply that women will always act less confident than men? I always thought the confidence thing was largely cultural conditioning. Probably taking a look at matriarchal societies is necessary. Maybe the women there would have much higher t-levels than men, or higher levels than women in patriarchal societies?

              • Shravan, that’s a terrific story about the after-dinner exercise. I appreciated your point about substance and hard work, too (and quoted the end of your comment on my blog).

                It’s so far inside the thread that I can’t reply to your comment, so I am replying to my own.

              • Shravan, my understanding of the biology of testosterone is that women are much much more sensitive to it. If you injected a woman with enough testosterone to raise her blood levels to that of just an average 16 year old male you might kill her, or at least she’d be acting very weird and sick.

                The quantity you’ll want to track is not absolute concentration but the dimensionless ratio of concentration relative to a time-averaged “normal” value for that person.

              • +1 to many points in Shravan’s long October 2, 2016 at 2:06 am post.

                Among many well-put items, “intellectual manspreading” is a phrase that needs to be added to the general lexicon.

              • In response to Shravan’s and Daniel’s comments on testosterone:

                It seems that so many things are reduced to “testosterone,” essentially as a meme. But my impression is that very little is really known about its effects in different circumstances.

                For example, a number of years ago I was interested in reading the literature on sex differences in spatial visualization abilities. A lot related to the “nature or nurture” question. Some things I remember as being particularly interesting (but I can’t vouch for remembering them correctly, nor for the soundness of the research, since that was before I got interested in statistics, although it was one thing that led to developing that interest):

                1. Some studies claimed that it was not the level of androgens that affected spatial visualization abilities, but the ratio of male to female hormones — men with (if I recall correctly) especially high ratios seemed to be especially bad at spatial visualization.

                2. It was also claimed (or speculated?) that a woman’s testosterone levels tended to be higher if she had a son or an older brother (i.e., that the androgens generated by a male fetus influence the mother’s hormones, and these in turn influence the hormones of a subsequent girl child.)

                3. Studies of women with a male twin showed that such women tended to be higher than average in spatial abilities, and that with very little training, came up to the level of their twin brothers.

                (I don’t remember any specific references, but remember that one person who did a lot of research in this area was Doreen Kimura).

              • > Simonsohn was extraordinarily gracious
                I think some of the motivation for tone is here http://datacolada.org/52

                Also (as I commented earlier), making use of a “the world has changed” argument to _forgive_ past behavior that would not be acceptable today. (Should be a literature on using this ploy in case law – named after someone with the name Bre…)

          • Meta analysis here will be hopeless.

            Cuddy mentions a number of authors who replicated the findings who are scared to publish – OK are there 1/2 as many who could not replicate who chose not to publish or 5 times as many?

            Also – how does one assess the confirmation bias – using researcher degrees of freedom to match the results of a bandwagon one wants to jump on?

            The well has been poisoned and preregistration is now absolutely necessary.

            • Still, I would like to see what the results of this “systematic review” are. I will pre-register my belief: like most psych* type researchers, their idea of a “systematic review” is to list out all significant and non-significant results they can find on a topic, then count the number of significant results, and if there are more significant results than non-significant, conclude that the effect is there. This is what people think a meta-analysis is. My field is littered with such absurdities.
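
              (To make the contrast concrete, here is a minimal sketch comparing vote counting with a random-effects meta-analysis. The five effect estimates and standard errors are invented for illustration, and the pooling uses the DerSimonian-Laird estimator as one common choice; nothing here comes from the power-pose literature.)

```python
import math

# Hypothetical studies: (effect estimate, standard error).
# Three small studies with large estimates, two large studies near zero.
studies = [(0.60, 0.28), (0.70, 0.30), (0.65, 0.30), (0.02, 0.06), (-0.01, 0.05)]

# Vote counting: tally which studies cross |z| > 1.96.
n_sig = sum(1 for est, se in studies if abs(est / se) > 1.96)
n_nonsig = len(studies) - n_sig

# Fixed-effect (inverse-variance) mean, needed for the heterogeneity statistic Q.
w = [1.0 / se**2 for _, se in studies]
fixed = sum(wi * est for wi, (est, _) in zip(w, studies)) / sum(w)
q = sum(wi * (est - fixed) ** 2 for wi, (est, _) in zip(w, studies))

# DerSimonian-Laird estimate of the between-study variance tau^2.
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects weights add tau^2 to each study's sampling variance.
w_re = [1.0 / (se**2 + tau2) for _, se in studies]
pooled = sum(wi * est for wi, (est, _) in zip(w_re, studies)) / sum(w_re)
pooled_se = math.sqrt(1.0 / sum(w_re))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

print(f"vote count: {n_sig} significant vs {n_nonsig} non-significant")
print(f"random-effects estimate: {pooled:.3f}, 95% interval ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"estimated tau^2: {tau2:.4f}")
```

              Vote counting ignores both the size of each estimate and its precision, so the tally and the pooled interval need not tell the same story. Neither method corrects for selective publication.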

              • > count the number of significant results

                Perhaps the lexicon for techniques people like using (and refuse to give up) that are not informed by good theory or logic but rather are based on convention or degenerate (in the technical sense) thinking could be “overvalued ideas”.

                “Wernicke was the first to distinguish overvalued ideas from obsessions, pointing out that they were never felt to be senseless by the sufferer … An overvalued idea could sometimes progress to full psychosis or was occasionally a manifestation of melancholia or general paralysis.” Disorders with Overvalued Ideas P.J. McKENNA http://bjp.rcpsych.org/content/145/6/579.full-text.pdf+html

                More generally the meta-analysis field is just full of them – most techniques used and promoted by meta-analysts are poorly informed by theory and extremely hard to change. (Part of the hard to change might come from the limitations the published studies create for how much improvement the better methods can provide.)

                For a recent for instance – Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis David C. Hoaglin http://onlinelibrary.wiley.com/doi/10.1002/sim.6632/full

              • @Shravan: Sadly, I suspect your “preregistration view” is the reality, whereas a meaningful systematic review needs to 1) ferret out the kind of information on process that Carney included in her statement, 2) classify the studies by quality, and 3) discount those studies (such as CCY) in which the method of data collection makes the data unfit for statistical analysis.

        • Also, what’s holding Cuddy back from re-running the Carney study after removing all the confounds Carney found (e.g., paying the subjects money, having told them they had won, before measuring them the second time could have been the cause for the increased values post-pose), and running a large sample experiment? I would also measure each subject multiple times, no-power-pose and power-pose conditions, make it a within-subject manipulation.

        • I put up the quick analysis here:


          One strange thing is, if the baseline range of testosterone for men is between

             Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
            30.99   47.81   60.56   70.47   90.37  143.60

          (in pg/l) then what does a 4 point increase through power posing mean in terms of practical significance? How does this compare with exercising for a certain amount of time? Anyone with expertise in this area reading this?

          • And how is it that some participants’ testosterone exhibited these crazy changes (both increases and decreases) over the course of an hour? We see the same thing in the Ranehill data and it makes me wonder how much the hormone measurement is characterized by measurement error.

            • If you google how to increase your testosterone, you get:

              Lose Weight.
              High-Intensity Exercise like Peak Fitness (Especially Combined with Intermittent Fasting)
              Consume Plenty of Zinc.
              Strength Training.
              Optimize Your Vitamin D Levels.
              Reduce Stress.
              Limit or Eliminate Sugar from Your Diet.
              Eat Healthy Fats.

              I see a lot of photos of profoundly ugly-ily over-muscled men. Nobody talks numbers and estimates. I want to see some numbers about what a 4 point increase in testosterone within a few minutes does to a person.

              • Shravan: While not particularly an expert on testosterone, as a (neuro-)biologist I can tell you that the effects of biomolecules are as context-dependent as anything you’d expect in a complex system like a human organism. Even something seemingly simple such as the growth hormones in an insect has context-dependent effects that are not easy to disentangle. Research in this area is roughly on the same level as that into psychological effects, fraught with uncertainty and lots of stuff we don’t know.

                Also, the tendency to oversimplify exists in media reports on biology as well. Testosterone = aggression, serotonin = “Glückshormon”, oxytocin = trust, cholesterol = bad for you, antioxidants = unreservedly good for you etc.

                As you would guess, given the large range of values, a 4-pt increase in testosterone might mean nothing at all or might mean different things for different people depending on lots of other different factors. In general, we know far too little about how levels of molecules in the body behave, how they fluctuate on different time scales, to meaningfully interpret them.

                Incidentally, that’s also a reason why when you go to the doctor, you should be very selective about what you measure in a blood test. The common idea that it’s better (or at least can’t harm) to screen for more parameters than what is medically indicated, or that “while we’re here, we might as well look at your liver hormones and triglycerides”, is misguided. What will happen is that some value or other will always be outside the “normal range” (where the reference values for “normal” change over time and are different for different test labs), and then off you go on a wild goose chase to find out what this means, and you do additional tests, and sometimes these tests will negate the previous finding or turn up other irregularities. Most often you will learn absolutely nothing from this, but will have contributed to the waste of medical resources and increased your anxiety levels for no good reason.
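
                (The arithmetic behind "some value or other will always be outside the normal range" can be sketched as follows. It assumes, as a simplification, that each reference range covers 95% of healthy people and that the measured parameters are independent; real lab panels are correlated, so treat the numbers as rough.)

```python
def prob_any_flag(n_tests, coverage=0.95):
    """Chance that a healthy person has at least one value outside the
    reference range, assuming independent tests that each cover `coverage`
    of healthy people (both assumptions are simplifications)."""
    return 1.0 - coverage**n_tests

for k in (1, 5, 10, 20):
    print(f"{k:2d} parameters measured: P(at least one flagged) = {prob_any_flag(k):.2f}")
```

                This is the same multiple-comparisons logic as in the p-hacking discussion: measure enough things and something will look abnormal.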

                OK, I lost my track here, but you know what I mean.

              • Alex, I know exactly what you mean! I have been a kidney transplant patient since 1984, and a dialysis patient since 2011. I’m a professional consumer of blood tests. The number of times my creatinine crossed the normal range, triggering total panic, over those first 25 years, was … too many to count.

                Also, my wife is diabetic, and her machine can give a difference of 100 points in measurement of blood sugar from one minute to the next, which is an impossible change in that short period.

            • Amy Cuddy said recently in NYMag, in Jesse Singal’s article:

              “As a result of this independent audit and additional peer feedback, all the paper’s coauthors did agree that a correction was in order and together we were collaboratively and currently preparing a corrigendum (a published correction) of our 2010 Psychological Science article. (For those interested in this level of detail, the corrections concerned a small mistake in the identification of statistical outliers on testosterone and cortisol, which does not change the findings, and the p-value for one of the two reported risk-taking measures, which is .049 for the likelihood ratio test versus .052 for the chi-square test.)”

              My analysis contradicts her claims. I think that the Ted talk needs to be removed from Ted’s home page.

              • “I think that the Ted talk needs to be removed from Ted’s home page.”

                Better than removal: A rebuttal or (still better) a refutation needs to be added (or replacing the original talk, with only a link to the original talk).

              • We had such a re-analysis in review at Psychological Science. After one round of revisions the paper was rejected. Still seems odd to me but Amy Cuddy was one of the reviewers (she signed her review).

              • Shravan —

                Thanks for pointing this out. As you can tell, at least, I was independent, but they weren’t looking at my replication: I found that it was p = .050 back in May.

                To be clear, as I point out in my executive summary, even the p-value was mis-reported if it **were** the correct one (this is what I mean by multi-layered forking): it should be p = .050 (to two decimal places). This is one of those things that I don’t understand except that there are 10 simples every scientist should follow.

                Update: I had my own sanity check.

                Confirmed, my replication test confirmed their reporting error back in May of 2016: p = .050

                (see file: 03-ccy-replication-results.pdf)

                Pearson chi2(1) = 3.7667 Pr = 0.052
                likelihood-ratio chi2(1) = 3.8574 Pr = 0.050
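
                (For anyone who wants to check these two p-values by hand, here is a minimal sketch. It takes the two test statistics printed above as given and uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper tail can be written with the complementary error function.)

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.
    P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

p_pearson = chi2_sf_df1(3.7667)  # Pearson chi-square statistic from the output above
p_lr = chi2_sf_df1(3.8574)       # likelihood-ratio statistic from the output above

print(f"Pearson:          p = {p_pearson:.3f}")
print(f"likelihood ratio: p = {p_lr:.3f}  (unrounded: {p_lr:.5f})")
```

                The likelihood-ratio p-value sits just under .0500, which is how the same number can be written as .049 (truncated) or .050 (rounded).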

            • Hi, Shravan —

              I’ll be sure to link your Github analyses; I think it would be posted under “related material” but I want to be sure how to do it.


              – Nate

  5. Andrew: In case the blogpost you couldn’t remember was mine, the point of the “whiff of inconsistency” wasn’t that a critic wouldn’t care. The point was that they ought to be shocked and concerned if a researcher is made to resign and give up their Ph.D, etc. based on an analysis using significance tests. Whether it’s showing data too good to be true, or providing strong corroboration that the assumptions do not hold, or lack of replication, the methods employ (often creative, often very old) significance tests and significance-test reasoning. Simonsohn’s fraudbusting is all based on such statistical tests. I recall being struck at the time that
    “The p-value critics are happy to trust p-values when they show results too good to be true and a variety of other QRPs—even when the upshot is the loss of someone’s career”.

    That said, I completely agree with you about the illicit inferences drawn in the studies you criticize; and the criticisms are based on error statistical reasoning: the impressive-looking result, far from being difficult to achieve under mere background noise alone, is quite easy to achieve even by chance alone. Error probabilities may be used to inform us about how easy or difficult it is to generate results under various assumptions about the data generation. Your multiverses are playing the analogous role.
    I know you know all this, but thought I’d clarify that main point about where the “whiff of inconsistency” enters.

  6. Andrew

    I’m not sure if this thread is still active, but I did want to thank you for suggesting Carney’s memo in Dataverse. As well, I want to underscore that even for those who don’t want to analyze the “power posing” study, I have drafted an eight-page summary of the errors in the replication analysis. These errors are numerous, (substantively) inconsequential, and contradictory; I can’t make any clear conclusions from the data but they reveal patterns that at least call into question the lead author’s approach toward data analysis: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMEGS6#

    Can We Clarify Replication?

    I think we can. I wanted to share this great article that clarifies differences between a replication analysis to verify a study (and detect errors, fraud, etc.) and robustness tests such as Shravan’s, which are to compare different models; the two are not the same, and they have different implications. Clemens (2016).

    • Nathan:

      I don’t know what you mean when you say “call into question the lead author’s approach toward data analysis.” The analysis is the responsibility of all the authors, surely?

      • Hi, Andrew:

        Total typo. I was thinking of Carney’s memo as I was writing that, but I mean CCY; totally unfair to isolate her given all three were on the manuscript and given her transparency (albeit six years later).

        Correction: “call into question CCY’s approach toward data analysis.”


        – Nate
