No, this post is not 30 days early: Psychological Science backs away from null hypothesis significance testing

A few people pointed me to this editorial by D. Stephen Lindsay, the new editor of Psychological Science, a journal that in recent years has been notorious for publishing (and, even more notoriously, promoting) click-bait unreplicable dead-on-arrival noise-mining tea-leaf-reading research papers. It was getting so bad for a while that they’d be publishing multiple such studies in a single issue (see, for example, slides 15 and 16 of this presentation, or just enter Psychological Science in the search box on this blog).

This editorial seems great. Lindsay talks about replication problems and how researchers should do better. He warns about p-hacking, noise, and the difference between significance and non-significance not being itself statistically significant. In his letter he never quite says that Psychological Science itself has published papers with weak to no statistical evidence, but I guess that’s a political thing. Best in my opinion would be to (1) acknowledge current and past problems and then (2) do better in the future.

But if the Association for Psychological Science is too constrained to do (1), I’m still happy for them to do (2).

Lindsay concludes with this upbeat statement:

The editors of Psychological Science are confident that we can reduce the rate at which Type I errors are published without compromising other values (e.g., interestingness, relevance, elegance), and that is what we intend to do.

I believe in type 1 error about as much as I believe in yoga, kings, elvis, zimmerman, and beatles, but I appreciate the general sentiment. To be more precise, I might expect a decline in interestingness but an increase in relevance.

Measurement, measurement, measurement (and design): Doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements

There is one big thing I’d add to Lindsay’s statement, and that’s measurement and design.

Lindsay does talk about low power, which you get when data are noisy, but I don’t think this is enough. I worry that readers of his note will get the impression that non-replicability is a statistical problem or maybe a procedural problem to be solved by reforms such as preregistration and minimization of p-hacking. But fundamentally I think it’s more of a problem of measurement and study design, a point I’ve been making for the past year or so in this space.

One reason so many of these Psychological Science studies are so dead on arrival is that they hinge on noisy measurements in uncontrolled, between-subject designs. That puts you right here, and no amount of preregistration or fancy statistics is going to solve your problems.

When people asked me if I thought the fat-arms-and-voting study or the ovulation-and-clothing study or the ovulation-and-voting study should be subject to preregistered replications, I said: sure, if you want to replicate these studies, go for it, but I wouldn’t really recommend wasting your time. The measurements are so noisy that such replications would be primarily of methodological interest, just to demonstrate that with new random data you’ll be able to find new random patterns.

So I’d love it if the official statement from the Psychological Science editor emphasized that performing more replicable studies is not just a matter of being more careful in your data analysis (although that can’t hurt) or increasing your sample size (although that, too, should only help) but also it’s about putting real effort into design and measurement. All too often I feel like I’m seeing the attitude that statistical significance is a win or a proof of correctness, and I think this pushes researchers in the direction of going the cheap route, rolling the dice, and hoping for a low p-value that can be published. But when measurements are biased, noisy, and poorly controlled, even if you happen to get that p less than .05, it won’t really be telling you anything.

And some other things

As noted above, I like Lindsay’s editorial. But there are a few places where I’d say things differently.

I’m loath to make these comments because I don’t want to dilute the major points I just made above, and I certainly don’t want to piss off Lindsay, who seems to be on my side in this general issue.

But ultimately I think I’m more effective when I just say what I think (at least, when it comes to my areas of expertise). So here goes. But, again, let me emphasize that in my pickiness here, I’m just trying to help, I’m not trying to get into any fights.

1. Garden of forking paths. Lindsay decries “p-hacking”: “practices that inflate the Type I error rate, such as (a) dropping subjects, observations, measures, or conditions that yielded inconvenient data; (b) applying poorly motivated and post hoc data transformations; (c) using questionable covariates; (d) suppressing mention of experiments that were conducted but ‘didn’t work’; and (e) using the optional-stopping strategy . . .” I agree with Lindsay that these are problems “whether these sorts of things are done innocently or nefariously.”

But I think he should go further. Eric Loken and I use the term “garden of forking paths” to refer to the many choices in data processing and analysis that can be taken, contingent on data. The key point is that even if you, the researcher, do only one analysis of existing data, your p-values will still in general be wrong if you could have done something different, had the data been different. It’s a Monty Hall kind of thing.

This upsets some people—they don’t like to be penalized, as it were, for analyses they didn’t do—but, sorry, that’s the logic of p-values. As Eric and I explain in our paper, the p-value is necessarily defined based on what you would’ve done. If you don’t want outsiders speculating on what you would’ve done, had the data been different, you can preregister or you can use other statistical methods. If you want to play the p-value game, you gotta play by the rules.

Anyway, I think it’s important to emphasize this “forking paths” thing. Otherwise I fear that researchers will think that, because they only did a single analysis on their dataset, they haven’t p-hacked. Just a sentence would do here, something like this: “P-values can be invalidated by p-hacking or the garden of forking paths, even when only a single analysis was performed on the existing data.”
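To make the forking-paths point concrete, here is a minimal simulation sketch in Python. It is an illustration under stated assumptions, not the analysis from the paper with Eric Loken: the two outcome measures, the known unit variance, the sample size, and the function names (`simulate_forking_paths`, `z_stat`, `two_sided_p`) are all hypothetical choices made for the example. There is no true effect anywhere. The "preregistered" researcher always tests outcome 0; the "forking" researcher also runs just one test per dataset, but on whichever outcome looks more promising in that dataset.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def z_stat(treated, control):
    """z statistic for a difference in means (assumes known unit variance, equal n)."""
    n = len(treated)
    return (mean(treated) - mean(control)) / (2 / n) ** 0.5


def simulate_forking_paths(n_sims=4000, n=20, seed=1):
    """Return (preregistered rate, forking rate) of p < .05 when the null is true."""
    rng = random.Random(seed)
    prereg_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        # Two hypothetical outcome measures, neither affected by treatment.
        zs = []
        for _outcome in range(2):
            treated = [rng.gauss(0, 1) for _ in range(n)]
            control = [rng.gauss(0, 1) for _ in range(n)]
            zs.append(z_stat(treated, control))
        # Preregistered: the outcome to test was fixed before seeing data.
        prereg_hits += two_sided_p(zs[0]) < 0.05
        # Forking: still one test, but the outcome is chosen after seeing data.
        forking_hits += two_sided_p(max(zs, key=abs)) < 0.05
    return prereg_hits / n_sims, forking_hits / n_sims


if __name__ == "__main__":
    prereg_rate, forking_rate = simulate_forking_paths()
    print(f"preregistered: {prereg_rate:.3f}, forking: {forking_rate:.3f}")
```

Under these assumptions the preregistered rate should land near the nominal 5 percent, while the data-contingent rate should be close to 1 - 0.95^2, or roughly 10 percent, even though each simulated researcher reports exactly one p-value.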

2. Moving away from “power.” I appreciate all the warnings about noisy, low-power studies. But ultimately I don’t think power is quite the right way to look at this. The trouble is that “power” is all about getting statistical significance (p less than .05), which isn’t really where it’s at. John Carlin and I discuss our preferred framework in terms of type M and S errors in our recent paper in Perspectives on Psychological Science.
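As a rough sketch of the type S (sign) and type M (magnitude) calculation, the Python function below takes a hypothesized true effect and a standard error and returns the power, the type S error rate, and the exaggeration ratio. It assumes a normally distributed, unbiased estimate and a positive true effect; the function name `retrodesign_sketch` and the example numbers are illustrative assumptions, not values from the paper.

```python
import random
from statistics import NormalDist, mean


def retrodesign_sketch(true_effect, se, alpha=0.05, n_sims=20000, seed=1):
    """Return (power, type S rate, exaggeration ratio).

    Assumes estimate ~ Normal(true_effect, se) and true_effect > 0.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of a significant estimate with the correct (positive) sign.
    p_hi = 1 - norm.cdf(z_crit - true_effect / se)
    # Probability of a significant estimate with the wrong (negative) sign.
    p_lo = norm.cdf(-z_crit - true_effect / se)
    power = p_hi + p_lo
    type_s = p_lo / power
    # Type M: how much significant estimates overstate the true effect,
    # approximated by simulating hypothetical replications.
    rng = random.Random(seed)
    significant = [
        abs(est)
        for est in (rng.gauss(true_effect, se) for _ in range(n_sims))
        if abs(est) > z_crit * se
    ]
    exaggeration = mean(significant) / true_effect
    return power, type_s, exaggeration


if __name__ == "__main__":
    # Hypothetical noisy study: small effect relative to its standard error.
    print(retrodesign_sketch(true_effect=1.0, se=3.0))
    # Hypothetical well-measured study: effect is three standard errors.
    print(retrodesign_sketch(true_effect=3.0, se=1.0))
```

In the noisy case, any estimate that reaches significance must be at least about 1.96 standard errors from zero, so it overstates the true effect several times over and has a non-trivial chance of having the wrong sign. That is the sense in which "power" alone undersells the problem.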

3. Abandoning “statistical significance.” Lindsay expresses concerns about “a p value only slightly less than .05” but I feel that the implication is that the p-value maps in some direct way to evidence. To disabuse you of this attitude, I refer you to this classic example from Carl Morris.


Overall I think this is all a step in the right direction, and I’m very happy that the editor of Psychological Science has released this statement.

Next stop, PPNAS. (Ha! That’ll be the day.)

35 thoughts on “No, this post is not 30 days early: Psychological Science backs away from null hypothesis significance testing”

  1. From the editor “As I often tell my students, “If scientific psychology was easy, everyone would do it.””

    Reminds me of the Freud quote (roughly) “There will unlikely be a scientific psychology in my lifetime, but I have no intention of changing careers”…

    Better measurement and design may likely be a rather big step for many.

  2. An excellent post.
    I would love it if Psychological Science moved away from p-values, but because the journal used p-values as a publication decision criterion in the past, they should also act to correct some pretty famous papers that simply misreported p-values to get below the .05 threshold.
    The case of Amy Cuddy doing this was discussed here a while ago and the issue is also noted on PubPeer, but there is another pretty high profile author who appears to have pulled a similar stunt in Psychological Science.

    • Mark:

      As discussed recently, I think it would be in practice impossible for Psychological Science, PPNAS, etc., to retract all their fatally flawed papers. There are just too many. And, whenever you catch a paper with a fatal flaw, the author can respond that the errors are not serious and that in any case the finding is true and has been successfully replicated many times. Even Daryl Bem made that claim! The claim was ridiculous in Bem’s case, just as it was with the elderly-priming study and the fertility-and-clothing study, but my point here is that such a claim can be made, and it will be taken seriously—that is, errors in a published article are typically not considered correction-worthy if the author can make a plausible argument that the underlying claims are true—and thus any retraction procedure can potentially be dragged out for a long time. Nobody has the resources to perform this sort of investigation for all or even many of the fatally-flawed papers out there.

      • I suspect that you are right in a this-will-never-happen kind of way but I feel that some high profile retractions or corrections would have a real effect because we need to re-establish the norm of honest reporting of data. Some of these “errors” seem entirely distinct from what you so eloquently write about (and which I fully agree with) inasmuch as they seem to represent outright data fudging.

  3. Does anyone have suggestions of books or papers about measurement that can be applied to Psychology?

    A small rant:
    I know that there are many, many books on psychometric theory. But when you try to wade in the mud of construct validity and how it is applied in research, you get stuck quickly. When it comes to defining construct validity, there is a plethora of definitions, not necessarily agreeing with each other, and many more ways to apply them in practice. What constitutes a ‘valid measure’ is also quite hazy: it usually means that a questionnaire with many items measured on an ordinal scale has high Cronbach’s alpha, some sort of coherent factor loading matrix and maybe some high correlation to similar questionnaires. Sometimes there are also IRT models. When the scale has a direct, applied use, the ‘criterion’ validity usually makes a good case for it (i.e., how well the score in a scale predicts major depression, etc.).

    In summary: there are many ad hoc ways to “validate” a psychological measure, most of which I find lacking. Never mind the mind projection fallacy that is the rule when interpreting the results of a factor analytic or IRT model, or the usual hunt for significance.

    So, I would like some outsider’s perspective on what could be a good measurement and how well the properties of a good measure could be applied to Psychology.

    • I find the exact same phenomenon. In my experience, investigators most often use ‘reliability’ as a criterion for a good measurement technique, with the natural conclusion that measurements that appear the same over and over must be better than one that varies. Thus, self-reported outcomes, which appear highly replicable, must be better than ‘objective’ measures (e.g. blood pressure), which vary over time. In the absence of a gold standard, which is most often the case, evaluation of measurement protocols is extremely difficult and, I think, unbelievable.

  4. Andrew, you say “If you want to play the p-value game, you gotta play by the rules.”, and so I am driven to ask, what are those rules?

    I think that if you set out what you see as “the rules” you will find that you are constructing a Neyman-style hypothesis test procedure that does not need P-values at all because it is constructed to provide a set of rules for behaviour that control the rates of type I and type II errors. If you want rules for producing P-values that support an evidential interpretation then you need a different set of rules from those implied. Of course, the rules for reacting to an evidential P-value need to be more subtle than declarations of significant/not significant.

    It should never be the responsibility of an evidentially derived and interpreted P-value by itself to protect against mistaken inferences. Confirmatory studies with pre-planned analysis should be mandatory follow-ons from the preliminary, exploratory studies that are commonly presented as if they were definitive, designed studies.

    The problems cannot be communicated to non-sophisticated users of statistics until we get the ideas and terminology clear.

    • Michael:

      The p-value is Pr(T(y_rep) >= T(y)) under the null hypothesis, and the rule is that T(y_rep) has to be defined for any y_rep that could’ve occurred under the null, which in turn requires an assumption of how the data would be processed and analyzed had the data been different. That is, the function “T” (the “test statistic”) encodes all the researcher’s degrees of freedom. This is the garden of forking paths. People don’t always recognize this point, instead thinking that T is defined based only on what was done with the data at hand. By “If you want to play the p-value game, you gotta play by the rules,” I mean that if you want to summarize your inference by p-values, you must have a model for what you would’ve done under other possible datasets you might have seen.

      • OK, that is one way to define a P-value. It is not the only possible definition, and it is not the standard definition because the “null hypothesis” in that definition is a global sort of null roughly equivalent to the hypothesis that nothing interesting has happened. I think it is a poor definition because it precludes the P-value from standing as an index of the evidence in the data.

        This is an extract from my upcoming commentary on the soon to be published ASA statement on P-values:

        The choice of analytical procedures should be informed by the nature of the study because if you restrict your attention to answering the first question [What do the data say?] you can identify the areas where cherries are most numerous and ripe without picking them. Data from preliminary or exploratory studies intended to determine fruitful directions of enquiry can be interrogated repeatedly and intensively and results can sensibly be assessed and communicated on the basis of observed P-values, even if the study involves many comparisons, even if the comparisons are unplanned, and even if the sampling rules were ill-defined or flexible. No `correction’ of those P-values for multiplicity of comparisons is necessary—or desirable—because what the data say about one hypothetical effect is not influenced by whether the analyst sees what the data say about another hypothetical effect. In contrast, if those same P-values were used with hypothesis testing procedures to provide the basis for decisions regarding hypotheses then claims of `cherry picking’ and `P-hacking’ would usually be correct. A pre-study power analysis is required, and all of the comparisons to be made must be included for the loss function to be correctly calibrated. Thus P-values used within a hypothesis test decision procedure often need adjustment to take the actual experimental design into account lest the statistical support for decisions or actions is weaker than claimed or implied because of a higher than reported risk of false positive outcomes. Exploratory studies should not be misrepresented as planned studies yielding answers to the third question [What should I do or decide now that I have these data?].

        • Michael:

          No, this is the only definition of p-value. If you want it without symbols, we can quote wikipedia: “More specifically, the p-value is defined as the probability of obtaining a result equal to or ‘more extreme’ than what was actually observed, assuming that the null hypothesis is true.” What I’m calling T(y_rep) is what wikipedia is calling “obtaining a result.”

          Your comment is all about what p-values are used for, and that’s fine, I’m just talking here about the definition.

        • Wow, I wasn’t expecting Wikipedia! That definition is more conventional. I note that it does not specify what the null hypothesis is, or should be. Your preferred global null is not the only possible null, and my point is that your global null precludes the useful evidential interpretation of the resulting P-value.

          The P-value can be defined in terms of probability, in terms of quantiles of model-predicted results, and in terms of decision errors. It is usually defined without specification of the scope of the null hypothesis, without mention of statistical models, and without specification of the model parameter(s) that provide the null hypothesis. You are brave to make a claim that your definition is the only one.

        • Michael:

          I don’t always agree with wikipedia, but I think they nailed this one just right. You say the p-value is usually defined “without mention of statistical models” but a statistical model is implicit in the phrase, “the probability of obtaining a result . . . assuming that the null hypothesis is true.” The implication is that the null hypothesis can be used to determine probabilities. I call that a statistical model.

        • Yes, the model is implied by the use of probability, but the model is not specified. Thus a simple model where the null is a single parameter value relating to a particular comparison of interest (i.e. one of the garden paths) can yield a P-value that is `correct’ according to the wikipedia definition, without reference to the other potential garden paths. Your definition, in attempting to encode “all the researcher’s degrees of freedom” would entail a different P-value. Thus your definition does something different from the Wikipedia definition. I think it’s a different definition.

        • A (in this case one-sided) p value is always and only:

          p = 1/N * SUM((T(RNG(H,i)) > T(D) ? 1 : 0), i=1..N)

          for N extremely large (let’s say 10^100) and RNG a random number generator that produces vectors of “pseudo-data” of length equal to length(D) and determined by a given “null hypothesis” denoted H and T a test statistic function that maps vectors similar to D onto a real number. (Here I’m using the C syntax a ? b : c to mean: if a is true then the value of the expression is b, otherwise the value is c.)

          In other words, p is conditional on the definition of H, T.

          there is no “global” or “alternative” H, for every p value calculated, there is one particular H and T relevant to that p.
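The formula in the comment above can be sketched directly in Python. This is a minimal illustration, not the commenter's code: `monte_carlo_p` and `draw_standard_normal` are hypothetical names, N is a few thousand rather than 10^100, and the particular choice of H (independent standard normal draws) and T (the sample mean) is an assumption made only for the example.

```python
import random
from statistics import mean


def monte_carlo_p(data, draw_null, test_stat, n_draws=5000, seed=1):
    """One-sided Monte Carlo p-value: the frequency with which pseudo-data
    generated under H give a test statistic larger than the observed one."""
    rng = random.Random(seed)
    observed = test_stat(data)
    exceed = 0
    for _ in range(n_draws):
        # RNG(H, i): a pseudo-data vector of the same length as the data.
        pseudo = draw_null(rng, len(data))
        # (T(pseudo) > T(D) ? 1 : 0)
        exceed += test_stat(pseudo) > observed
    return exceed / n_draws


def draw_standard_normal(rng, n):
    """Hypothetical null H: n independent Normal(0, 1) observations."""
    return [rng.gauss(0, 1) for _ in range(n)]


if __name__ == "__main__":
    # Made-up data, used only to show the call.
    data = [0.8, -0.1, 1.3, 0.4, 0.9, -0.5, 1.1, 0.2, 0.7, 0.6]
    print(monte_carlo_p(data, draw_standard_normal, mean))
```

The function signature makes the dependence explicit: change `draw_null` (H) or `test_stat` (T) and the same data yield a different p.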

        • So, next I’ll use P to mean probability whereas lowercase p just means “p value, as calculated above”

          Suppose that D is the given data, H a choice of null hypothesis, T a choice of test statistic function, and K knowledge that we have about how the researcher chose H and T.

          P(p < 0.05 | H,T,D,K) = P(p < 0.05 | H,T,K) P(H,T | D,K)

          note that | H,T does not mean "given H and T are true" but rather "given H and T were chosen for the calculation of p!"

          we're now free to model the effect of our knowledge K on the probability that p < 0.05.

          If we have K = "I know that the researcher grabbed D, chose a variety of H,T until they got p < 0.05 and then published that result" then P(p < 0.05 | H,T,D,K) = 1 independent of anything else.

          if we have K = "I know that the researcher looked at D, thought about what different things might be relevant to test given how D looked, and then chose an H, T without doing any explicit calculations and then published the resulting p value" then we need to assign a different probability, and it will depend a lot on K and D and what we know about what the researcher knows about how to do statistics.

          if we have K = "independent of what happened with D, the researcher had pre-specified an H,T to use before collecting the data, and we know that actually D really IS produced by a random number generator specified by H,T" then we are probably forced into P(p < 0.05 | H,T,D,K) = 0.05

          that's the only situation where p is a number that matters to a Bayesian
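A small simulation can illustrate the two extreme cases of K described in the comment above. This is a hedged sketch with several assumptions that are not in the comment: the data really are generated by H (standard normal), T is a one-sided z statistic, and "trying a variety of H,T" is modeled crudely as a fixed number of independent looks, which overstates how independent real alternative analyses would be. The names `one_sided_p` and `rate_of_small_p` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p(sample):
    """One-sided p-value for H: observations are iid Normal(0, 1)."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 1 - STD_NORMAL.cdf(z)


def rate_of_small_p(n_looks, alpha=0.05, n_sims=3000, n=10, seed=1):
    """Frequency with which the smallest of n_looks p-values falls below alpha.

    n_looks = 1 corresponds to a pre-specified H, T;
    n_looks > 1 stands in for hunting across analyses until something works.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = [
            one_sided_p([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(n_looks)
        ]
        hits += min(p_values) < alpha
    return hits / n_sims


if __name__ == "__main__":
    print("pre-specified:", rate_of_small_p(n_looks=1))
    print("20 looks:     ", rate_of_small_p(n_looks=20))
```

With a single pre-specified look the rate should sit near 0.05. With 20 independent looks it should be near 1 - 0.95^20, about 0.64, and it approaches the comment's limiting value of 1 as the number of looks grows without bound.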

        • rewriting my math into words: a p value is a frequency with which a particular random number generator described by H produces pseudo-data values that make a function T have a value “more extreme” than the value T takes on when applied to the actual data D.

          In the second comment, we can rewrite as: “the probability that a given research agenda will produce a p value less than 0.05 depends on the details of the research agenda, and our model for the choice of H,T, and what we know about how D is actually generated”

          so, probability that p < 0.05 depends on "researcher degrees of freedom" such as P(p<0.05 | H,T,D,K) = P(p < 0.05 | H,T) P(H,T | D,K) where the first term is our model for how p will turn out given that H,T were chosen (and also given what we know about what is really going on) and the second term is our model for the internal thoughts of the researcher!!

        • >”that’s the only situation where p is a number that matters to a Bayesian”

          Say I measure X multiple times at a number of locations and want to make a map (2D matrix) of this data, but have a different number of measurements at each location. I don’t want multiple maps, one with the mean and the others with variance and sample size in each cell, but I do want to consider the uncertainty around the mean value. I can combine the mean/N/var information into a single summary statistic at each cell, the p-value, and make a map of those values. By looking at this single map I can then see which regions are likely to yield large X.

          This seems like a perfectly legitimate use of p-values.

        • Daniel:

          Your comments make me think folks should be forced to emulate inference procedures before learning about them mathematically – to emulate (simulate) data one has to specify a (probability) model and an explicit map from D to T and for it to be a realistic/relevant emulation one has to consider K.

          > internal thoughts of the researcher!!
          Purposeful thoughts of the researcher would be more in line with the Peirce/Wittgenstein/Ramsey view that logic involves purposes.

          Sometimes people even think the distribution of p-values under H0 is _defined_ to be uniform – rather than just a hope that H,T,D and K were just right.

        • Anonuoid:

          In the case you’re mentioning, typically what you have to do is specify a model class, and then fit the parameters mean, and standard deviation for example, to the data. At that point, your generative model is approximately true, (ie. the data “really is, approximately” generated by the RNG specified by H). So, then we’re in the last case of my comment.

          And yes, filtering data down to select out the “unusual” cases based on a model for how the data is generated (ie. NOT a “NULL” hypothesis, but a kind of surrogate approximation to reality) is pretty much the main legitimate use of p values. The other one is testing random number generators where you’re supposed to be forcing the “null” hypothesis to be true.

        • Daniel wrote: “typically what you have to do is specify a model class”

          That is rather arbitrary in the way I am using the p-value here. What I mean is just get a t-statistic by calculating mu*sqrt(n)/sd and scale to be between zero and one by using the tail probability beyond that value of some distribution, the t distribution should work fine.

          The p-value is just being used because it is a single statistic with an expected value for any given combination of mu,n,sd. For example, from this Chernoff plot (R package: aplpack) it is clear that p tends to increase with mu and n but decrease with sd.

          p.s. Has anyone ever seen one of these Chernoff plots in the wild?

        • Anonuoid: you’re talking about doing a probability integral transform to take an unconstrained variable and put it on a 0,1 scale, without regard for trying to enforce a uniform distribution on the 0,1 scale…

          since it’s one to one and onto, it’s just equivalent to using the t value itself. Without fitting the t distribution parameters in some way you will get funky results. For example, suppose you have a measurement in meters that ranges from -100 to 100, and you map it through a standard say normal distribution, the “scale” here is 1, so you’re going to be getting out p values that are either 0, or 1 and hardly anything else. If you rescale the normal to have standard deviation say 50, you’ll get a range of p values from 0 to 1 and including things in the range 0.5 etc
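The scale problem described in the comment above is easy to see numerically. The sketch below uses hypothetical measurements from -100 to 100 meters in steps of 25 and pushes them through a normal CDF with standard deviation 1 and then 50; the function name `transformed_values` is an assumption for illustration.

```python
from statistics import NormalDist


def transformed_values(measurements, scale):
    """Map raw measurements onto (0, 1) through a Normal(0, scale) CDF."""
    dist = NormalDist(mu=0, sigma=scale)
    return [dist.cdf(x) for x in measurements]


if __name__ == "__main__":
    # Hypothetical measurements in meters.
    measurements = list(range(-100, 101, 25))
    for scale in (1, 50):
        values = transformed_values(measurements, scale)
        print(scale, [round(v, 3) for v in values])
```

With scale 1, every measurement other than exactly 0 lands at essentially 0 or 1. With scale 50, the values spread from about 0.02 to 0.98, which is the behavior the comment describes.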

        • >”you’re talking about doing a probability integral transform to take an unconstrained variable and put it on a 0,1 scale, without regard for trying to enforce a uniform distribution on the 0,1 scale…you’re going to be getting out p values that are either 0, or 1 and hardly anything else.”

          I agree now, an arbitrary distribution would not work well.

        • I think Andrew is claiming that that is the only definition under which the p-value is guaranteed to have its advertised properties, ie a test of p < alpha has size (Type I error rate) of <= alpha.

          There are other ways of defining p-values, but then they are only "nominal" p-values with no frequentist guarantees.

        • Yes, I guess that is what he means too. However, a P-value need not have an unconditional frequentist guarantee in order to be a `proper’ P-value. P-values are not error rates. That’s why we have alpha and `test size’ as parts of the hypothesis test design.

          It is wrong to say that a P-value has “advertised properties” because those advertisements misconstrue the actual nature of P-values.

        • I’m having a hard time understanding what alternative definitions of a p-value were suggested by Michael Lew and Dean Eckles.

          Andrew says: p-value = Pr(T(y_rep) >= T(y))
          Daniel Lakeland says: p = 1/N * SUM((T(RNG(H,i)) > T(D) ? 1 : 0), i=1..N)
          Mayo and Cox (in [1]) say: p-value = p(t) = P(T > t; H0)

          Notation quirks aside, they seem to refer to the same quantity: tail areas of the distribution of a test statistic beyond an observed value. Mayo and Cox’s notation makes it clear that this distribution is defined under some null hypothesis. Is there any other way to clearly define a p-value? If there is, I would say it’s not a p-value anymore.

          Another thing entirely is how we use this quantity to make inference. Inductive behavior with known long-run Type I error rates? (Mis)fit summary under a hypothetical distribution? Inductive inference based on the evidence against the null?

          Mayo’s evidential reasoning states that “y is evidence against H0 if and only if a less discordant result would have occurred if H0 correctly described the distribution generating y.” Evidential reasoning also depends on the distribution of the test statistic, which is only well-defined and meaningful if it does not change based on the data.

          Cox states very clearly that an unexpected p-value and “high statistical significance on its own would be very difficult to interpret, essentially because selection has taken place and it is typically hard or impossible to specify with any realism the set over which selection has occurred”. If you are doing exploratory work, why use p-values at all to select what is interesting and what isn’t?


        • Erikson: I defined it my way to emphasize two facts:

          1) it depends on H, T, and D
          2) It’s a *frequency* with which a particular random number generator produces a certain kind of fake data.

          for Mayo, all probabilities are frequencies, so part (2) is implicit

          Andrew’s notation is ambiguous on the points (1) and (2)

  5. “Measurement, measurement, measurement (and design): Doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements

    There is one big thing I’d add to Lindsay’s statement, and that’s measurement and design.

    Lindsay does talk about low power, which you get when data are noisy, but I don’t think this is enough. I worry that readers of his note will get the impression that non-replicability is a statistical problem or maybe a procedural problem to be solved by reforms such as preregistration and minimization of p-hacking. But fundamentally I think it’s more of a problem of measurement and study design, a point I’ve been making for the past year or so in this space.”


    It’s not just a problem with psychology. The other day, I accompanied a friend to their cardiologist appointment. The friend mentioned to the cardiologist that I was interested in statistics, including some of the problems with medical research. The cardiologist said, “We always hire a statistician before we publish our research.” I replied that one part of the problem is that the statistician needs to be hired before the study is conducted, to be involved with the design and implementation of the study.

  6. I agree with most of what Gelman says; I’ve been stressing that the central problem is the artificial experiments and proxy variables in social psychology and related areas for years now. It’s almost never taken up by the replicationists.
    However, Gelman seems to suggest that you can escape the perils of cherry-picking, data-dependent selections, and assorted biasing selection effects if you just switch to a methodology that is unable to pick up on them. I see this as precisely the problem with those methods (e.g., likelihood ratios, Bayes factors, Bayes updating). Since Savage (1962) or before we’ve been hearing of the “simplicity and freedom” that has been lost by stressing frequentist error probabilities. Granted, if you don’t care about controlling the frequency of erroneous interpretations of data, then you don’t care about gambits that diminish these error probing capacities. See “who is allowed to cheat”.
    For some of us, that’s scarcely a selling point for relying on the inferential method in science.

    • Mayo:

      No, I don’t think that you can escape the perils of cherry-picking, data-dependent selections, and assorted biasing selection effects if you just switch to a methodology that is unable to pick up on them.

  7. I was reminded of this post when I just read the most recent evolutionary psychology offering from Psychological Science.

    Here the authors hang their entire hat on an interaction effect that is incredibly weak (delta R-squared =.01, but p =.048) and that appears to only be present when you venture fairly far down the garden of forking paths (e.g., controlling for all sorts of variables, and performing square-root transformations of variables just because they were positively skewed). The editor of Psych Science may be preaching one thing but the practices of his journal continue to be quite the opposite.
