There’s something about humans

An interesting point came up recently. In the abstract to my psychology talk, I’d raised the question:

If we can’t trust p-values, does experimental science involving human variation just have to start over?

In the comments, Rahul wrote:

Isn’t the qualifier about human variation redundant? If we cannot trust p-values we cannot trust p-values.

My reply:

At a technical level, a lot of the problems arise when signal is low and noise is high. Various classical methods of statistical inference perform a lot better in settings with clean data. Recall that Fisher, Yates, etc., developed their p-value-based methods in the context of controlled experiments in agriculture.

Statistics really is more difficult with humans: it’s harder to do experimentation, outcomes of interest are noisy, there’s noncompliance, missing data, and experimental subjects who can try to figure out what you’re doing and alter their responses correspondingly.


  1. I’m not sure it has anything to do with whether one uses classical methods or not. With human subjects one just can’t get a stable signal, period. In my research, I can’t even replicate simple, classical effects that I believe are probably true because there is so much prior work out there. And I am a T-shirt-wearing Stan user.

  2. Eric Loken says:

    I don’t know about “start over”. There’s plenty of good empirical work in the human sciences that doesn’t depend on erroneously calibrated p-values. But I do think that unacknowledged variation in effects affects the interpretation of p-values in standard research. And such variation also undermines the interpretation of an average causal effect as representing a mechanism or process. It’s an abstraction, and as you’ve said many times the process could actually operate in different directions for different people. Nutrition research is a good example – yes, lots of universals, but also likely lots of treatment-by-person interactions given vastly different personal developmental histories.

  3. Tom says:

    I’ll agree that humans have some behavioral aspects that can create difficulties but I don’t think that this nullifies the point that Rahul made. Taking the p-value methods developed in agriculture, if you are testing the effect of using fertilizer vs not using it, then I imagine that the signal is going to be greater than the noise by a reasonable amount. However, if you are doing something like testing two different sources of fertilizer and there are limited resources to do this with, then this will put you back in the realms of low signal and high noise, so the question on the usefulness of p-values remains.

    If p-values are ‘bad’ science in one research area I struggle to see what makes them ‘good’ science in others – is it not the principle that is flawed rather than the application?

    • Martha says:

      Tom —

      I’d say it’s the fit of the principle to the application that needs to be considered, rather than the principle or the application separately.

      For example, one case where p-values (or something essentially equivalent such as “more than two standard deviations from the mean”) do seem appropriate is in quality assurance in industry — where you are indeed doing repeated sampling (e.g., a batch of the output every hour, day, or whatever) and making a decision on the results (e.g., literally accept or reject the batch; or shut down the process to see if something needs fixing).

    • Actually, Fisher had something to say about the irrelevance of p-values:

      “It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly “significant,” in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.”

      Fisher in 1937, p. 16 of The Design of Experiments, Second Edition

      To me this sounds like an argument for replication rather than p-values for making decisions.
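      Fisher’s criterion (“rarely fail to give us a statistically significant result”) is, in modern terms, a statement about power under replication. A tiny illustrative calculation, with an invented per-experiment power of 0.8:

```python
# Invented numbers for illustration: assume each well-run replication has
# power 0.8 against a real effect, and the usual 0.05 threshold.
power = 0.8          # assumed P(p < 0.05 | real effect)
alpha = 0.05         # P(p < 0.05 | no effect)

# Three independent replications:
p_all_real = power ** 3    # all three significant if the effect is real
p_all_null = alpha ** 3    # all three significant by luck alone
print(p_all_real, p_all_null)
```

      A real effect measured with adequate power survives replication about half the time here, while a chance fluke surviving three replications is a roughly one-in-eight-thousand event.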

    • Nick Menzies says:

      If we are talking about decision-making, then a conventional p-value cut-off will only be appropriate in the same way that a stopped clock will sometimes tell the right time.

      In Martha’s example, I would bet the firms in question are making a calculation where they are optimizing the accept/reject rule based on the frequency of false positive and false negatives produced by possible rules and the implications of each. This will produce a simple rule which might correspond to a particular p-value, but I would just interpret the p-value as a side-effect.
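      Nick’s framing can be sketched numerically. In this made-up example (the distributions, costs, and defect rate are all invented), the firm scans candidate cutoffs for a batch statistic and keeps the one that minimizes expected loss; the cutoff’s tail probability under “good” then looks like a significance level, but only as a by-product:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Invented setup: a batch statistic is N(0,1) for good batches, N(3,1) for bad.
p_bad = 0.05            # assumed fraction of bad batches
cost_ship_bad = 100.0   # assumed loss from shipping a bad batch (false negative)
cost_dump_good = 10.0   # assumed loss from rejecting a good batch (false positive)

def expected_cost(cutoff):
    false_pos = 1.0 - norm_cdf(cutoff)      # P(reject | good)
    false_neg = norm_cdf(cutoff - 3.0)      # P(accept | bad)
    return ((1.0 - p_bad) * false_pos * cost_dump_good
            + p_bad * false_neg * cost_ship_bad)

# Scan candidate cutoffs on a grid and keep the cheapest one.
best_cutoff = min((t / 100.0 for t in range(0, 401)), key=expected_cost)
implied_p = 1.0 - norm_cdf(best_cutoff)     # tail area under "good": looks like an alpha
print(best_cutoff, implied_p)
```

      With these particular invented numbers the cost-optimal cutoff happens to imply a tail probability near the conventional 0.05, but change the costs or the defect rate and the “alpha” moves with them.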

    • Rahul says:

      If p-values cannot be trusted whenever a low SNR exists, fine: Let us develop quantitative rules & guidelines about when p-values are not OK and then use those specific metrics to decide whether p-values are appropriate or not on a study-to-study basis.

      Rather than adopting broad rules by subject area, e.g., OK in agriculture but not OK in human studies.

      I think that’s painting with a brush too broad.

      • Martha says:

        “Let us develop quantitative rules & guidelines about when p-values are not OK and then use those specific metrics to decide whether p-values are appropriate or not on a study-to-study basis.”

        This sounds a lot like trying to create “the exact button to push.”

  4. Rahul says:

    I dread that we reach a situation where people post hoc selectively disbelieve p-value based studies when they see conclusions they do not like. e.g. the Bem study or the red-fertility study.

    i.e. the validity of the methods ought to stand independently of the ridiculousness of the particular conclusions.

    • Andrew says:


      Nobody (or, at least, not me) is talking about “selectively disbelieving p-value based studies when they see conclusions they do not like.” The point is the reverse: Bem, Kanazawa, etc., are making implausible claims whose only evidential support is the p-value. My point is that, no, I don’t think a statistically significant p-value should (in the notorious words of Daniel Kahneman) force me to “have no choice but to accept” these claims.

      That is: I’m not saying the p-value is wrong, therefore the theories are false. I’m saying that, for all the many reasons we’ve discussed, the p-value does not represent strong evidence in favor of the theory. If they can convince me in some other way, fine: via a direct experiment or with some clean data. Otherwise all they have is a theoretical hypothesis and a bunch of numbers with no clear connection between them.

      • Rahul says:


        Suppose you had been asked to review the methods section of the Bem (or Kanazawa) paper before he had conducted any experiments at all.

        Would you have rejected the study outright on the grounds that it was obviously going to produce too low an SNR?

        • Andrew says:


          What I would’ve done at the time, I don’t know, as I’ve become much more sensitized to this issue during the past few years. But, that said, yes, I would recommend rejection of such papers on the grounds that the measurements aren’t up to the job being demanded of them.

    • Nick Menzies says:

      I think you are implying ‘don’t like’ could reflect some inappropriate criterion, e.g., political leaning, but it could be reasonable if one thinks about the fit between the study result and existing knowledge.

      If one believes that newly published papers will fall on a distribution in terms of how much they oversell the certainty of their results (via p-values or other means), then — absent any other effects — those that disagree more strongly with the prior will have a higher probability of having substantially oversold their results.

      This doesn’t need to imply that the methods used are bad in general, but it’s probably a good thing that the methods (as applied by that particular study) get greater scrutiny.

    • Bill Jefferys says:

      I like to think of this in the context of the fact that it would take a whole lot more evidence to convince me that something like what Bem has claimed is true than for more mundane things like deciding on the guilt of someone on trial for a criminal offense. In class I have entertained the following thought experiment: Someone claims that he can influence a fair coin of yours that you toss (not him, he never touches it) so that it always comes up heads. You toss it and it comes up heads. You toss it again, and again it comes up heads. How many times in a row would you have to do this before you became convinced that somehow this person could actually accomplish what he claims? 10? 100? This can then be used to estimate a personal prior.
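      Bill’s thought experiment has a neat closed form. If the claim is taken as “heads with probability 1,” each straight head multiplies the odds in favor of the claim by 2, so the number of heads you would demand before calling it even money reveals your prior odds (a sketch; the specific numbers below are illustrative):

```python
# Under the claim, n straight heads has probability 1; under a fair coin, (1/2)**n.
# So the Bayes factor for the claim after n straight heads is 2**n.
def implied_prior_odds(n_heads_to_convince):
    """Prior odds (claim : fair coin) implied by needing n straight heads
    before your posterior odds reach 1 (even money)."""
    return 2.0 ** (-n_heads_to_convince)

# Demanding 10 heads implies prior odds of about 1 in 1000;
# demanding 20 heads implies prior odds of about 1 in a million.
print(implied_prior_odds(10), implied_prior_odds(20))
```

      Running the elicitation backwards like this is what turns “how many tosses would it take?” into an estimate of a personal prior.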

      In the case of criminal trials, there’s also a loss since you have to evaluate the loss if the wrong decision is made (either convicting someone who is actually innocent – loss is unjust imprisonment, or acquitting someone who is actually guilty – loss is that another crime might be committed and lack of closure for the victims).

      The loss for falsely believing what Bem or Kanazawa publish, or falsely disbelieving (in case they were actually detecting something real) is less obvious to me.

      • Andrew says:


        The loss for believing Bem is, I believe (without any evidence!), a general degrading of trust in the scientific method. The loss for believing Kanazawa is similar but I’d argue worse in that Kanazawa links his scientific claims to aggressive political stances which can then be picked up by more neutral news outlets such as Freakonomics.

        • Bill Jefferys says:

          OK, but what about not believing Bem (or Kanazawa) if in fact the theory is correct? You need a complete loss function to do a decision-theoretic analysis, whether the error is Type I or Type II.

          That’s what I mean when I say that it is less obvious to me.

          In a trial situation, I can imagine that a juror could consider the losses in terms of (first) how evil it is to exact a penalty against a defendant who is actually innocent. When I taught in Texas, my classes would sometimes come up with loss functions that allowed the death penalty, if the probability of guilt was high enough. Since teaching the same class in Vermont, that has never happened. It may be that it has to do with the fact that Texas is a death-penalty state and Vermont is not; or it may be that I’ve presented this dilemma more clearly in later semesters (the Texas classes that devised loss functions that allowed for the death penalty were in the first few times I taught that class).

          I note a recent situation which shows how seriously some jurors take the responsibility of making such decisions. I don’t think this juror was thinking formally in a decision-theoretic way, but clearly he was considering these issues informally.

          So in my experience, the Bem/Kanazawa situation as far as losses is concerned is more problematic.

          • Andrew says:


            I agree, there are losses in both directions. Indeed, I suspect this is what motivated the publication of Bem’s work: the ideas were evidently wrong but it looked like he had good statistical evidence. Actually he did not have good statistical evidence, but journal reviewers (and, more generally, the field of psychology and statistics) were much less aware of the importance of the garden of forking paths, back in 2011. In 2011, accepting Bem’s paper was a bad decision but an understandable mistake, given the scientific and statistical standards of the time. Accepting such a paper in 2015 would be a bad decision and less excusable.

            The Kanazawa papers were a different story in that I think there is a lot of innumeracy going on in that subfield, as nobody seemed to notice how absurd it was to claim that certain parents are 26% more likely to have girls. Kanazawa’s theoretical story is only slightly less absurd than Bem’s, but it comported with the reviewers’ sense of what was important in the world, so they were willing to take the statistics at face value without thinking about them too clearly.

            In any case, yes, losses in both directions. To dismiss Bem’s claims if they are in fact true, is to retard the development of ESP research, and who knows how important that could be? To dismiss Kanazawa’s claims if they are in fact true is to retard our understanding of evolution, etc.

            But I don’t like the analogy of statistical hypothesis testing with guilt in a crime because of the weak link between the scientific model and the statistical model.

  5. Phillip M. says:

    I sometimes wonder if it’s the language we use – the singular “p-value” – that causes a lot of grief over this. What if we were to dispense with the term ‘p-value’ entirely and opt for a clearer term or terms that connote its contextual meaning?

    We impose absolutes on something (a probability) that is not absolute. Probabilities are also unitless. How much rigor in meaning can we truly expect to gain from that, other than what each of our psychological differences dictates is ‘acceptable’ or ‘significant’ given a single number?

    The focus on a single ubiquitous value, I think, misses the boat – a lot. The traditional ‘1 in 20’ cutoff for significance can hardly apply to all things (yet it is used as if it were as grand as Fibonacci numbers observed in nature). And given that 0.05 was an arbitrary construct in the first place – why wouldn’t we consider different domain contexts for what significance *means*?

    This too is problematic, as it goes against the human inclination to reduce research outputs to binary trues/falses, passes/fails, supports/refutes, accepts/rejects, etc. I could argue until I’m blue in the face that p-values are merely ‘signposts’ or ‘indicators’ – but unfortunately that doesn’t solve the problem of how heavily interpretations and claims of research results are dichotomized.

    • Nick Menzies says:

      As has been mentioned upthread, an axiomatic approach would involve decision theory. Where there is a threshold reported, it would be the threshold at which the evidence from the study might make us switch from one course of action to another. For this task a p-value is far from ideal for various reasons, but even if we were restricted to working with the p-value, there would be a new p-value for each decision we need to make. To the extent that the information from a single study might inform multiple different decisions, there would be multiple different p-values which are relevant.

      It is impossible for the authors of a particular study to puzzle through all the different potential uses of their data and supply all the relevant decision analyses.

      But people want simple heuristics. So we get p-values interpreted as decision thresholds.
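      One way to see Nick’s point in miniature (everything below is invented for illustration): the same estimate feeds two downstream decisions with different loss ratios, and the break-even evidence threshold differs for each, so no single cutoff serves both:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def act(estimate, se, cost_ratio):
    """Act iff P(effect > 0 | data) exceeds the break-even probability implied
    by cost_ratio = (cost of acting needlessly) / (cost of failing to act).
    Assumes a flat prior, so P(effect > 0 | data) = Phi(estimate / se)."""
    p_positive = norm_cdf(estimate / se)
    return p_positive > cost_ratio / (1.0 + cost_ratio)

# One study, estimate 1.5 with standard error 1.0, two different decisions:
cheap_action = act(1.5, 1.0, cost_ratio=0.2)   # acting is cheap relative to inaction
risky_action = act(1.5, 1.0, cost_ratio=20.0)  # acting is 20x costlier
print(cheap_action, risky_action)
```

      The same data (posterior probability about 0.93 that the effect is positive) clears the bar for the cheap decision and falls short for the costly one; a single reported p-value cannot encode both thresholds.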

  6. dmk38 says:


    I wonder if you really mean this – “something about humans” – literally, or if it is just an “attention grabber” headline, equivalent to an “attention grabber” datavis graphic? (If so, I have no objection to that!)

    If we took that idea seriously, we’d presumably be agreeing that human beings don’t operate as consistently with causal influences as other things we might choose to measure empirically.

    That’s possible — but it is at least as extraordinary a thing to say as what Bem does about ESP!

    It entails some sort of claim about the uniqueness of human beings that exempts them from the mechanistic laws of nature that apply to atoms, rocks, guinea pigs, etc. Maybe it’s because we have “souls” or “free wills” etc… A familiar view but one that science pretty much sets aside, not b/c it “disbelieves” it but b/c the activity of science is to explain things w/o resort to influences that operate independently of mechanistic natural influences and that defy the sorts of apparatuses we use to specify the characteristics of those influences.

    I think there are two other things going on.

    1st, when it comes to studying humans, we (those who read this blog, at least) are inordinately interested in things that are complex and admit of multiple plausible conjectures.

    Consider “simpler” human phenomena: if we whack a person in the head, the likelihood of knocking him or her unconscious goes up in relation to the force applied (within some range at least); if a person isn’t fed, the time it will take for him or her to die of starvation will depend on various specifiable parameters including body fat and energy the person is exerting; if *children* eat lead paint, various of their reasoning proficiencies will be impaired; if we treat strep throat in human beings with penicillin, it will abate (or at least would have at some point in the past before strep mutated in a way that made it resistant — I don’t know if that has happened actually).

    Whether we use a measurement strategy that features p-values or some other set of statistical tools suited for enabling valid inference from observation, we’ll never experience any “there’s something about humans” reaction if we stick w/ these “simple” things.

    We’ll start to have that sense only when we look at more complex phenomena that invite a much wider range of conjectures: e.g., do poor people vote for Republicans whose policy positions would screw them over economically & if so why; are “conservatives” closed minded; are women attracted to men b/c of “status” or physical attractiveness; when you give someone a “Cornell” mug that they wouldn’t have paid 50 cents for, will they insist you give them $500 to buy it back from them– & if so why; is the popularity of Honey Boo Boo a sign of the impending collapse of civilization? Etc.

    Those things all involve mechanisms that are even more subtle, even more remote from the sorts of mechanisms that describe why rocks fly through the air in a certain direction when we throw them, or even why human hair grows faster when we cut it regularly. The “something about humans” reaction is a consequence of disregard for the immense, mundane (if consequential) denominator of human-related phenomena that are much easier to nail down, consistent with the project of explaining things in the way science does.

    2d & relatedly, those things in the numerator reflect *the imprecision of our current measurement instruments* — not anything “special” about human beings in relation to impersonal, mechanistic causal influences. The operative mechanisms are all “latent variables”; getting empirical insight into how they work requires the development of measures that combine valid, reliable observable indicators; w/ respect to these difficult, complex, distinctively human things, forming valid latent variable measuring instruments is really really hard to do. But that’s not a testament to humans being “different.” That’s a testament to the complexity of the physical process we want to measure and the current imperfect state of our measuring instruments.

    I predict that you will agree with me.

    And that you’ll agree with me that these points don’t have much to do with p-values; that is, the measurement challenges are *in* the phenomena and not an artifact of a particular strategy for operationalizing empirical study of them (although strategies that feature p-values tend to be weak for reasons frequently — as it were — discussed in your blog).

    And if my prediction fails– if you don’t agree with me — that will still reinforce my conclusion that I am right: b/c I’m sure if I had a better way to measure the latent variable that is *you*, I would have gotten the prediction right.

    Yes that’s circular. But that’s b/c the premise “nothing unique about humans” is not an empirical proposition; it is a theoretical assumption of the enterprise of studying human beings empirically.

    • Andrew says:


      No, I don’t think there’s something unique about the human sciences. What I think is that various problems with measurement tend to be more serious with humans, so there are some classical statistical methods (those that don’t take measurement problems so seriously) which can work OK in some non-human settings but fail when humans are involved. Really, though, I’m talking about anything with serious measurement issues, whether or not it’s about humans.

  7. dmk38 says:

    You might want to consider this in that case. I’m at wit’s end with this project; nothing seems to work.
