
My talk tomorrow (Tues) noon at the Princeton University Psychology Department

Integrating collection, analysis, and interpretation of data in social and behavioral research

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The replication crisis has made us increasingly aware of the flaws of conventional statistical reasoning based on hypothesis testing. The problem is not just a technical issue with p-values, nor can it be solved using preregistration or other purely procedural approaches. Rather, appropriate solutions have three aspects. First, in collecting your data there should be a concordance between theory and measurement: for example, in studying the effect of an intervention applied to individuals, you should measure within-person comparisons. Second, in analyzing your data, you should study all comparisons of potential interest, rather than selecting based on statistical significance or other inherently noisy measures. Third, you should interpret your results in the context of theory, background knowledge, and the data collection and analysis you have performed. We discuss these issues on a theoretical level and with examples in psychology, political science, and policy analysis.

Here are some relevant references:

Some natural solutions to the p-value communication problem—and why they won’t work.

Honesty and transparency are not enough.

The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

And this:

No guru, no method, no teacher, Just you and I and nature . . . in the garden. Of forking paths.

The talk will be Tuesday, December 4, 2018, 12:00pm, in A32 Peretsman Scully Hall.


  1. Steve says:

    If you don’t mind a comment from a non-scientist, but avid reader of your blog, I never quite agree with your point that honesty and transparency are not enough. I get the point that the problem is not necessarily a question of the personal ethical failure of the researcher and that sometimes studies fail because the design is poor. However, at a certain point ethical standards have to be raised sufficiently high to prevent researchers from representing their research as sound when it is not. That is a question of integrity too. If a researcher does not know enough to know that her design is poor or her measurements too noisy to extract meaningful conclusions, then honesty requires that she not report the results as valid or reliable. There are just too many scientists, particularly but not only in the social sciences, who do not have the expertise to use certain statistical methods to extract data, or who do not know enough about the instruments they are using to know how noisy their measurements may be. These researchers have to be honest about their lack of knowledge and consult with statisticians before they design their studies. Otherwise, they are just guessing that the methods they employ will yield valid results. In the professions of medicine and law, it is part of the ethical requirements of the profession that a professional must not offer opinions outside their particular area of knowledge. That is, I fear, not a rule that many academics abide by, but they should, and that is also about honesty and raising the ethical standards of good research.

    • Andrew says:


      I think you’re arguing that honesty and transparency are necessary but not that they are sufficient.

      • Steve says:

        What I actually have in mind is Peirce’s pragmatism that holds, “if . . . what we think is to be determined in terms of what we are prepared to do, then surely logic, or the doctrine of what we ought to think, must be an application of the doctrine of what we deliberately choose to do, which is Ethics.” In short, all of your choices of how researchers both report their results and design their studies come down to choices about how to get at the truth or they come down to a choice to make other goals the priority. Along the way, one can forgive a researcher for choosing a method that later proves unreliable. But, once it is clear that the method fails, then honesty requires it be abandoned (at least in the setting it was found unreliable). At root scientists are making decisions on how to act, and that has its foundations in ethics. Perhaps, the point is too philosophical. But, I thought that there was a practical point, which is that today’s research “failures” need to become tomorrow’s unethical behavior.

        • Keith O’Rourke says:

          Steve: > once it is clear that the method fails, then honesty requires it be abandoned
          On the other hand, maybe it’s more a case of not knowing that they don’t know. A couple of weeks ago I was talking to a few recent science grads who had taken a few stats courses and had either already published papers or were in the process of doing so. When I asked what they thought a p-value was, they all said they were taught it is the probability of the null hypothesis being true.

          > consult with statisticians
          Those consulting experts seldom want to know more than that the person they are conferring with is an expert and will act in the client’s best interest. This is true of lawyers to the degree that they have to pass the bar and face penalties for not acting in the client’s best interest. Apart from statisticians not having either, conferring scientists are unlikely to have the funds to do more than ask for quick advice that they will have to implement on their own. Sort of like having a couple of hours to get legal advice and then running your own defence.

          Anyway, nice quote from Peirce (reference?). You might like this post here and

  2. Great article Keith. I have to admit that I was scratching my head in my Intro to Stats class. I have a hard time thinking in dichotomies anyway. Probably b/c false dichotomies are so prevalent in nearly every discipline.

  3. Steve says:

    Yes, there will always be a gap between the ideal and actual practice, but if we don’t raise standards over time, we will just end up back where we are. I would think that if a norm arose that researchers employing sophisticated statistical methods or unproven experimental designs (or pick your potential methodological pitfall) had to consult a statistician, and that person was to be named in the study, then at least there would be some representational pressure to improve scientific methods. I have read your posts on Peirce before (a much underappreciated philosopher). I would love to see you post something explaining to a non-statistician how your Bayesian views are compatible with what I take to be Peirce’s hostility to such views (or maybe a post explaining how I have misunderstood Peirce).

  4. Davy says:

    I think this statement is *way* too strong: “in studying the effect of an intervention applied to individuals, you should measure within-person comparisons.”

    I think it is too strong for two reasons:
    1. As Uri Simonsohn has shown, within-person analyses can actually hurt statistical power in some cases. This was surprising to me at first, but it is totally right.

    2. Many things that psychologists study have to be done between people. For example, in Kahneman and Tversky’s representativeness work, they ask some people how many murders take place each year in Detroit and other people how many murders take place each year in Michigan. They find that people think there are more murders in Detroit than in Michigan. If you ask people both of these questions (within person), then they don’t “fall for the trick”. This is an example where doing something between subjects is good practice. Same thing with the famous Asian Disease Problem.

    If I were you, I would soften the claim to be something more like, “it is often the case that within-person analyses are doable and can increase statistical power.”

    • Andrew says:


      Regarding your second point: Sure, there are tradeoffs, and there are some examples where it’s difficult or even impossible to do two treatments on each person, so that the crossover design is not feasible. Another example would be certain medical studies such as heart surgery where you could do treatment A or treatment B on a person but not both.

      Regarding your first point: No. The problem in that example you link to is not the within-person comparison but the data analysis in which comparisons are done of gain scores. The right thing to do is not to take averages of after – before. The right thing to do is to regress after on before. If correlation between before and after is low, then the regression coefficient will be close to zero. This is actually related to an example that I’ve discussed in many of my classes (the sham treatment in the chick-brains example, for any students or former students reading this).

      In summary: I recommend within-person comparisons, but I agree that in some settings they are not always possible. The point is that within-person comparison should be the default, and people should need a good reason if they want to study within-person effects using between-person comparisons.
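The contrast above between averaging gain scores (after minus before) and regressing after on before can be checked with a small simulation. The following is a minimal sketch on simulated data, not a reanalysis of the linked example or of the chick-brains example: the function name `simulate_se`, the sample size, the effect size, and the unit-variance normal model are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_se(rho, n=200, effect=0.5, n_sims=2000, seed=0):
    """Empirical standard error of three treatment-effect estimators.

    Hypothetical setup: half of n people are randomly treated; `before` and
    `after` each have unit variance and correlation `rho` (absent treatment).
    """
    rng = np.random.default_rng(seed)
    est = {"post_only": [], "gain_score": [], "regression": []}
    for _ in range(n_sims):
        treat = rng.permutation(np.repeat([0, 1], n // 2))
        before = rng.normal(size=n)
        noise = rng.normal(size=n)
        after = rho * before + np.sqrt(1 - rho**2) * noise + effect * treat
        t, c = treat == 1, treat == 0

        # 1. Ignore the before measurement entirely.
        est["post_only"].append(after[t].mean() - after[c].mean())

        # 2. Compare average gain scores (after - before).
        gain = after - before
        est["gain_score"].append(gain[t].mean() - gain[c].mean())

        # 3. Regress after on treatment and before; keep the treatment coefficient.
        X = np.column_stack([np.ones(n), treat, before])
        coef, *_ = np.linalg.lstsq(X, after, rcond=None)
        est["regression"].append(coef[1])
    return {name: float(np.std(vals)) for name, vals in est.items()}


if __name__ == "__main__":
    for rho in (0.2, 0.8):
        print(rho, simulate_se(rho))
```

Under these assumptions, when the before/after correlation is low the gain-score comparison should come out noisier than ignoring the before measurement altogether, while the regression estimate should not, because the fitted coefficient on before shrinks toward zero.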

      • Davy says:

        Thanks for the reply. It sounds like we basically agree on point #2. I would just add that I think it is very common in psychology to be worried that a within-subjects design simply isn’t very feasible. Perhaps one of the most common questions in a psych seminar is “do you think you would find this same effect if you did it within person?”.

        Regarding point #1… yes, you are right. I just went and reread the Simonsohn post and he actually mentions this as well. If you control for the earlier IAT score (rather than difference it out), it can’t reduce power. Thanks for helping me see this. I guess the only thing I would still add is that if the measure you are controlling for is noisy, it is surprising (at least to me) how little it improves your statistical power. If you have a super noisy measure, doing a within-subject design can help, but still won’t get you very far. Like putting lipstick on a pig…
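The point about noisy measures can be made concrete with a standard variance calculation. This is a sketch under assumptions not stated in the thread: before and after scores have a common variance $\sigma^2$ and correlation $\rho$, the relationship is linear, and the sample is large enough to ignore estimation error in the regression slope.

```latex
\begin{aligned}
\text{after only:} \quad & \operatorname{Var}(y_{\text{after}}) = \sigma^2 \\
\text{gain score:} \quad & \operatorname{Var}(y_{\text{after}} - y_{\text{before}}) = 2\sigma^2(1-\rho) \\
\text{regression on before:} \quad & \operatorname{Var}(y_{\text{after}} \mid y_{\text{before}}) = \sigma^2(1-\rho^2)
\end{aligned}
```

Gain scores are noisier than ignoring the before measurement whenever $\rho < 1/2$. Regression adjustment is never noisier than either alternative, since $1-\rho^2 = (1-\rho)(1+\rho) \le \min\{1,\, 2(1-\rho)\}$. But with a noisy measure, say $\rho = 0.3$, the residual variance falls by only 9% (about 5% in standard error), which is consistent with the observation that controlling for a noisy baseline helps only a little.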

  5. Fritz Strack says:

    I completely agree with Andrew, except for his advocating within-participants measures without exception.

    Of course, this is an economic way to reduce error variance originating from interindividual differences. In my field (social psychology), there is a basic insight saying “every measurement is a treatment”, meaning that every subsequent assessment is influenced by its predecessor. This may occur in different ways. For example, participants may be more likely to become aware of the purpose of the study and be a good (or a bad) subject.

    • Fritz Strack says:

      What I want to say is that within-person measurement is not only a matter of feasibility but also, and perhaps more important, a question of validity.

      • The only threat to validity I see is the validity of the model: if your statistical model pretends you didn’t do the first measurement/treatment, then it fails. Psychologists, and all scientists, need to start thinking about building models of processes, instead of pretending they are studying RNGs. This is the biggest overarching problem that frequentism created.

  6. Marcus Crede says:

    This is probably a silly question but was Susan Fiske in attendance for your talk?

  7. Donn says:

    This may be somewhat off topic, but have you ever commented on or critiqued Grice’s Observation Oriented Modeling? It seems this approach may deal with some of the often problematic assumptions of Bayesian and frequentist stats...
