Why is the scientific replication crisis centered on psychology?

The replication crisis is a big deal. But it’s a problem in lots of scientific fields. Why is so much of the discussion about psychology research?

Why not economics, which is more controversial and gets more space in the news media? Or medicine, which has higher stakes and a regular flow of well-publicized scandals?

Here are some relevant factors that I see, within the field of psychology:

1. Sophistication: Psychology’s discourse on validity, reliability, and latent constructs is much more sophisticated than the usual treatment of measurement in statistics, economics, biology, etc. So you see Paul Meehl raising serious questions as early as the 1960s, at a time when, in other fields, we were just getting naive happy talk about how all problems would be solved with randomized experiments.

2. Overconfidence deriving from research designs: When we talk about the replication crisis in psychology, we’re mostly talking about lab experiments and surveys. Either way, you get clean identification of comparisons, hence there’s an assumption that simple textbook methods can’t go wrong. We’ve seen similar problems in economics (for example, that notorious paper on air pollution in China which was based on a naive trust in regression discontinuity analysis, not recognizing that, when you come down to it, what they had was an observational study), but lab experiments and surveys in psychology are typically so clean that researchers sometimes can’t seem to imagine that there could be any problems with their p-values.

3. Openness. This one hurts: psychology’s bad press is in part a consequence of its open culture, which manifests in various ways. To start with, psychology is _institutionally_ open. Sure, there are some bad actors who refuse to share their data or who try to suppress dissent. Overall, though, psychology offers many channels of communication, even including the involvement of outsiders such as myself. One can compare to economics, which is notoriously resistant to ideas coming from other fields.

And, compared to medicine, psychology is much less restricted by financial and legal considerations. Biology and medicine are big business, and there are huge financial incentives for suppressing negative results, silencing critics, and flat-out cheating. In psychology, it’s relatively easy to get your hands on the data or at least to find mistakes in published work.

4. Involvement of some prominent academics. Research controversies in other fields typically seem to involve fringe elements in their professions, and when discussing science publication failures, you might just say that Andrew Wakefield had an axe to grind and the editor of the Lancet is a sucker for political controversy, or that Richard Tol has an impressive talent for getting bad work published in good journals. In the rare cases when a big shot is involved (for example, Reinhart and Rogoff) it is indeed big news. But, in psychology, the replication crisis has engulfed Susan Fiske, Roy Baumeister, John Bargh, Carol Dweck, . . . these are leaders in their field. So there’s a legitimate feeling that the replication crisis strikes at the heart of psychology, or at least social psychology; it’s hard to dismiss it as a series of isolated incidents. It was well over half a century ago that Popper took Freud to task regarding unfalsifiable theory, and that remains a concern today.

5. Finally, psychology research is often of general interest (hence all the press coverage, Ted talks, and so on) and accessible, both in its subject matter and its methods. Biomedicine is all about development and DNA and all sorts of actual science; to understand empirical economics you need to know about regression models; but the ideas and methods of psychology are right out in the open for all to see. At the same time, most of psychology is not politically controversial. If an economist makes a dramatic claim, journalists can call up experts on the left and the right and present a nuanced view. At least until recently, reporting about psychology followed the “scientist as bold discoverer” template, from Gladwell on down.

What do you get when you put it together?

The strengths and weaknesses of the field of research psychology seem to have combined to (a) encourage the publication and dissemination of lots of low-quality, unreplicable research, while (b) creating the conditions for this problem to be recognized, exposed, and discussed openly.

It makes sense for psychology researchers to be embarrassed that those papers on power pose, ESP, himmicanes, etc. were published in their top journals and promoted by leaders in their field. Just to be clear: I’m not saying there’s anything embarrassing or illegitimate about studying and publishing papers on power pose, ESP, or himmicanes. Speculation and data exploration are fine with me; indeed, they’re a necessary part of science. My problem with those papers is that they presented speculation as mature theory, that they presented data exploration as confirmatory evidence, and that they were not part of research programmes that could accommodate criticism. That’s bad news for psychology or any other field.

But psychologists can express legitimate pride in the methodological sophistication that has given them avenues to understand the replication crisis, in the openness that has allowed prominent work to be criticized, and in the collaborative culture that has facilitated replication projects. Let’s not let the breakthrough-of-the-week hype and the Ted-talking hawkers and the “replication rate is statistically indistinguishable from 100%” blowhards distract us from all the good work that has showed us how to think more seriously about statistical evidence and scientific replication.

97 thoughts on “Why is the scientific replication crisis centered on psychology?”

  1. I think you’re leaving out two big ones: psychology has two quite well-replicated areas, of behavioral genetics and IQ research. But no one wants those to be true, so they’ve spent decades swimming in the opposite direction as fast as possible. It sets them up to find a lot of things that just ain’t so.

    • Great point! To see how different the credibility of behavioral genetics is (compared to, say, social psychology) look at this recent review by Plomin & colleagues: http://scottbarrykaufman.com/wp-content/uploads/2016/05/2016-plomin.pdf . As Spotted Toad says, people just don’t LIKE these hard and replicable large-n-study findings on the importance of genes and the unimportance of shared family environment. That in turn creates a market for stuff with a happier message suggesting people are very malleable, problems disappear if you change your attitude, blah blah. Several of the bad apples you mention (Dweck, Bargh, etc.) are obvious examples, and there are many others among the many studies recently found to be unreproducible.

      Scott Alexander has some fascinating blog pieces on this basic topic, see e.g., http://slatestarcodex.com/2016/08/25/devoodooifying-psychology/

      • Fundamental problem with behavioral genetics studies that use fraternal vs. identical twins raised together (the vast majority):

        Identical twins     Fraternal twins
        IQ1     IQ2         IQ1     IQ2
        120     120         120     140
        100     100         140     120
        90      90          100     90
        105     105         90      100
        Above data using behavioral genetics methods: 0 environment, 100% genetic

        Now, I am going to poison my last two pairs of twins with neurotoxins (lead; they all lived in Flint and had shared environmental exposure to lead)

        Identical twins     Fraternal twins
        120     120         120     140
        100     100         140     120
        80      80          90      80
        95      95          80      90
        Above data using behavioral genetics methods: 0 environment, 100% genetic
        Clearly LEAD had a substantial negative environmental influence on IQ but the analysis does not allow for the shared environmental factors to show up.

        Am I not right about this, Andrew?

        • p.s. somewhat more sophisticated methods provide heritability estimates in the 50%+ range. I didn’t mean to be quite so harsh re: all behavioral genetics. I was reacting more to the right-wing media interpretation of these findings rather than Spotted Toad.

        • You seem to be arguing that twin studies are not good for estimating IQ heritability. That’s a fairly common position among IQ research skeptics, but it is unwarranted: we obtain results from fraternal and identical twin studies, but also from studies making use of twins separated at birth, adopted twins, twins reared apart and so on.
          If anything, it’s incredible that so much can be caught through imperfect means (i.e. the genetic effect may be larger); after all, they’re not perfect at differentiating between genetic effects and those of, say, prenatal, perinatal and postnatal “organic” effects. However, with GWAS studies, scientists are slowly pinpointing specific combinations of genes which seem to explain a growing percentage of some important outcomes (e.g. number of years of education — it seems to be around 9% of variance explained; yes, it’s a crude measure, but again — it’s impressive we’re able to capture it!).
          What’s more, there’s an important element of the puzzle missing — the brain. Right now it seems to be genes -> behavior, but it’s pretty much genes -> brain -> behavior. Once we’ve got some neuro equivalent of GWAS we’ll have much better estimations for how much biology is involved in the whole issue.

        • While one would expect the separation of twins to help get at genetic effects they still suffer from common pool problems. You need to not only separate but send them to massively different cultural experiences. This rarely, if ever happens. That’s probably because you’ve got genetic factors working in cooperation.

          All you need are a few fundamental traits to be genetic for genes to have wide ranging social effects. An easy one to pick is height. How different is a person who’s tall treated compared to one who is short across cultures and time? A whole host of behavioural commonalities will end up being due to that one genetic factor that’s going to be more common in identical twins but completely independent of the kind of environmental separation researchers can find or create. The universality of these effects is also probably due to genetic factors (genetic predisposition to favour those who are more physically fit).

        • I’m not sure we disagree, so I’ll just add that there have been some studies (low samples, sure, but they don’t exist in a vacuum) into interracial/intercontinental adoption etc. — this is as far as we can go in regard to massively different cultural experiences. Another way to do this is by following children adopted by parents of high socioeconomic status (i.e. in principle adopted children come from families incapable of or unwilling to take care of their offspring in a manner which would maximize their well-being). I’m not sure we should be worried about that not being different enough.

        • You are right. The thing is, that heritability is often misinterpreted as something that is an inherent property of the genes, but it is always defined as relating to a certain environment. If you change the environment, you change the heritability. Heritability measures how much of the _variation_ in the phenotype is due to genes, and how much is due to environment.

          To illustrate:
          Imagine one dystopian society where all children need to follow a strict nutrition program very closely. Because the environment that is relevant for height is basically identical for all people, height heritability will be very high.

          Now imagine another province in this dystopian country, where the legs of all newborns are cut off at a random point. In this province heritability of height would be much lower, by 50% or so.
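
          As a rough illustration of the point that heritability is a ratio of variances, and so depends on how much the environment varies, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and not taken from the comment above: the purely additive model (phenotype = genetic value + environmental deviation), the normal distributions, the chosen standard deviations, and the function name `simulated_heritability`.

```python
import random
import statistics


def simulated_heritability(sd_genetic, sd_environment, n=20000, seed=0):
    """Share of phenotypic variance due to genetic variation.

    Toy additive model (an assumption): phenotype = genetic + environment,
    with the two components drawn independently from normal distributions.
    """
    rng = random.Random(seed)
    genetic = [rng.gauss(0.0, sd_genetic) for _ in range(n)]
    environment = [rng.gauss(0.0, sd_environment) for _ in range(n)]
    phenotype = [g + e for g, e in zip(genetic, environment)]
    return statistics.pvariance(genetic) / statistics.pvariance(phenotype)


# Nearly uniform environment: theoretical ratio is 36 / (36 + 1).
uniform_env = simulated_heritability(sd_genetic=6.0, sd_environment=1.0)

# Same genetic spread, much more variable environment:
# theoretical ratio is 36 / (36 + 36) = 0.5.
varied_env = simulated_heritability(sd_genetic=6.0, sd_environment=6.0)

print(f"uniform environment: {uniform_env:.2f}")
print(f"varied environment:  {varied_env:.2f}")
```

          With the genetic spread held fixed, the share of variance attributed to genes falls as the environmental spread grows, which is the sense in which changing the environment changes the heritability.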

      • I think contrasting Plomin’s list with the failures of social psych is ridiculous. Most of the insights listed by Plomin are very broad effects or general insights, clearly selected with a focus on what is most reliable and created by an insider with benign intents. Most effects that are under fire in social psych are rather flimsy to begin with, and many, many researchers already doubted them when some who were probably doubting even more started to replicate these effects. And they did not succeed. Keep in mind: the effects selected for replication in social psych were not a random sample of all published effects, but a negative selection, whereas the Plomin list is a positive selection.

        If you want to, you can come up with a comparable list from social psych. It is ironic that many of the failed replications hail from particular sub-fields of psychology that are not even particularly social in nature, like embodiment and self-control research.

        • What would your comparable list for social psychology be? Not a rhetorical question.

          I agree completely that Plomin’s list (or other attempts to summarize the consistent findings of behavioral genetics or psychometrics) is about broad trends and general insights. Nonetheless, we should be able to turn those broad trends and general insights into some kind of informative prior for future research, rather than approaching everything like a newborn baby. (Though to be fair to newborn babies, they probably have lots of informative priors they use to decode the world.)

      • Behavioral genetics was already born dead. Apart from flawed assumptions and massive confounding, its most fatal problem is its uselessness. Heritability estimates might have been of interest in a time when people thought genes determine traits and these traits are unchangeable. Today we know better. Heritability estimates say nothing about the malleability of traits. And we don’t need them to know that genes are involved in making traits, that’s basic developmental biology.

        The premise of behavioral genetics is antiquated – to oppose those two seemingly fundamental forces that shape humans: heredity/genes and environment. First, in humans, genes are not the only substrate of inheritance, maybe not even the most important one. Cultural inheritance is massive. Second, genes and everything else that’s lumped into “the environment” causally interact in development, so it simply doesn’t make sense to try to separate the contribution of genetic vs non-genetic causal factors.

        It’s time to end this complete waste of time.

        • a) Compared to most aspects of social and biological science, behavioral genetics has been barely funded and practically scoured from academic departments, especially in the United States. Calls to “end it” just seem like repression of its findings, rather than any kind of reapportionment of scarce resources.
          b) You made a lot of claims, but not any testable ones.
          c) To me, it seems informative that a large portion of variation in outcomes we care about in our society are genetically determined. Yes, this doesn’t mean they are genetically determined in any conceivable set of circumstances: drop a baby in the middle of the Amazon and its income will likely not be the same as if it grew up in Scarsdale, putting aside the intra-uterine environment. Nonetheless, the findings of behavioral genetics should make us much more skeptical of many Freakonomics/Malcolm Gladwell/Little Things That Make A Big Difference-style stories, where tiny interventions utterly transform later outcomes. And the fact that we can now predict a non-trivial portion of outcomes like educational attainment directly from genotype is a major change in the world.

    • Psychologists have *many* replicated areas (in perception, psychophysics, memory, and more). For some distinct examples:
      –automatic processes such as word reading interfere with tasks that are not automatic– Stroop effect
      –Müller-Lyer illusion
      –influence of top-down knowledge (e.g., words) on perception (e.g., letters)
      –semantic priming
      –distinction between implicit and explicit memory
      –the Zeigarnik effect

      ETC (this just took me about 20 seconds).

      • I’m not saying we don’t have some VERY real problems. But the right has basically been pushing the “one thing we know is that IQ is inherited” business for decades now. And, see below, we don’t really know that. In fact, we know that lots of environmental factors (micronutrients, lead, stress to name a few) have substantial impact on intellectual development.

  2. As for clinical research – I do believe the field gets partial protection (inadvertently) from groups like the Cochrane Collaboration https://en.wikipedia.org/wiki/Cochrane_(organisation) as their product is sorting out and relaying what _the evidence is_ and they can’t really afford to put it in too poor a light e.g. their motto is “Trusted evidence. Informed decisions. Better health.”

    “Varying, uncertain and limited evidence. Informed decisions. Better health.” does not really sound as good.

    In particular “Key criticisms that have been directed at Cochrane’s studies include … an excessively high percentage of inconclusive reviews” which I believe should be taken more as a compliment than a criticism of their work. But the organisations do need to ensure their survival. In some cases Cochrane’s internal structure may make it difficult to publish studies that run against the pre-conceived opinions of internal subject matter experts – https://web.archive.org/web/20140905044321/https://www.radcliffehealth.com/sites/radcliffehealth.com/files/books/samplechapter/5853/Gotzsche%20chpt%2012-45f64580rdz.pdf

    Overall they may be doing more good than harm…

  3. Isn’t it also the case that psychological research translates (often misleadingly and inaccurately) into some kind of “takeaway” that applies to people’s everyday lives? That’s part of why the power pose got so much attention, I think; supposedly anyone can sit up a little taller and enjoy a more successful life.

    In other words, it isn’t only that it’s of general interest; it also translates quickly and badly into a product for personal use.

  4. Individual differences psychology (IQ, personality) and behavioural genetics have indeed quietly tackled their own reproducibility issues long ago. Psychophysics (perception, decision making) never had anything resembling a crisis, due to a combination of questions the area pursues, methodological rigour, and careful experimental control. These areas continue to do just fine, but get little attention these days.

    Cognitive neuroscience, on the other hand, with its small samples, enormous search space, lack of intuitive theory, aim for real-life impact, and huge costs of running studies, is probably producing tons of research that wouldn’t replicate if we had the time and money to try to. At least that’s what I’m fearing, standing in the middle of the field and looking around myself.

    • Psychophysics did have a serious fundamental crisis in the 90’s (to name one). Prior to that it was strongly believed that low level perception could not be modified by learning. All of a sudden the whole psychophysical literature was rediscovering all the fundamental principles of learning like they had never heard of them before.

      And while mentioning learning, their biggest one was probably the late 60’s and early 70’s discoveries that learning is very specific to a species’ innate propensities and the rediscovery of stimulus substitution. Even to this day researchers use levers for rats designed to obfuscate what a rat really wants to do with the lever. Every once in a while this gets rediscovered by someone and they’re reminded that what the rat is doing isn’t what over 90% of the articles say it is.

  5. I have the feeling that science journalists have been going a little easy on biomedical research. I mean, how many people know about the two drug companies reporting that > 80% of preclinical cancer results they’ve followed up on don’t replicate (often even after 10 tries)? And the frequent use of unverified cell lines. Perhaps biomedical science journalists derive their own prestige from their association with academic science, and are (with a few exceptions like Sharon Begley) a little reluctant to go full bore on exposing the magnitude of all this?

  6. Couple of typos to fix, I think, Andrew:

    “at a time in which min other fields” should ‘min’–>’in’?

    “Ta least until recently,” should ‘Ta’–>’At’?

  7. Individual differences psychology (IQ, personality) and behavioural genetics have, as Spotted Toad says, indeed quietly tackled their own reproducibility issues long ago. Psychophysics (perception, decision making) never had anything resembling a crisis, due to a combination of questions the area pursues, methodological rigour, and careful experimental control. These areas continue to do just fine, but get little attention these days.

    Cognitive neuroscience, on the other hand, with its small samples, enormous search space, lack of intuitive theory, aim for real-life impact, and huge costs of running studies, is probably producing tons of research that wouldn’t replicate if we had the time and money to try to. At least that’s what I’m fearing, standing in the middle of the field and looking around myself.

  8. All of the mentioned factors are relevant, but let me venture a hypothesis (which may well not be true). Let’s start with the premise that all these fields are indistinguishable with regards to how frequent or serious the errors are. Without real evidence, I am willing to assume that – I don’t really think psychologists have poorer training than economists or pharma researchers, etc.

    What is missing from the list is what I believe to be the biggest factor, the elephant in the room so to speak. It is MONEY. Pharma and economics involve much more money tied to research than psychology. That means that reputations carry larger monetary consequences. It also means the payoffs from conducting poorly constructed research – and from keeping the data from being publicly available – are larger in these areas. Hence we find more opposition to acknowledging the problems or doing much about them. At the risk of oversimplification, I think this is the heart of the difference.

    But I’m open to other ideas.

    • Economics–not really, except for financial economics. Consider as a random example the first article in the current issue of the top journal, QJE: “Field of Study, Earnings, and Self-Selection”. This paper basically finds that once you properly account for student ability and field of study, going to a prestigious university doesn’t increase your expected lifetime income. I don’t see much money in that.

    • I’d be kind of scared to say that Merck killed a bunch of people with its Vioxx drug because Merck has a giant amount of money to make life hell for its critics. I’m not saying that Merck would, just that Merck is so immensely wealthy that, now that I think about it, I’m kind of terrified I even mention the word “Merck.”

      In contrast, my making fun of social psychologist Susan T. Fiske seems pretty low risk.

  9. Not specific to psychology, but this discussion hit the mainstream press yesterday:

    The article discusses power pose, lack of power in experiments and that “scientists are incentivised to publish surprising findings frequently”. No mention of the garden of forking paths, though.

    Many of the comments on the Guardian site are depressingly anti-science.

  10. It’s a bit off-topic, but I was surprised to see Carol Dweck quoted with Fiske, Bargh and Baumeister. Sure, there was this post a year ago (http://statmodeling.stat.columbia.edu/2015/10/07/mindset-interventions-are-a-scalable-treatment-for-academic-underachievement-or-not/), but even then I had the impression (at least by a quick reading of her papers) that her research is generally more serious and less garden-of-forking-pathish than that of Fiske, Bargh,… Are there failed replications of her work? If yes, could someone put a link to them?

  11. I’m expecting something of a replicability earthquake to occur soon in clinical research, but there is a lot of resistance to admitting that there is a problem. A while back I thought it might be good to get ahead of the curve and arranged a meeting with one of the leaders in burn research, where I mentioned what was going on with replicability in other fields and the likelihood that at least a third of published results were probably not reproducible — leading up to suggesting that we might get organized and undertake some replication studies of major publications which were driving clinical practice and commence a frank discussion of the problems. The result was a tantrum where I was assured that EVERY publication, randomized trials and retrospective studies, even every grant proposal where a power analysis had suggested a particular result could be observed, was completely valid, and any failure to reproduce it was a failure on my part, and my future employment was threatened if I didn’t get with the religion of what they were doing. I opted to move on, as clearly reputation and infallibility trumped good science in that case. I’d anticipate a lot of this sort of thing as this shakes out over the next decade or more.

    • I think you’re right about this. There are some of the same things going on as in psychology: studies conducted by small groups of enthusiasts, lack of understanding of statistical methods, traditional ways of doing things, the need for “significance” for publication, forking paths, publications being key to career advancement and so on. To be fair, there has been a lot of effort over the years to promote understanding of methodological problems and try to get people to avoid them, and that has had a fair impact on some areas (notably clinical trials, systematic reviews), but lots of problems remain. When you get down to lab and small scale clinical studies done by clinicians, I think you’ll find most of the familiar issues in abundance. There is important work to be done here I think – this stuff is important, people’s lives depend on it (and I mean the patients, not the doctors and researchers!).

      • A source of the problem in the clinical research world is that medical students are intensively taught that they are “physician-scientists” and should be actively doing research, compounded with the way surgeons are indoctrinated with the notion that they’re something like the captain of a ship (perhaps “deity”), and thus tend to try to control whatever they’re involved in. The result is clinicians driving research (NIH proposals get reviewed by MDs, who look for MDs among the investigators) while lacking the education in the scientific method that Ph.Ds get (which still leaves something to be desired, but is getting better). This is not to say that there are no great & competent MD researchers — I’ve worked with quite a few — and I’m not suggesting the Ph.D is the golden ticket to being a good researcher (sometimes it seems they pass out doctorates like prizes in a box of cereal, often for “time served”). I don’t see any magic bullet solutions other than to get everyone involved to frankly discuss the issues, and in the context of that awareness, make better choices.

  12. Andrew states:

    And, compared to medicine, psychology is much less restricted by financial and legal considerations. Biology and medicine are big business, and there are huge financial incentives for suppressing negative results, silencing critics, and flat-out cheating.

    I am very skeptical of this view, as explained in a previous comment on this blog.
    First, it confuses financial magnitude with importance. For example, $10M at stake for a company like GE is small potatoes compared to getting tenure or not for a budding academic with 2 kids and a mortgage. (Of course fledgling companies may be in the same boat, but then the problem is not "Big Business".)
    Second, businesses risk huge liabilities for marketing wrong products. In contrast, researchers have tenure, and seldom if ever face real consequences for shoddy research. Ex ante, businesses have big incentives to "kill" the candidate therapy. Only when it passes stringent tests does it go to market. Ex post, if an error is discovered, they may try to hide it. The story about Big Business is nuanced.
    Third, companies are regulated by third parties (FDA, SEC, etc.). This means that (a) research has to meet certain standards, and (b) there are external audits / monitoring / licensing (e.g. Theranos). Academics are typically self-regulated. Arguably the regulated and the regulators (e.g. editors, academic societies, etc) are often one and the same. In fact, the incentives are such that those most in need of regulation are most likely to end up as regulators. Go figure.
    I could go on. But I would just say this. Consider an honest academic declaration of conflict of interest:

    My salary, grants, promotions, professional standing, and career all depend on publishing significant findings frequently. To the best of my knowledge and abilities, I declare these incentives have in no way influenced the integrity of the present research.

    A case can be made that, faced with these incentives, placing academia in charge of discovering truth is like putting the wolf in charge of the chicken coop. Does science belong in academia? Should academia just focus on teaching? What are other outlets for scientific activity?

    • Fernando said, “Third, companies are regulated by third parties (FDA, SEC, etc.). This means that (a) research has to meet certain standards, and (b) there are external audits / monitoring / licensing (e.g. Theranos)”

      But how good is the third-party regulation? My impression is that the government regulators such as FDA and SEC are themselves often constrained by restrictions or requirements mandated by legislation or executive actions. In addition, the regulation is often in terms of strict protocols, which often are not best practices and can often be “gamed.”

    • From my first hand experience, the pharma regulated research is consistently much higher quality than academic only (i.e. not involving collaborations that entail regulatory oversight) – with a few exceptions because there are some exceptional academic groups.

      The biases of pharma are well understood and yes sometimes they do cheat (and have been caught) but unless they are willing to break the law – one can be reasonably confident in what they say they have done (and that sometimes does get audited). But what I learned is that if I don’t personally know the academics and have first hand knowledge of their sense of what careful good research requires – I just don’t believe anything they say they have done.

      Martha asked: But how good is the third-party regulation?
      That does vary and sometimes it is quite poor. For instance as a former FDA director said at the Boston JSM “We got funds to staff up on statistics. At the time, there were few (almost no?) qualified people we could hire. But if we didn’t hire we would have lost the positions. We are still trying to recover from that.”

  13. During my undergraduate, I did a pretty thorough comparison of research methods + publication in psychology, sociology, and economics.

    This (esp #1 – #3) is spot on.

    P.S. @Dale, pretty sure Andrew mentioned money.

    > And, compared to medicine, psychology is much less restricted by financial and legal considerations. Biology and medicine are big business, and there are huge financial incentives for suppressing negative results, silencing critics, and flat-out cheating. In psychology, it’s relatively easy to get your hands on the data or at least to find mistakes in published work.

  14. How about another hypothesis:

    There are astonishingly few interesting (e.g. not obvious) psychological regularities so virtually all interesting published psychological generalizations will be false.

    Unlike virtually every other soft science the human brain has been under evolutionary pressure for millions of years to accurately predict the behavior of other humans. Indeed, many people believe that this pressure is so strong that it was responsible for our rise above the other primates.

    As a result any simple rule useful for predicting non-trivial behavior of individuals or small groups will feel extremely obvious to us. After all we regularly make complex psychological predictions without a second thought even when we have to put ourselves in considerably different shoes.

    Sure, there will be some insignificant (as far as cooperating/competing with someone) generalizations, e.g., people tend to pick the right object X% of the time when asked to select between two identical objects and make up reasons for their choice, and we might also expect some generalizations about crowd behavior that wasn’t present in the evolutionary environment but almost everything else should either be a very subtle, complex effect or really obvious.

    • For example, it’s not uncommon for the opening sentence of a famous novel to have offered a hypothesis about psychology that has since been debated at length:

      – “It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.” —Jane Austen, Pride and Prejudice (1813)

      – “Happy families are all alike; every unhappy family is unhappy in its own way.” —Leo Tolstoy, Anna Karenina (1877; trans. Constance Garnett)

      – “I wish either my father or my mother, or indeed both of them, as they were in duty both equally bound to it, had minded what they were about when they begot me; had they duly considered how much depended upon what they were then doing;—that not only the production of a rational Being was concerned in it, but that possibly the happy formation and temperature of his body, perhaps his genius and the very cast of his mind;—and, for aught they knew to the contrary, even the fortunes of his whole house might take their turn from the humours and dispositions which were then uppermost:—Had they duly weighed and considered all this, and proceeded accordingly,—I am verily persuaded I should have made a quite different figure in the world, from that, in which the reader is likely to see me.” —Laurence Sterne, Tristram Shandy (1759–1767)

      – “The past is a foreign country; they do things differently there.” —L. P. Hartley, The Go-Between (1953)

  15. “Notorious” is the word you use for things like power pose. “air pollution in China which was based on a naive trust in regression discontinuity analysis, not recognizing that, when you come down to it, what they had was an observational study”

    Is that a little harsher than warranted? I thought there was a discontinuity, no? And that regardless of the polynomial degree — including 1 — the model shows a substantial effect. Why don’t you feel you’ve learned something causal from the analysis?

    • The policy was discontinuous in space, and time. But the pollution wasn’t discontinuous in space (it’s blown around by the wind!) and the duration was so long that there was plenty of time for new self-selection equilibria to establish. There is plenty of reason to think that those people sensitive to air pollution might have moved away from the pollution, and therefore had children who were more sensitive to pollution in some other region, and that the children of those who remain might be less sensitive to pollution, and that there would develop through time differences in health care (more asthma doctors in some areas for example), differences in population, differences in education, differences in economic status (some people are getting free heating!) etc that are the real causes of whatever is observed. So rather than identifying an effect of pollution on health the study at best was identifying an effect of a lifetime of interrelated policies and dynamic human choices on a wide variety of social outcomes.

    • L:

      There were lots of problems with that paper. But the quick answer is that it was an observational study in which they provided no evidence that they included enough covariates so that the treatment and control groups would be comparable. Yes, there is a discontinuity, but discontinuity is not magic: the populations living north and south of the river could vary in all sorts of ways.

      Did I learn anything causal from the analysis? No, not really. I already thought that air pollution is bad for life expectancy. Getting a noisy estimate based on a super-crude observational study tells me essentially nothing beyond that.

      • Economists need to stop insisting that their causal inference methods are definitive. There are often alternative explanations, as Andrew suggests above. Regression discontinuity is a fad. There; I said it.

  16. When you say crisis in psychology, you really mean social psychology (which you mention parenthetically way down in your blog post). You are missing, however, the fundamental factor as to why social psychology doesn’t replicate, and it is very simple–the low n of nearly all experiments. A p-value for 1500 observations is very different than that for 50 (Lindley’s and others’ observation), and low n makes what you call the garden of forking paths (multiple hypothesis testing) very simple. Political scientists have the ANES (thousands of respondents) plus most of their other data have a good deal of regularity due to either aggregation or comparisons to previous behavior (say voting records of congressmen). What is really needed in social psychology is an ability to leverage large n in experiments (say like Facebook). People get angry about that (ACHE committees and such) but there’s nothing to prevent non-academic researchers from trying that, and that is where progress in social psychology will come from if it happens at all.

    Regarding your comment “1. Sophistication: Psychology’s discourse on validity, reliability, and latent constructs is much more sophisticated than the usual treatment of measurement in statistics, economics, biology, etc” I couldn’t disagree more. This “sophistication” is basically an attempt to get results where obvious comparisons don’t show them. Once one has a large number of highly correlated variables, letting “sophisticated” methods loose on them without a clear theory (and as in your previous post about Fiske, you state “[t]heir substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.”) This is sophistication? It’s more like theology in its non-falsifiable characteristics (though theology is very sophisticated…) Or consider your comment:

    This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

    Once again, sophistication? By this definition of sophistication, the Duke cancer scam (http://www.nytimes.com/2011/07/08/health/research/08genes.html) was sophisticated. By Fiske’s definition, Baggerly and Coombes were methodological terrorists.

  17. I think economics largely avoids this because
    1) Everyone, applied or not, is expected to have very rigorous understanding of statistical technique, and of theory. “Methodologists” like theorists and econometricians are not shuffled to the side, but rather are held in very high esteem. It is not even controversial to suggest that economists have much better technical training, both before their PhD and during, than psychologists, particularly social psychologists.
    2) The idea of low-N experiments simply doesn’t exist. The closest we have to this are cross-country regressions which have been very much out of style for decades precisely for the low-N/unobserved heterogeneity.
    3) Economists are fairly “hostile” (some might say “jerks”) in how they treat research. The goal when refereeing, editing, training, etc. is to produce better research, period. Seminars are organized so that there are tough questions from minute one. There are regularly papers by big stars which contradict their earlier work should the data suggest it. Deep and fierce disagreements about methods are not just common, but are published in the top journals. I disagree with Romer’s essay, but you may have seen Paul Romer (a Nobel level guy) completely savage the life’s work of Prescott and Lucas (Nobel winners both) this week. This type of critique is very common. An untenured AP, David Albouy, savaged the most famous paper of Acemoglu and his comment was published in the AER. There is massive disagreement about the importance of identification vs external validity, with active debate between big stars (Nobel winner Deaton, for instance, arguing against two future Nobel winners on the other side).
    4) “Political” work is frowned upon – I have no idea the political ideology of most of the well-known faculty in my area. This limits the pressure to p-hack to meet priors.

    • I don’t see how “political” work has ever been frowned upon in economics, particularly since “political” ideology is suffused throughout economics (Mankiw/Krugman–both seem to come up with scholarly work that reflects their beliefs). I haven’t looked at the economics literature recently but Krugman makes a good point that the fresh-water stuff is simply wrong in the current situation. I’m most familiar with economists when they get policy positions in government and aside from creating the conditions which led to the Great Recession (Summers/derivatives, Greenspan/regulation), they weren’t able to speak with anything close to a unified voice on how to solve the crisis (compare environmental scientists on climate change, for example). And some of us still remember “Time on the Cross” (pithy description, not only was slavery profitable but the slaves enjoyed it too–via the magic of revealed preference analysis–I’d throw in “The Bell Curve” but you could plausibly argue that Murray is not an economist–though his social Darwinism would fit in any economics department).

      • Studies like this and this suggest that replicability is a problem for economics, too.

        For a more truthful summary of The Time on the Cross, see here.

        The Bell Curve was written by a psychologist and a political scientist, so I’m not sure why you bring it up. It was harshly attacked by a number of economists, but Murray and Herrnstein are of course right in their main arguments.

    • Really? Economists are constantly inappropriately drawing causal, policy conclusions from correlational data (and yes, I’m familiar with the “causal inference” world). Data analyses are often readily overturned depending on the dependent variable selected (of very similar variables) or depending on what exactly one decided to “control for.” There are incentives to publish cute/clever/unexpected findings in economics. And political ideology does have a dramatic influence on some of the decisions therein. I don’t think economists should be throwing stones at psychologists.

      A psychologist who recognizes at least that her field has a problem.

    • Economics seems more masculine than other social sciences (the first female quasi-Nobelist in econ wasn’t until a few years ago) and it seems more like a contact sport than other social sciences.

      Psychology seems more feminine: Dr. Fiske’s essay could be summed up as, “Well, I never!”

  18. Other behavioral science fields that do randomized experiments (e.g., clinical medicine, development economics) have incentives to find interventions with large effects, and to know whether they are large relative to costs.

    On the other hand, for various reasons, social psychology has recently often tried to find effects of really minimal interventions that don’t need to be something realistically implementable etc. But more or less kept the same sample sizes that give you plausibly good power for much more substantial interventions.
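
    To make the sample-size point concrete, here is a minimal sketch using a normal approximation to a two-group comparison. The effect sizes (0.5 and 0.1 standard deviations) and the 50 participants per group are illustrative assumptions, not figures from any particular study.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the assumed true difference in means, in standard-deviation units
    (both groups are assumed to have sd = 1 and equal size).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected z-statistic: d divided by the standard error sqrt(2 / n).
    shift = d * (n_per_group / 2) ** 0.5
    # Probability the observed z falls beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical comparison: a substantial intervention vs. a minimal one,
# both studied with the same 50 participants per group.
for d in (0.5, 0.1):
    print(f"assumed effect d={d}: approximate power = {approx_power(d, 50):.2f}")
```

    Under these assumptions the larger effect is detected most of the time, while the smaller one is detected only slightly more often than the 5% false-positive rate, which is the mismatch described above.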

  19. Unlike the physical sciences, psychology has no well-established theoretical consensus against which nutso outcomes can be evaluated. Science is about coherence (a no on that as Alice’s Queen would say), consilience (baskets full of papers having nothing to do with each other), and consensus (everybunny agrees on climate change or at least 97%).

    • Scott:

      When I said that the crisis has engulfed Dweck, I wasn’t referring to any particular replication attempt. It may be that such attempts (successful or failed) of her work exist, but I have no idea. To me, the replication crisis is not just about replication, it’s also about methodological criticism. For example, when I read the paper by Kanazawa and saw that his sample size was way too small to estimate any realistic effect size, I consider that to be an example of the replication crisis, even though any replication was hypothetical.

      Similarly, when I wrote that Dweck’s estimated effect sizes were probably too high, given the bias arising from the statistical significance filter and researcher degrees of freedom, I consider those issues to be part of the replication crisis, in that I’m pretty sure that if those studies of Dweck’s were to be subject to preregistered replications, then the estimated effect sizes in the replications would probably be much smaller than the estimates reported in her papers, and also likely not statistically significant. Part of the replication crisis is that it calls into question certain published claims, even in advance of any actual replication. And I think this is fair in that, statistically, such claims were systematically overstated.

      To say that Dweck’s work has been engulfed by the replication crisis is not to imply that she has done anything unethical. Rather, she and her colleagues were using standard methods that had big big problems that most of us (Meehl and a few others excepted) did not fully recognize.
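
      The statistical significance filter can be illustrated with a small simulation. This is a sketch under assumed numbers (a true effect of 0.1 standard deviations, 50 participants per group), not a model of any of the studies discussed; estimates are drawn directly from their approximate sampling distribution.

```python
import random
from statistics import NormalDist, mean


def significance_filter(true_effect=0.1, n_per_group=50, n_studies=20000, seed=1):
    """Simulate many studies and compare all estimates with the 'significant' ones.

    All parameter values are hypothetical. Each study's estimate is drawn from
    a normal distribution centered on the true effect, with the standard error
    of a difference in means for two groups with sd = 1.
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold

    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Keep only the studies that would have reached p < 0.05.
    significant = [e for e in estimates if abs(e) / se > z_crit]

    mean_all = mean(estimates)
    mean_abs_significant = mean(abs(e) for e in significant)
    share_significant = len(significant) / n_studies
    return mean_all, mean_abs_significant, share_significant


mean_all, mean_abs_sig, share = significance_filter()
print(f"mean of all estimates:               {mean_all:.3f}")
print(f"mean |estimate| among significant:   {mean_abs_sig:.3f}")
print(f"share of studies reaching p < 0.05:  {share:.3f}")
```

      With a standard error of 0.2, an estimate has to exceed roughly 0.39 in absolute value to be significant, so every estimate that passes the filter is several times larger than the assumed true effect of 0.1.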

  20. I would suggest that a historical perspective could contribute to an understanding of how psychology got where it is.

    Especially in social psychology, early empirical work was a demonstration or application of strong and rich theories.
    Theory had the primacy and empirical evidence was not really conceived of as ‘evidence’ but as showcasing the theory.

    However, the showcasing attracted a lot of attention and soon replaced theory as the main scholarly contribution. Research designs that worked well as toy examples in the showcasing paradigm were mistaken for evidence. Theory was degraded to funny anecdotes that served as the opener of a paper.

  21. I’d add another reason for why psychology seems to stand out: although psychologists are in general offended at this notion, even laypersons have some psychological knowledge or at least an opinion on psychological matters; what’s more, despite the sophisticated vocabulary, all studies can be summarized and made understandable to the reader, i.e. the reader can have an opinion (perhaps wrong, but I don’t see myself having an opinion on protein synthesis, whereas I will definitely have an opinion on whether learning *after* an exam will boost my grade [Daryl Bem] or whether it seems plausible for hurricane names to be linked with how devastating they are).
    So the bottom line is: it’s pretty easy to have an opinion about psychological topics (even if it’s wrong or based on a misunderstanding of the topic) and it’s easy for a “common sense skeptic” to notice studies which seem plain wrong or uninformative.

    • The corollary of this is that there is a tremendous market in the popular media for gee-whiz studies, because everyone wants to learn about themselves (but especially, about their spouses and co-workers). There are plenty of people prepared to lap up anything that has the respectability of /a/ “science” and /b/ something as “authoritative” as the Huffington Post, Cosmopolitan, or the Daily Mail.

      My late parents-in-law subscribed to the view that pretty much anything that was printed in a newspaper or spoken on TV must be true, because they “knew” that publishing/going on TV was “hard” and “only available to Very Serious People”. They had this idea that people Wouldn’t Be Allowed to print and distribute something if it wasn’t True. There are still a great many people who subscribe to ideas not much more sophisticated than this, especially with added Scienciness.

      As “John Schmidt” (the new challenger to Dr. Primestein) put it in his piece on the Noise Miners:

      Sue told me that the noise was featured in Psychology Today and Buzzfeed. “We’re just happy we can be a small part of peoples’ lives, giving them the noise they need to start a conversation with a stranger, argue with their friends, or confirm their own pre-existing biases. It’s a small thing, but that’s what makes noise mining so special.”

  22. Another important reason, one of the most important in my view, is that psychology experiments are often technically very easy to replicate, and the advent of MTURK made it possible to replicate them with considerably more participants and lower cost than the original studies required. Thus it is easier to check whether psychology studies replicate, and therefore easier to find ones that don’t.

    • This was my take on why John Ioannidis had much more success identifying problems in genetic association studies than many others had in randomized clinical trials.

      With randomized clinical trials, if newer studies were underway, it was years before one would know how well they replicated whereas with genetic association studies often other labs had materials already in hand that they could see how well things replicated.

      A drastic change in cost and speed of assessing replication and also likely higher visibility of more seriously sciency subject of genetics.

  23. I’m very happy that Andrew made this post because it got me thinking a bit further about this. It is true that we in psychology once enjoyed a large lead in statistical sophistication. It’s one of the primary reasons we have always insisted on teaching our own stats courses (that and the recognition that examples matter a lot). However, that’s also made us a bit insular and I think that within the last few years the training of new students in other social science fields often surpasses us.

    Nevertheless, having the crisis center on my field probably means the necessary change will occur there soonest. Those changes most likely will have to do with incentives so I look very much forward to that.

    For those who think the primary reason the replication crisis is happening in psychology is because it’s a particularly weak science, I point you to Retraction Watch. We most definitely do not dominate the top spots on the leaderboard. And if you look through carefully you see representatives in biological sciences, social sciences, “hard” sciences, the works.

  24. I would guess that neuroscience will be the next discipline to fall in the replicability crisis. The enormous number of possible comparisons from large data sets, small number of subjects and low power, opportunities for post-hoc hypothesis adjustment, and seductive opportunities for splashy headline-grabbing research, make it the obvious pick. If I had to guess a particular subfield, it’d be sex differences.
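
    As a rough illustration of why many possible comparisons matter: if one assumes k independent tests, each run at level alpha with no true effects anywhere, the chance of at least one “significant” result is 1 - (1 - alpha)^k. Real comparisons within a data set are correlated, so treat this as a back-of-the-envelope sketch rather than a description of any actual analysis.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive among k independent null tests.

    Independence is an assumption made for simplicity; correlated tests
    would give a different (typically smaller) value.
    """
    return 1 - (1 - alpha) ** k


# Hypothetical numbers of comparisons a researcher might have available.
for k in (1, 20, 100):
    print(f"{k:>3} comparisons: P(at least one p < 0.05) = {familywise_error(k):.3f}")
```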

      • I think microarrays are already on the outs, the biologists I know never really trusted them. The alternative technology that is fully replacing microarrays is what’s called Next Generation Sequencing, aka RNA-seq. This basically involves sequencing small fragments of RNA to see what fraction of the fragments are from each section of the genome, and thereby infer something about how much protein is being expressed from each gene. The thing Biologists like about it is that it gives counts, and this somehow (probably falsely) reassures them that it’s a reliable method. I think quantifying fluorescence which is the chip/array based technology is problematic because there can be lots of reasons (contamination etc) why things fluoresce, at least with RNA-seq you know that you’re counting RNA fragments and not quantifying how much of a fluorescent dye/protein/whatever you accidentally contaminated your sample with.

        That being said, many problems still remain with RNA-seq, and making conclusions based on RNA-seq data is very problematic. It’s much more of a discovery tool than something that validates a theory. For discovery of relevant pathways/genes/proteins it seems reasonable. Following up on that with genetic manipulations is the key to getting real discoveries, but doing that carefully costs years of time and many dollars. Those who publish flashy findings out of RNA-seq analyses are still rewarded with high publication counts… it’s definitely a problem

        • Another problem with chip methods is that you only look for/detect things that are included on the chip — so no possibility of discovering something new, as well as ascertainment bias.

        • Well, usually there will be a follow-up experiment using qPCR, which is relatively cheap. I think the field is well aware of this issue. Biological papers are often a mixture of different techniques that measure different aspects of the same phenomena. I think the replication issue in cancer research is more to do with the definition of “response to treatment” than with NGS analytical methods or even p-values ;)

  25. “Psychology’s discourse on validity, reliability, and latent constructs is much more sophisticated than the usual treatment….”

    This does strike me as rather immodest hyperbole. Lots of fields have worked out sophisticated statistical analyses.

    I suspect that the problems have arisen most obviously in areas that use a lot of NHST. Psychology and clinical trials use them a lot, as do some sorts of lab sciences. Others use them very little. My own area, the stochastic analysis of single molecules, is concerned largely with estimation. So far, at least, there haven’t been serious disagreements about the data in that field. Searching through the pdfs of my publications back to the 1960s reveals that I have never once used the word ‘significant’ in its statistical sense.

    • David:

      I don’t really know what you mean by “immodest” in this context. But, just to clarify: I agree that many psychology researchers, including many prominent members of the species, suffer from serious misunderstandings about statistics. But within psychometrics I’d say the discussion of validity, reliability, and latent constructs is sophisticated, more so than in, say, statistics, economics, or public health. Also, remember that many ideas that are relatively new in statistics were previously invented in psychometrics many decades before. So, even though Bargh, Cuddy, etc., display naive attitudes on statistics, I think that the advanced thinking in psychometrics explains some of the openness in psychology to ideas of replication and criticism.

  26. Well perhaps we should recall that “advanced thinking about psychometrics” in the 1930s did huge social harm by claiming to be able to measure the worth of a child at the age of 11. They were vastly over-confident in their factor analyses.

    It’s great that psychologists are now taking reproducibility seriously, but perhaps not so good that it took so long.

    I regularly ask audiences what they think a P value means, and it’s still rare to get an accurate definition of a P value, and even rarer for people to understand their limitations. Despite the valiant efforts of people like you :-)

    • David:

      Lots of bad stuff in psychometrics (as in other fields), sure. In my above post I wasn’t trying to make a case for psychology being better or worse than other fields; I was just trying to get at some reasons why psychology has shown awareness and action on the replication crisis, while other fields seem to be moving more slowly. Part of the story is, as you say, that null hypothesis significance testing has been central to so much psychology research. I think another part of the story is psychologists’ proximity to sophisticated thinking regarding measurement and latent constructs.

      • >”Part of the story is, as you say, that null hypothesis significance testing has been central to so much psychology research. I think another part of the story is psychologists’ proximity to sophisticated thinking regarding measurement and latent constructs.”

        Don’t forget that after education research, psychology was the first field to widely adopt and institutionalize NHST. Others like medicine are lagging by a few decades. They haven’t fully experienced the negative effects yet. Remember that for the first 30-40 years the big names will still have been trained without the new “technique”. Those people are in a position to stem the flow of nonsense, or at least its impact.

        *If* psychology is really going to change its ways, it looks like it takes a lower bound of one generation trained and living out a career using it to generate enough BS that the next senses something seriously wrong and initiates the process of abandoning it.

  27. There is a touching faith in some of the comments above (and in Andrew Gelman) that psychometrics and the kind of individual differences psychology found within behavioural genetics is in some ways methodologically superior to other areas of psychological research.

    That faith is completely unfounded.

    1. The major problem facing all of psychology has nothing to do with the sophistication of its statistical methods or thinking. It’s all about so many psychologists failing to understand the constituent properties of measurement. I’ve summarised the main bodies of evidence in this regard, and its consequences in an open-access article … the references alone will be an eye-opener to many.
    Barrett, P.T. (2018). The EFPA test-review model: When good intentions meet a methodological thought disorder. Behavioural Sciences (https://www.mdpi.com/2076-328X/8/1/5), 8,1, 5, 1-22.
    Psychometrics is, as Joel Michell has stated many times, a pathology of science. Those who maintain the pretence of course ignore all this work. But ignored facts have a habit of leaking unexpected adverse consequences, regardless of the efforts by many to maintain a studious ignorance.

    I would also highly recommend Trendler, G. (2018). Conjoint measurement undone. Theory and Psychology (http://journals.sagepub.com/doi/abs/10.1177/0959354318788729), In Press, 1-29, who addresses Andrew’s focus on statistical issues while ignoring the more fundamental measurement problem.

    2. The Behavioural genetics GWAS work is hardly credible – as a very recent review article sets out:
    Feldman, M.W., & Ramachandran, S. (2018). Missing compared to what? Revisiting heritability, genes and culture. Philosophical Transactions of the Royal Society: Series B (http://dx.doi.org/10.1098/rstb.2017.0064), 373, 1-8.

    3. As to latent variables/models and all that statistical codswallop .. I suggest a close reading of Part 2 of Mike Maraun’s online book:
    The Myth of Latent Variables.
    Maraun, M.D., & Gabriel, S.M. (2013). Illegitimate concept equating in the partial fusion of construct validation theory and latent variable modeling. New Ideas in Psychology, 31, 1, 32-42.

    4. In terms of what we should be looking at/doing things differently, might I suggest two articles to act as intellectual ‘prods’, especially the latter:
    Ferguson, C.J. (2015). “Everybody knows psychology is not a real science”: Public perceptions of psychology and how we can improve our relationship with policymakers, the scientific community, and the general public. American Psychologist, 70, 6, 527-542
    Tryon, W.W. (2016). Underreliance on mechanistic models: Comment on Ferguson (2015). American Psychologist, 71, 6, 505-506.

    What’s missing in many psychologists is a fundamental honesty and scientific integrity – the kind spoken of by both the late David Freedman & Richard Berk, and Richard Feynman:
    Freedman, D.A., & Berk, R.A. (2003). Statistical assumptions as empirical commitments. In T.G. Blomberg & S.Cohen (Eds.). Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed. (pp. 235-254).

    and especially:
    Feynman, R.P. (1974). Cargo Cult Science: some remarks on science, pseudoscience, and learning how not to fool yourself. Engineering and Science, 37, 7, 10-13 (see p. 11)
    “In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas – he’s the controller -and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.

    Now it behooves me, of course, to tell you what they’re missing. . . . It’s a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty -a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid -not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked -to make sure the other fellow can tell they have been eliminated. . . . In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another”.

    Until psychologists begin to think again, as honest investigative scientists rather than statisticians or publishing factories, then we are going to be stuck with their junk-science for years to come, regardless of any clerical attempts to stop them fudging data and results.

    • Paul:

      I have appreciated your insights about the limitations of psychometrics for quite some time (https://www.ncbi.nlm.nih.gov/pubmed/16171413) and appreciate the recent citations and links on the topic. I once held out hope for the possibility for modern psychometric methods to enforce a rethinking of the nature of the process itself, but over time have come to the same conclusion as you have. There are too many incentives to avoid acknowledging the limitations of psychological “measurement” to expect substantial changes anytime soon.
