What’s the evidence on the effectiveness of psychotherapy?

Kyle Dirck points us to this article by John Sakaluk, Robyn Kilshaw, Alexander Williams, and Kathleen Rhyner in the Journal of Abnormal Psychology, which begins:

Empirically supported treatments (or therapies; ESTs) are the gold standard in therapeutic interventions for psychopathology. Based on a set of methodological and statistical criteria, the APA [American Psychological Association] has assigned particular treatment-diagnosis combinations EST status and has further rated their empirical support as Strong, Modest, and/or Controversial. Emerging concerns about the replicability of research findings in clinical psychology highlight the need to critically examine the evidential value of EST research. We therefore conducted a meta-scientific review of the EST literature.

And here’s what they found:

This review suggests that although the underlying evidence for a small number of empirically supported therapies is consistently strong across a range of metrics, the evidence is mixed or consistently weak for many, including some classified by Division 12 of the APA as “Strong.”

It was hard for me to follow exactly which are the therapies that clearly work and which are the ones where the evidence is not so clear. This seems like an important detail, no? Or maybe I’m missing the point. The difference between significant and not significant is not statistically significant, right?

They also write:

Finally, though the trend towards increased statistical power in EST research is a positive development, there must be greater continued effort to increase the evidential value—broadly construed—of the EST literature . . . EST research may need to eschew the model of small trials. A combined workflow of larger multi-lab registered reports (Chambers, 2013; Uhlmann et al., 2018) coupled with thorough analytic review (Sakaluk, Williams, & Biernat, 2014) would yield the highest degree of confirmatory, accurate evidence for the efficacy of ESTs.

This makes sense. But, speaking generally, I think it’s important when talking about improved data collection to not just talk about increasing your sample size. Don’t forget measurement. I don’t know enough about psychotherapy to say anything specific, but there should be ways of getting repeated measurements on people, intermediate outcomes, etc., going beyond up-or-down summaries to learn more from each person in these studies.
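To make the measurement point concrete, here is a minimal sketch in Python. It is not from the article. It assumes a deliberately simple model in which each person's observed outcome is the average of k independent, equally noisy measurements of a stable true score, and every name and number in it (`se_of_group_difference`, the standard deviations, the sample sizes) is made up for illustration.

```python
import math


def se_of_group_difference(n_per_arm, sd_between, sd_measurement, k_measurements):
    """Standard error of the difference between two arm means.

    Assumed model (illustrative only): observed score = true score + noise,
    where the observed score is the mean of k independent measurements.
    """
    # Variance of one person's averaged score: true-score variance plus
    # measurement-error variance shrunk by the number of measurements.
    var_person = sd_between ** 2 + sd_measurement ** 2 / k_measurements
    # Two independent arms of equal size.
    return math.sqrt(2 * var_person / n_per_arm)


# Hypothetical trial: 50 per arm, one noisy measurement per person.
baseline = se_of_group_difference(50, 1.0, 1.5, 1)
# Option A: double the sample size, still one measurement each.
double_n = se_of_group_difference(100, 1.0, 1.5, 1)
# Option B: keep 50 per arm, but average four measurements per person.
four_measurements = se_of_group_difference(50, 1.0, 1.5, 4)

print(round(baseline, 3), round(double_n, 3), round(four_measurements, 3))
```

Under these assumed numbers, averaging four measurements per person at 50 per arm gives a slightly smaller standard error than doubling the sample to 100 per arm with a single measurement each. Real repeated measurements are usually correlated, so the gain in practice would typically be smaller than this independent-noise sketch suggests.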


So, there’s lots going on here, statistically speaking, regarding the very important topic of the effectiveness of psychotherapy.

First, I’d like to ask, Which treatments work and which don’t? But we can’t possibly answer that question. The right thing to do is to look at the evidence on different treatments and summarize as well as we can, without trying to make a sharp dividing line between treatments that work and treatments that don’t, or are unproven.

Second, different treatments work for different people, and in different situations. That’s the real target: trying to figure out what to do when. And, for reasons we’ve discussed, there’s no way we can expect to approach anything like certainty when addressing such questions.

Third, when gathering data and assessing evidence, we have to move beyond procedural ideas such as preregistration and the simple statistical idea of increasing N, and think about design and data quality linked to theoretical understanding and real-world goals.

I’ve put that last paragraph in bold, as perhaps it will be of the most relevance to many of you who don’t study psychotherapy but are interested in experimental science.

Tomorrow’s post: What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?


  1. jim says:

    “Third, when gathering data and assessing evidence, we have to move beyond procedural ideas such as preregistration and the simple statistical idea of increasing N, and think about design and data quality linked to theoretical understanding and real-world goals.”


    Well said, Andrew!

    • Shravan Vasishth says:

      Preregistration and sample size, vs. thinking about design and data quality: these don’t seem like either-or propositions. As an experimentalist, I care about all of the above. There are many situations where pre-registration is not relevant. But that doesn’t mean we should move beyond it; when it makes sense to pre-register, it makes sense.

      • Martha (Smith) says:

        Reading Shravan’s comment and then looking back at Jim’s:

        I think there may be two different interpretations of “moving beyond” going on:

        I read Jim’s quote from Andrew (“we have to move beyond procedural ideas such as preregistration and the simple statistical idea of increasing N, and think about design and data quality linked to theoretical understanding and real-world goals”) as saying that procedural ideas such as preregistration and increasing sample size are not enough — that we *also* need to “think about design and data quality linked to theoretical understanding and real-world goals”.

        But Shravan seems to have interpreted “moving beyond” as leaving “procedural ideas such as preregistration and the simple statistical idea of increasing N” behind.

        • Shravan says:

          Andrew’s comment here is being used on twitter as support for this paper:

          Preregistration is definitely not redundant in some of my work. The fact that preregistration is not a magic bullet doesn’t mean all the other things mentioned in the paper are not important.

          I didn’t follow all the comments on twitter about the preregistration paper linked above (i was put off by the clickbait title), but my impression (someone will correct me if i am wrong) was that there is an actively anti-prereg crowd. Andrew could write his comments in more moderate and clear language, as the ambiguity Martha points to above can be used to find support for the extremist view implied by the title of the above paper.

          This kind of ambiguity in Andrew’s writing has arisen in the past, in connection with log transformation. It caused at least me a lot of grief because people tried to force me to analyze reading time data on the raw scale because Andrew said so. Now ppl will say Andrew is anti pre-reg (which he is not).

          • Shravan Vasishth says:

            Some examples where pre-registration was very useful for me:

            1. Here is an example of a paper (in which Andrew was also co-author), where we ran six experiments un-preregistered, and then the decisive seventh experiment was pre-registered. The theory behind the studies is very solid, it’s just that the data are noisy (the problem that Andrew is stressing in his blog). The sample sizes are absurdly small, but, echoing Andrew’s point above, the design was terrible (even Germans can’t understand those sentences), and the effect is likely to be very variable from study to study anyway.


            In the pre-registered seventh experiment, the claimed effect was not reproduced.

            2. Here is another paper (also on predictability in reading) where the claim was theoretically well-grounded, but the outcomes from previous studies are highly variable and controversial. The lead author pre-registered the confirmatory analyses, none of them showed anything conclusive. This does not prevent us from exploring the data in a second step, and that is what we did:


            3. Here is a paper where we didn’t pre-register anything, but rather did an exploratory analysis with n subjects, and then froze the analysis plan and collected another n subjects’ data, and then did no more exploration.


            4. Finally, here is a paper in which we did not pre-register anything, we just re-did a published study with a larger sample size, and stuck as far as possible to the original procedure reported in the original paper. We failed to find the patterns in the dependent measure that the original study claimed were meaningful. Here, too, the theory is solid.


            So, I am a fan of pre-registration but whether it’s the right instrument to use or not is a function of the type of problem one is studying and the specifics of the particular situation one finds oneself in. E.g., in example 4 above, the experiment was run in a different lab than our own, it would be hard to control every detail of the implementation of the experiment.

            My main job is computational modeling, we just do experiments to understand whether model predictions turn out right or not. I find exploratory analyses very useful and something to be encouraged, in modeling but also in data analysis.

            But I am getting annoyed with extremist positions that are at least implied if not logically entailed by statements like “pre-registration is redundant, at best” (the title of the paper I linked to above). It would be great if Andrew didn’t (accidentally) feed into that extremist view. Everything has its place; pre-registration is an amazingly important tool, but you have to understand what it is for and what it does not and cannot achieve.

          • Andrew says:


            I’m not anti preregistration. I think preregistration is great. I’ve even done it twice in my own work!

            But I don’t see the problem with the linked paper, except for its dramatic title (“Preregistration is redundant, at best”). I don’t think preregistration is redundant. But if you read the article itself, not just the title, they say reasonable things:

            There is . . . little reason to expect that preregistration will spontaneously help researchers to develop better theories (and, hence, better methods and analyses).

            Solving statistical problems with preregistration does not compensate for weak theory.

            There is nothing inherently problematic about post-hoc scientific inference when theories are strong.

            Because statistical inference is so often used to inform scientific inference, it is easy to conflate the two. However, they have fundamentally different aims.

            I pretty much agree with everything in that paper, except the title claim that preregistration is redundant. The funny thing is, I don’t think the paper ever gives a good justification for its title. Here’s the entire section where this redundancy thing comes up:

            Preregistration is redundant, and potentially harmful

            Preregistration can be hard, but it can also be easy. The hard work associated with good theorizing is independent of the act of preregistration. What matters is the strength of the theory upon which scientific inference rests. Preregistration does not require that the underlying theory be strong, nor does it discriminate between experiments based on strong or weak theory.

            Taking preregistration as a measure of scientific excellence can be harmful, because bad theories, methods, and analyses can also be preregistered. Requiring or rewarding the act of preregistration is not worthwhile when its presumed benefits can be achieved without it just as well. As a field, we have to grapple with the difficult challenge of improving our ability to test theories, and so should be wary of any ‘catch-all’ solutions like preregistration.

            There are two points here:

            1. “Preregistration is redundant.” They say this, but I don’t see their justification for this claim. I agree that “Preregistration does not require that the underlying theory be strong, nor does it discriminate between experiments based on strong or weak theory,” but I don’t see how this makes preregistration “redundant.”

            2. Preregistration is “potentially harmful.” I guess so, in the sense that any proposed innovation is potentially harmful if it leads to complacency. Seat belts are potentially harmful if they lead you to drive much more dangerously, a warm coat potentially harmful if it leads you to spend too much time outside and then you get frostbite, etc. I agree that preregistration, if oversold, can be harmful in the same way that other research methods, if oversold, can be harmful.

            To put it another way, you could swap out the word “preregistration” everywhere in that article and replace it with “random sampling” or “random assignment of treatments,” and all the claims would still hold. Random sampling and random assignment of treatments can give us bulletproof inferences in some cases, and can motivate more careful data collection even when they are only approximations, but they’re no substitute for good theory and good data.

            • Shravan says:

              Yes, i agree with everything you say here. Thanks for spelling this out. My problem with that paper is just the clickbait title; in my experience the title is all one internalizes. Their discussion in the paper is all reasonable. The title, not so much.

  2. Terry says:

    This topic is very interesting, very important, and the authors seem to have done a lot of valuable work.

    But I don’t understand the results. Granted, this isn’t my area of expertise, but I’m capable of understanding this stuff if I try.

    Perhaps the study is just so big and covers so much that it can’t be made readable to someone like me. Or perhaps there is a way to make Table 2 more intuitive. The wide ranges made it particularly hard to get a grip on the results. Maybe too many results are clumped together in Table 2. Perhaps each entry could be expanded and made more intuitive. But then it wouldn’t be a single table, but more of a voluminous appendix or an encyclopedia.

  3. Erling says:

    I think calling for better measurement is very spot on for the psychotherapy research field. Speaking from experience, increasing N is very hard. Patients with the right disorder, consenting to participate, these are necessarily not available in great numbers in a restricted sampling time (even when grants and approvals last several years) and a catchment area with reasonable distances to the clinic. Multisite trials are obviously a good idea, but come with their own challenges.

    Given the state of measurement and analysis in this field, brute-forcing it with higher N is not the most viable path to better evidence, I believe. Measurement is plagued by strong pressure to use the same instruments as previous research, even though psychometric problems such as low reliability for measurement of change are well known for many of these.

    And there have long been calls for studying what works for whom, but average effects still rule in what is reported, although heterogeneity is rampant, and clinical experience suggests this is a major issue.

    (As a clinical psychologist doing psychotherapy research using Stan, I really looked forward to this post, by the way).

  4. Martha (Smith) says:

    A quote from the “conclusion” section of the paper:

    “When strong evidential support is lacking for all ESTs of a given diagnosis (e.g., Anorexia Nervosa), psychologists working with such patients should consider increasing their use of research-based assessments to track therapeutic benefit or the lack thereof.”

    Duh. Isn’t this common sense and basic ethical practice? — “First do no harm?”

  5. Wonks Anonymous says:

    Robyn Dawes wrote in “House of Cards” that the effectiveness of therapy had no apparent connection to the type or length of the therapy, or of the credentials of the therapist:

  6. Geoff says:

    Sakaluk is doing some good work (from Sakaluk’s webpage):
    “…we are studying what relational factors enable couples to work together more effectively, and with less stress, as they try to solve a complicated relational challenge: IKEA furniture assembly”

  7. Erikson says:

    The idea of “What works for whom” is widely debated in empirical research in psychotherapy. Many researchers (as with APA’s ESTs) argue in favor of “specific effects” that should hold between specific treatments and psychopathologies. Others (like Wampold) argue in favor of the “Dodo bird verdict”: that effects are reasonably homogeneous between treatments and conditions and that there are no systematic differences between specific psychotherapies. Regardless of whether you believe in the Dodo bird verdict as absolute, it is quite baffling how hard it is to identify systematic (and replication-proof) differences between psychotherapies. Maybe we shouldn’t think about and evaluate psychotherapy as we do medical treatments (specific agents for specific pathologies).

    In part, though, this complicated state of affairs is due to the issues Andrew points to in his post. First, design: what is an unquestionable placebo control group against which to compare the effects of psychotherapies? How to deal with the fact that, in practice, most psychotherapies are tailored to the patient and are not easily manualized (nor are the results of manualized treatment easily translated to real-world practice)? In many studies, data are collected at many points (pre-treatment, mid-treatment, post-treatment, plethora of follow-up windows) but the analysis is usually summarized in those terrible tabular asterisks or filtered by some sort of significance criterion, as the paper already indicates. Hardly ever do we see a theoretically motivated model, just your run-of-the-mill linear model, sometimes applied individually for each comparison (t-tests galore!). The problem is due, in part, to following the medical RCT guidelines without proper adaptation to the reality of psychotherapy.

    But the issue of measurement is particularly problematic. Most ‘measurement’ in those trials (as in most psychology) is the sum-score of some self-reported or interview questionnaires. Although there is a whole host of literature supporting those models of measurement (mostly based on Classical Test Theory or Item Response Theory), they are based on the assumption of an unobservable latent variable which reflects its effects on (locally independent) statements about cognition, affect and relations. The very idea of proper existence of those latent variables is not properly tested, in most cases, and we just assume that the number obtained from the scale or questionnaire is a useful summary proxy of what we would like to measure. But it makes it hard to think about clinical judgement: what does it really mean to go down a few points in a depression scale? Or can we really treat ‘depression’ as a single, continuous latent variable (or as a bunch of different continuous latent variables, as some research ‘discovers’ with the mindless application of factor analysis)?

    Indeed, the question remains: how can we better design trials suited to the reality of psychotherapy, instead of mimicking medical RCTs? And how can we develop better measurements in a context where (usually empirically unsupported) latent variable models abound? Increasing N in this situation will only give us the illusion of certainty about biased results.

    • Kyle C says:

      Great summary. As a former patient and not an expert, I would only add that, to some non-trivial extent, cognitive behavioral therapy teaches people how to answer a depression questionnaire. Certain responses are “wrong.” It has always seemed to me especially problematic to “assume that the number obtained from the scale or questionnaire is a useful summary proxy of what we would like to measure” for CBT.

      • Erling says:

        I’d say that CBT that teaches patients that particular thoughts are “wrong” is badly conducted CBT – and while that certainly happens, it’s probably less representative of the CBT conducted in trials than that conducted in regular practice. Context of treatment, level of training and supervision, these things can differ greatly between research and general clinics, and of course compromise generalisability of trial findings.

        Your point still stands, though, reductions in simple sumscores are inadequate outcome measures. Remission and response proportions may be better, as well as properly standardised and calibrated measures (of which there are few).

        • Erikson says:

          Erling, you mentioned you are a researcher using Stan in this context! Would you mind sharing a little bit about your work?

          • Erling says:

            Oh, thanks for asking! It’s neither very much nor very impressive, I’m afraid, but I think it has been considerably improved over the usual practice in the field by using Stan and PSIS-LOO.

            I’m currently finishing a paper reporting on a planned moderator analysis from a small trial comparing a family-based treatment for adolescent depression to treatment as usual. In this case, the patients were recruited among regular referrals to a child and adolescent mental health service, so they were definitely quite ill. The main hypothesis of the trial was that the family-based treatment would be superior to treatment as usual (this was not supported). The moderator hypothesis was that the difference in outcome between treatment as usual and the family-based treatment would vary by level of parent-adolescent conflict (a negative predictor of outcome in multiple other trials), with a larger difference in favour of the family-based treatment at high levels of parent-adolescent conflict.

            We compare a set of nested multilevel models of outcome fitted with Stan, using PSIS-LOO, with moderator models compared to simpler models of effects of time and treatment allocation only. We also embedded an IRT model of the measure used for assessing parent-adolescent conflict, using the latent variables in the moderator models rather than a sumscore, which is something we’d never be able to do without Stan. Our results give some support to moderation by mother-adolescent conflict, but with a lot of uncertainty. It’s not something we recommend changing clinical practice over, but rather supporting further investigation of parent-adolescent conflict as a moderator of treatment outcomes in adolescent depression.

            There’s a lot to be said for the quality of the trial (I was not the PI, but know underfinanced trials are hard to do well) but I find analysing the results using Stan and in a Bayesian framework helps to extract what little has been learned, and yet keep a level head about the uncertainty.

  8. Martha (Smith) says:

    I asked someone who had some graduate training in psychotherapy what she thought of this post. She pointed out a couple of other problems with psychotherapy trials that I found pretty shocking:

    Research subjects are often university undergraduates — very far from a random sample of the population to which the results would be applied.

    In comparative trials, protocols for conducting therapy are often designed to give an advantage to a “favored” method (i.e., therapists in the less favored method might be forbidden from doing something that would be considered standard good practice.)

  9. John Sakaluk says:

    Appreciate you discussing the article here, Andrew! 

    Fitting everything we did into the constraints of a short report was a major challenge, and I think the communication of what it all means has suffered, to a degree, as a result. In essence, we reanalyzed tests that Div. 12 cited in support of psychotherapies it deemed “empirically supported” to some degree or another. We were curious, “What does ‘empirically supported’ actually mean?” in terms of some of the metrics folks are looking at when appraising the quality of a study. Are the results accurately reported?; are they informed by an appreciable amount of data?; are the results plausible?; etc. Underlying the whole investigation is our perspective that when one hears “The efficacy of Therapy ‘X’ has **Strong Empirical Support**”, intuitively, that person probably conjures to mind a group of studies that yielded a lot of data, and which have been rigorously scrutinized for accuracy and empirical consistency with signs of efficacy. Maybe different folks weight the importance of some of these elements higher/lower than others, but we think it is a face-valid selection of the breadth of features that contribute to an intuitive sense of “strong evidence”.

    Our intention here, I think, is not to presume to adjudicate which therapies work, and which don’t. But we do think it is interesting/important that our review finds therapies with no misreported tests, supported by studies containing massive amounts of data, plausible rates of statistically significant effects, and strong evidence of efficacy (at least according to Bayes Factors) housed under the same rhetorical umbrella of “Strong Empirical Support” as therapies with  frequent and substantial misreported tests, small amounts of data, implausibly frequent statistically significant effects, and weak (or null-supporting) Bayes factors. 

    Your other points are well-taken though. There’s a lot of provocative measurement work being done in the realm of clinical psych (Eiko Fried’s work comes to mind), that I think calls into question a lot of standard practices of conceptualizing and assessing psychopathology. Not to dismiss the importance of basic measurement work, but I think our perspective would be that whatever the conceptualization/measurement of mental health might be, we position ourselves to learn more when results are reported accurately, and informed by more data, and that bare-minimum features like that should be scrutinized before attaching rhetorical/professional labels to increase researcher/clinician/consumer confidence.
