What can be learned from this study?

James Coyne writes:

A recent article co-authored by a leading mindfulness researcher claims to address the problems that plague meditation research, namely, underpowered studies; lack of meaningful control groups; and an exclusive reliance on subjective self-report measures, rather than measures of the biological substrate that could establish possible mechanisms.

The article claims adequate sample size, includes two active meditation groups and three control groups, and relies on a seemingly sophisticated strategy for statistical analysis. What could possibly go wrong?

I think the study is underpowered to detect meaningful differences between active treatment and control groups. The authors haven’t thought out precisely how to use the presence of multiple control groups. They rely on statistical significance as the criterion for the value of the meditation groups. But when it comes to a reckoning, they avoid the inevitably nonsignificant results that would occur in comparisons of changes over time in active versus control groups. Instead they substitute within-group analyses and note whether the results are significant for the active treatments but not the control groups.

The article does not present power analyses but simply states that “a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007).”

There are five groups, representing two active treatments and three control groups. That means that all the relevant action depends on group by time interaction effects in pairs of active treatment and control groups, with 27 participants in each cell.

I have seen a lot of clinical trials in psychological interventions, but never one with two active treatments and three control groups. In the abstract it may seem interesting, but I have no idea what research questions would be answered by this constellation. I can’t even imagine planned comparisons that would follow up on an overall treatment (5) by time interaction effect.

The analytic strategy was to examine whether there is an overall group by time interaction effect and then to examine within-group, pre/post differences for particular variables. When these within-group differences are statistically significant for an active treatment group, but not for the control groups, it is considered confirmation of the hypothesis that meditation is effective with respect to certain variables.

When there are within-group differences for both psychological and biological variables, it is inferred that the evidence is consistent with the biological changes underlying the psychological changes.

There are then mediational analyses that follow a standard procedure: construction of a zero-order correlation matrix; calculation of residual change scores for each individual, with creation of dummy variables for four of the groups contrasted against the neutral control group. Simple mediation effects were then calculated for each psychological self-report variable, with group assignment as the predictor variable and the physiological variable as the mediator.

I think these mediational analyses are a wasted effort because of the small number of subjects exposed to each intervention.
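
To make the procedure Coyne describes concrete, here’s a minimal sketch of a simple mediation analysis of that general form. The data and the variable names (treated, physio, mood) are fake placeholders, not the study’s actual measures:

```python
# Minimal sketch of a simple mediation analysis, Baron-Kenny style:
# regress the mediator on treatment (path a), then the outcome on
# mediator plus treatment (path b). All data here are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 135
df = pd.DataFrame({"treated": np.repeat([0.0, 1.0], [108, 27])})
df["physio"] = 0.3 * df["treated"] + rng.normal(0, 1, n)  # candidate physiological mediator
df["mood"] = 0.4 * df["physio"] + rng.normal(0, 1, n)     # self-report outcome

a = smf.ols("physio ~ treated", df).fit().params["treated"]        # path a
b = smf.ols("mood ~ physio + treated", df).fit().params["physio"]  # path b
print(f"indirect (mediated) effect a*b = {a * b:.3f}")
```

With only 27 people in a treated cell, the sampling error in both paths is large, which is the basis of the “wasted effort” worry above.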

At this point I would usually read the article, perhaps make some calculations, read some related things, figure out my general conclusions, and then write everything up.

This time I decided to do something different and respond in real time.

So I’ll give my response, labeling each step.

1. First impressions

The article in question is Soothing Your Heart and Feeling Connected: A New Experimental Paradigm to Study the Benefits of Self-Compassion, by Hans Kirschner, Willem Kuyken, Kim Wright, Henrietta Roberts, Claire Brejcha, and Anke Karl, and it begins:

Self-compassion and its cultivation in psychological interventions are associated with improved mental health and well-being. However, the underlying processes for this are not well understood. We randomly assigned 135 participants to study the effect of two short-term self-compassion exercises on self-reported-state mood and psychophysiological responses compared to three control conditions of negative (rumination), neutral, and positive (excitement) valence. Increased self-reported-state self-compassion, affiliative affect, and decreased self-criticism were found after both self-compassion exercises and the positive-excitement condition. However, a psychophysiological response pattern of reduced arousal (reduced heart rate and skin conductance) and increased parasympathetic activation (increased heart rate variability) was unique to the self-compassion conditions. This pattern is associated with effective emotion regulation in times of adversity. As predicted, rumination triggered the opposite pattern across self-report and physiological responses. Furthermore, we found partial evidence that physiological arousal reduction and parasympathetic activation precede the experience of feeling safe and connected.

My correspondent’s concern was that the sample size was too small . . . let’s look at that part of the paper:

We recruited a total of 135 university students in the United Kingdom (27 per experimental condition . . .)

OK, so yes I’m concerned. 27 seems small, especially for a between-person design.

But is N really too small? It depends on effect size and variation.
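
As a quick back-of-the-envelope check (my own sketch, not anything from the paper), here’s the power of a plain two-group comparison with 27 per group, at conventional effect sizes:

```python
# Approximate power for a two-sample t-test with 27 people per group,
# at conventional "small" / "medium" / "large" effect sizes. The paper's
# group-by-time interaction contrasts will generally be noisier still.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.2, 0.5, 0.8]:
    power = analysis.solve_power(effect_size=d, nobs1=27, alpha=0.05)
    print(f"effect size d = {d}: power = {power:.2f}")
```

Under these assumptions, power is only around 0.45 for a so-called medium effect of d = 0.5, so the design is riding on the effects being large.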

Let’s look at the data.

Here are the basic data summaries:

I think these are averages: each dot is the average of 27 people.

The top four graphs are hard to interpret: I see there’s more variation after than before, but beyond that I’m not clear what to make of this.

So I’ll focus on the bottom three graphs, which have more data. The patterns seem pretty clear, and I expect there is a high correlation across time. I’d like to see the separate lines for each person. That last graph, of skin conductance level, is particularly striking in that the lines go up and then down in synchrony.

What’s the story here? Skin conductance seems like a clear enough outcome, even if not of direct interest it’s something that can be measured. The treatments, recall, were “two short-term self-compassion exercises” and “three control conditions of negative (rumination), neutral, and positive (excitement) valence.” I’m surprised to see such clear patterns from these treatments. I say this from a position of ignorance; just based on general impressions I would not have known to expect such consistency.

2. Data analysis

OK, now we seem to be going beyond first impressions . . .

So what data would I like to see to understand these results better? I like the graphs above, and now I want something more that focuses on treatment effects and differences between groups.

To start with, how about we summarize each person’s outcome by a single number. I’ll focus on the last three outcomes (e, f, g) shown above. Looking at the graphs, maybe we could summarize each by the average measurement during times 6 through 11. So, for each outcome, I want a scatterplot. Let y_i be person i’s average outcome during times 6 through 11, and x_i is the outcome at baseline. For each outcome, let’s plot y_i vs x_i. That’s a graph with 135 dots, you could use 5 colors, one for each treatment. Or maybe 5 different graphs, I’m not sure. There are three outcomes, so that’s 3 graphs or a 3 x 5 grid.
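
Here’s a rough sketch of one such scatterplot; the group labels follow the paper’s condition names, but the data below are simulated placeholders:

```python
# Sketch of one of the proposed scatterplots: each dot is a person,
# baseline on x, the average of times 6 through 11 on y, colored by
# condition. The numbers are fake; real data would be read in instead.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = ["CBS", "LKM-S", "rumination", "neutral", "positive"]

df = pd.DataFrame({
    "group": np.repeat(groups, 27),
    "baseline": rng.normal(0, 1, 135),
})
df["followup"] = 0.8 * df["baseline"] + rng.normal(0, 0.5, 135)  # fake y_i

fig, ax = plt.subplots(figsize=(5, 5))
for group, sub in df.groupby("group"):
    ax.scatter(sub["baseline"], sub["followup"], label=group, s=15)
ax.set_xlabel("outcome at baseline (x_i)")
ax.set_ylabel("average outcome, times 6 through 11 (y_i)")
ax.legend()
plt.show()
```

Repeating this for each of the three outcomes gives the 3-graph version; faceting by condition gives the 3 x 5 grid.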

I’d also suggest averaging the three outcomes for each person so now there’s one total score. Standardize each score and reverse-code as appropriate (I guess that in this case we’d flip the sign of outcome f when adding up these three). This would be the clear summary we’d need.
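
And a sketch of that composite score, with e, f, g as placeholder names for each person’s three summarized outcomes:

```python
# Composite score: standardize each of the three outcome summaries,
# flip the sign of outcome f (the reverse-coding guessed at above),
# and average to get one total score per person.
import numpy as np
import pandas as pd

def composite_score(df: pd.DataFrame) -> pd.Series:
    z = (df - df.mean()) / df.std()  # standardize each column
    z["f"] = -z["f"]                 # reverse-code outcome f
    return z.mean(axis=1)            # one total score per person

# Demo on fake data; real input would be each person's summaries.
demo = pd.DataFrame(np.random.default_rng(1).normal(size=(135, 3)),
                    columns=["e", "f", "g"])
print(composite_score(demo).head())
```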

I have the luxury of not needing to make a summary judgment on the conclusions, so I’ll just say that I’d like to see some scatterplots before going forward.

3. Other impressions

The paper gives a lot of numerical summaries of this sort:

The Group × Time ANOVA revealed no significant main effect of group, F(4, 130) = 1.03, p > .05, ηp2 = .03. However, the Time × Group interaction yielded significance, F(4, 130) = 24.46, p < .001, ηp2 = .43. Post hoc analyses revealed that there was a significant pre-to-post increase in positive affiliative affect in the CBS condition, F(1, 26) = 10.53, p = .003, ηp2 = .28, 95% CI = [2.00, 8.93], the LKM-S condition, F(1, 26) = 26.79, p < .001, ηp2 = .51, 95% CI = [5.43, 12.59], and, albeit smaller, for the positive condition, F(1, 26) = 6.12, p = .020, ηp2 = .19, 95% CI = [0.69, 7.46]. In the rumination condition there was a significant decrease in positive affiliative affect after the manipulation, F(1, 26) = 38.90, p < .001, ηp2 = .60, 95% CI = [–18.79, –9.48], whereas no pre-to-post manipulation difference emerged for the control condition, F(1, 26) = .49, p = .486, ηp2 = .01, 95% CI = [–4.77, 2.33]. Interestingly, an ANCOVA (see Supplemental Material) revealed that after induction, only individuals in the LKM-S condition reported significantly higher positive affiliative affect than those in the neutral condition, and individuals in the rumination condition reported significantly lower positive affiliative affect.

This looks like word salad—or, should I say, number salad—and full of forking paths. Just a mess, as it’s some subset of all the many comparisons that could be performed. I know this sort of thing is standard data-analytic practice in many fields of research, so it’s not like this paper stands out in a bad way; still, I don’t find these summaries to be at all helpful. I’d rather do a multilevel model.
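
For concreteness, here is one minimal way such a multilevel model could be set up, with fake long-format data and placeholder column names (in practice I’d want a fuller Bayesian version, fit in something like Stan):

```python
# Sketch of a multilevel alternative to the battery of ANOVAs: long-format
# data (one row per person per time point), a random intercept for each
# subject, and time-by-group interactions carrying the effects of interest.
# The data here are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
groups = ["CBS", "LKM-S", "rumination", "neutral", "positive"]
long_df = pd.DataFrame({
    "subject": np.repeat(np.arange(135), 2),
    "group": np.repeat(np.repeat(groups, 27), 2),
    "time": np.tile([0.0, 1.0], 135),
})
long_df["score"] = rng.normal(0, 1, len(long_df))  # fake outcome

model = smf.mixedlm("score ~ time * group", data=long_df,
                    groups=long_df["subject"])
print(model.fit().summary())
```

The point is that all the group trajectories get estimated within one model, with partial pooling, rather than as a pile of separate within-group significance tests.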

And then there’s this:

No way. I’m not even gonna bother with this.

The paper concludes with some speculations:

Short-term self-compassion exercises may exert their beneficial effect by temporarily activating a low-arousal parasympathetic positive affective system that has been associated with stress reduction, social affiliation, and effective emotion regulation

Short-term self-compassion exercises may exert their beneficial effect by temporarily increasing positive self and reducing negative self-bias, thus potentially addressing cognitive vulnerabilities for mental disorders

I appreciate that the authors clearly labeled these as speculations, possibilities, etc., and the paper’s final sentences were also tentative:

We conclude that self-compassion reduces negative self-bias and activates a content and calm state of mind with a disposition for kindness, care, social connectedness, and the ability to self-soothe when stressed. Our paradigm might serve as a basis for future research in analogue and patient studies addressing several important outstanding questions.

4. Was the sample size too small?

The authors write:

Although the sample size in this study was based on a priori power calculation for medium effect sizes in mixed measures ANOVAs and the recruitment target was met, a larger sample size may have been desirable. Overall, a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007). However, some of the effects were small-to-medium rather than medium and failed to reach significance, and thus a replication in a larger sample is warranted to check the robustness of our effects.

This raises some red flags to me, as it’s been my impression that real-life effects in psychology experiments are typically much smaller than what are called “medium effect sizes” in the literature. Also I think the above paragraph reveals some misunderstanding about effect sizes in that the authors are essentially doing post-hoc power analysis, not recognizing the high variability in effect size estimates; for more background on this point, see here and here.
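
A quick simulation (my own, under simple normal assumptions) shows how noisy effect-size estimates are at this sample size, which is why feeding the observed effect size back into a power calculation is hazardous:

```python
# With 27 per group, the estimated standardized effect d bounces around
# its true value with a standard error near 0.28, so an observed "medium"
# effect is compatible with anything from near zero to large.
import numpy as np

rng = np.random.default_rng(0)
n, true_d = 27, 0.5
est = np.empty(10_000)
for i in range(est.size):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    est[i] = (b.mean() - a.mean()) / pooled_sd
print(f"true d = {true_d}: estimate sd = {est.std():.2f}, "
      f"90% interval = [{np.percentile(est, 5):.2f}, {np.percentile(est, 95):.2f}]")
```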

The other point I want to return to is the between-person design. Without any understanding of this particular subfield, I’d recommend a within-person study in the future, where you try multiple treatments on each person. If you’re worried about poisoning the well, you could do different treatments on different days.
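
The gain from a within-person design is easy to quantify under simple assumptions (the numbers below are illustrative only): if a person’s responses to two treatments are correlated, the standard error of the estimated treatment contrast shrinks accordingly.

```python
# Standard error of a treatment contrast: two independent groups of n
# people versus n people each receiving both treatments, with
# within-person correlation rho. Illustrative numbers only.
import numpy as np

n, sigma = 27, 1.0
between_se = sigma * np.sqrt(2 / n)  # two independent groups of n
for rho in [0.0, 0.5, 0.8]:
    within_se = sigma * np.sqrt(2 * (1 - rho) / n)  # paired design
    print(f"rho = {rho}: between SE = {between_se:.3f}, within SE = {within_se:.3f}")
```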

Speaking more generally, I’d like to move the question away from sample size and toward questions of measurement. Beyond the suggestion to perform multiple treatments on each person, I’ll return to my correspondent’s questions at the top of this post, which I can’t really evaluate myself, not knowing enough about this area.

8 thoughts on “What can be learned from this study?”

  1. “I have seen a lot of clinical trials in psychological interventions, but never one with two active treatments and three control groups. In the abstract it may seem interesting, but I have no idea what research questions would be answered by this constellation. I can’t even imagine planned comparisons that would follow up on an overall treatment (5) by time interaction effect.”

    ???

    This kind of thing makes me sad. Statistics isn’t about calculating the average differences between two groups, statistics is, or at least should be, about *inference about unknowns*.

  2. “The article claims adequate sample size, includes two active meditation groups and three control groups, and relies on a seemingly sophisticated strategy for statistical analysis. What could possibly go wrong?”

    It is funny because I read this and think it is designed to check whether differences exist between groups, which is a total waste of time. So funny to see “what could go wrong” when the entire purpose is wrong.

    “I think these are averages: each dot is the average of 27 people.”

    The curves looked interesting (it would seem fruitful to find some theoretical explanations that could reproduce the shape), but we need to see the individual ones. Average curves can introduce all sorts of misleading artifacts; this was discussed a few years ago on this blog: https://statmodeling.stat.columbia.edu/2015/10/05/cognitive-skills-rising-and-falling/
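
    Here’s a toy simulation of that artifact: sharp individual step changes at different onset times average into a smooth, gradual curve that describes no individual.

    ```python
    # Toy demonstration: 27 fake individuals each jump abruptly from 0 to 1
    # at a random onset time; the average across people is a smooth ramp
    # that matches none of the individual curves.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 200)
    curves = np.array([np.where(t < rng.uniform(2, 8), 0.0, 1.0)
                       for _ in range(27)])

    plt.plot(t, curves.T, color="gray", alpha=0.3)         # individual curves
    plt.plot(t, curves.mean(axis=0), color="black", lw=2,  # misleading average
             label="average of 27 individuals")
    plt.legend()
    plt.show()
    ```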

  3. A non-scientific comment on skin conductance:

    The Pacific Science Center, many years ago, had a little exhibit which consisted of a skin conductance sensor and a toy windmill. If you could change (I don’t recall anymore if it was raise or lower) your skin conductance, you could make the windmill spin. My spouse and I both figured out how to do this in a few seconds. The person behind us in line was utterly baffled: he not only couldn’t do it but had no idea how to even try, nor could we explain.

    In retrospect I believe we were using very basic meditative techniques: slow and deepen your breathing, relax your abdominal muscles. I’m not surprised you’d get something on this variable in a meditation study.

    • That’s interesting, and it leads to another interesting point that Andrew’s correspondent (Coyne) often raises — which is the difficulty of designing an adequate control group for a “meditation” study. (Actually researchers don’t even try.) You and your spouse weren’t “meditating” per se but you created one of the effects. It’s equally possible that some or all of the benefits of a “meditation class” can be gained just by going to a class with a soothing teacher and some other friendly people once or twice a week, without actually “meditating.” This is never studied because that’s not what the people in the control arms do. It’s doubtful that even a massive study with a control condition as weak as they usually are would really teach us anything.

      Coyne likes to point to another study the results of which suggested that tango lessons are more effective in lessening depression than meditation is, which at least was an interesting result and shouldn’t surprise us.

  4. Thanks, Andrew and commentators, for a fascinating discussion. I was validated in my own thinking and learned a lot from Andrew’s outsider perspective. My approach to a study like this is to keep in mind that it is an RCT, and so we have to be less interested in results obtained in the intervention group than in whether there is any difference over time between the intervention and control groups. Absent a time x treatment interaction, we can’t be sure that what was happening in the intervention group wasn’t due to uncontrolled nonspecific factors contaminating the intervention group. In this case, the authors simply ignored differences between intervention and control groups over time, bringing the comparison up only when it supported their bias in looking at a particular variable and differences between a selected intervention group and a particular control group. Andrew nails the results section with his ‘word salad’ dismissal.
