You should (usually) log transform your positive data

The reason for log transforming your data is not to deal with skewness or to get closer to a normal distribution; that’s rarely what we care about. Validity, additivity, and linearity are typically much more important.

The reason for log transformation is that in many settings it should make additive and linear models make more sense. A multiplicative model on the original scale corresponds to an additive model on the log scale. For example, consider a treatment that increases prices by 2%, rather than a treatment that increases prices by $20. The log transformation is particularly relevant when the data vary a lot on the relative scale. Increasing prices by 2% has a much different dollar effect for a $10 item than a $1000 item. This example also gives some sense of why a log transformation won’t be perfect either, and ultimately you can fit whatever sort of model you want. But, as I said, in most cases I’ve seen of positive data, the log transformation is a natural starting point.
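Here is a minimal sketch of the price example. The three prices are made up for illustration; the only thing taken from the text is the 2% multiplicative effect. On the dollar scale the effect differs by item, while on the log scale it is the same constant for every item.

```python
import math

# Hypothetical prices, chosen only for illustration.
prices = [10.0, 100.0, 1000.0]
treated = [p * 1.02 for p in prices]  # a 2% increase applied to every item

# On the original (dollar) scale the effect depends on the item's price.
dollar_effects = [t - p for p, t in zip(prices, treated)]

# On the log scale the effect is the same additive shift, log(1.02), for every item.
log_effects = [math.log(t) - math.log(p) for p, t in zip(prices, treated)]

print("dollar effects:", dollar_effects)
print("log-scale effects:", log_effects)
```

An additive model with a single treatment coefficient fits the log-scale numbers exactly, but it cannot fit the dollar-scale numbers without an interaction with price.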

The above is all background; it’s stuff that we’ve all said many times before.

What’s new to me is this story from Shravan Vasishth:

You’re routinely being cited as endorsing the idea that model assumptions like normality are the least important of all in a linear model:

This statement of yours is not meant to be a recommendation to NHST users. But it is being misused by psychologists and psycholinguists in the NHST context to justify analyzing untransformed all-positive dependent variables and then making binary decisions based on p-values. Could you clarify your point in the next edition of your book?

I just reviewed a paper in JML (where we published our statistical significance filter paper) by some psychologists that insist that all data be analyzed using untransformed reaction/reading times. They don’t cite you there, but threads like the one above do keep citing you in the NHST context. I know that on p 15 of Gelman and Hill you say that it is often helpful to log transform all-positive data, but people selectively cite this other comment in your book to justify not transforming.

There are data-sets where 3 out of 547 data points drive the entire p<0.05 effect. With a log transform there would be nothing to claim and indeed that claim is not replicable. I discuss that particular example here.

I responded that (a) I hate twitter, and (b) In the book we discuss the importance of transformations in bringing the data closer to a linear and additive model.

Shravan threw it back at me:

The problem in this case is not really twitter, in my opinion, but the fact that people . . . read more into your comments than you intended, I suspect. What bothers me is that they cite Gelman as endorsing not ever log-transforming all-positive data, citing that one comment in the book out of context. This is not the first time I saw the Gelman and Hill quote being used. I have seen it in journal reviews in which reviewers insisted I analyze data on the untransformed values.

I replied that is really strange given that in the book we explicitly discuss log transformation.

From page 59:

It commonly makes sense to take the logarithm of outcomes that are all-positive.

From page 65:

If a variable has a narrow dynamic range (that is, if the ratio between the high and low values is close to 1), then it will not make much of a difference in fit if the regression is on the logarithmic or the original scale. . . . In such a situation, it might seem to make sense to stay on the original scale for reasons of simplicity. However, the logarithmic transformation can make sense even here, because coefficients are often more easily understood on the log scale. . . . For an input with a larger amount of relative variation (for example, heights of children, or weights of animals), it would make sense to work with its logarithm immediately, both as an aid in interpretation and likely an improvement in fit too.
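The point about dynamic range can be checked with a quick sketch. The two ranges below are hypothetical, not from the book: one has a high-to-low ratio of 1.1, the other a ratio of 1000. When the ratio is close to 1, log(x) is almost a linear function of x, so the choice of scale barely affects fit; with a wide range the two scales are quite different.

```python
import math

def pearson(xs, ys):
    """Pearson correlation, written out so the sketch needs only the standard library."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def grid(low, high, n=50):
    """Evenly spaced values from low to high, inclusive."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

narrow = grid(100.0, 110.0)  # hypothetical variable, high/low ratio of 1.1
wide = grid(1.0, 1000.0)     # hypothetical variable, high/low ratio of 1000

# Correlation between each variable and its own logarithm.
r_narrow = pearson(narrow, [math.log(x) for x in narrow])
r_wide = pearson(wide, [math.log(x) for x in wide])

print("narrow range, corr(x, log x):", r_narrow)
print("wide range,   corr(x, log x):", r_wide)
```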

Are there really people going around saying that we endorse not ever log-transforming all-positive data? That’s really weird.

Apparently, the answer is yes. According to Shravan, people are aggressively arguing for not log-transforming.

That’s just wack.

Log transform, kids. And don’t listen to people who tell you otherwise.

Did that “bottomless soup bowl” experiment ever happen?

I’m trying to figure out if Brian “Pizzagate” Wansink’s famous “bottomless soup bowl” experiment really happened.

Way back when, everybody thought the experiment was real. After all, it was described in a peer-reviewed journal article.

Here’s my friend Seth Roberts in 2006:

An experiment in which people eat soup from a bottomless bowl? Classic! Or mythological: American Sisyphus. It really happened.

And here’s econ professor Richard Thaler and law professor Cass Sunstein in 2008:

Given that they described this experiment as a “masterpiece,” I assume they thought it was real.

Evidence that the experiment never happened

We’ve known for a while that some of the numbers in the Wansink et al. “Bottomless bowls” article were fabricated, or altered, or mis-typed, or mis-described, or something. Here’s James Heathers with lots of details.

But I’d just assumed there really had been such an experiment . . . until I encountered two recent blog comments by Jim and Mary expressing skepticism:

For me, for sure, if I got a 6oz soup bowl that refilled itself without me knowing I’d just go right on eating gallon after gallon of soup, never noticing. . . . There’s no way he even did that! That has to be a complete fabrication.

If you try to imagine designing the refilling soup bowl, it gets harder and harder the more you think about it. The soup has to be entering the bowl at exactly the right rate. . . . I don’t think they really did this experiment. They got as far as making the bowls and stuff, but then it was too hard to get it to work, and they gave up. This would explain why an experimental design with 2 bottomless and 2 non-bottomless subjects per table ended up with 23 controls and 31 manipulations . . .

I searched the internet and found a photo of the refilling soup bowl. Go to 2:36 at this video.

See also this video with actors (Cornell students, perhaps?) which purports to demonstrate how the bowl could be set up in a restaurant. The video is obviously fake so it doesn’t give me any sense of how they could’ve done it in real life.

I also found this video where Wansink demonstrates the refilling bowl. But this bowl, unlike the one in the previous demonstration, is attached to the table so I don’t see how it could ever be delivered to someone sitting at a restaurant.

So when you look at it that way: an absurdly complicated apparatus, videos that purport to be reconstructions but which lack plausibility, and no evidence of any real data . . . It seems that the whole thing could be a fake, that there was no experiment after all. Maybe they built the damn thing, tried it out on some real students, it didn’t work, and then they made up some summary statistics to put in the article. Or they did the experiment in some other way (for example, just giving some people more soup than others, with the experimentalists rationalizing it to themselves that this was essentially equivalent to that bottomless-bowl apparatus) and then fudged the data at the end to get statistically significant and publishable results.

Or maybe it all happened as described, and someone just mistyped a bunch of numbers which is why the values in the published paper didn’t add up.

To paraphrase Jordan Anaya: I dunno. If I’d just designed and carried out the most awesome experiment of my career—a design that some might call a “masterpiece”—I think I’d be pretty damn careful with the data that resulted. I’d’ve made something like 50 copies of the dataset to make sure it never got lost, and I’d triple-check all my analyses to make sure I didn’t make any mistakes. I might even bring in two trusted coauthors just to be 100% sure that there were no missteps. I wouldn’t want to ruin this masterpiece.

It’s as if Wansink had found some rare and expensive crystal goblet and then threw it in the back of a pickup truck to bring it home. A complete disconnect between the huge effort required to purportedly collect the data, and the zero or negative effort expended on making sure the data didn’t get garbled or destroyed.

Evidence that the experiment did happen

On the other hand . . .

Perhaps the strongest argument in favor of the experiment being real is that there were three authors on that published paper. So if the whole thing was made up, it wouldn’t be just Brian Wansink doing the lying, it would also be James Painter and Jill North. That moves our speculation into the conspiracy category.

That said, we don’t know how the project was conducted. It might be that Wansink took responsibility for the data collection, and Painter and North were involved before and after and just took Wansink’s word for it that the experiment was actually done. Or maybe there is some other possibility.

Another piece of evidence in favor of the experiment being real is that Wansink and his colleagues put a lot of effort into explaining how the bowl worked. There are three paragraphs in Wansink et al. (2005) describing how they constructed the apparatus, how it worked, and how they operated it. Wansink also devotes a few pages of his book, Mindless Eating, to the soup experiment, providing further details; for example:

Our bottomless bowls failed to function during the first practice trial. The chicken noodle soup we were using either clogged the tubes or caused the soup to gurgle strangely. We bought 360 quarts of Campbells tomato soup, and started over.

I’m kinda surprised they ever thought the refilling bowl would work with chicken noodle soup—isn’t it obvious that it would clog the tube or clump in some way?—but, hey, dude’s a b-school professor, not a physicist, I guess we should cut him some slack.

Scrolling through Mindless Eating on Amazon, I also came across this:

It seems that when estimating almost anything—such as weight, height, brightness, loudness, sweetness, and so on—we consistently underestimate things as they get larger. For instance, we’ll be fairly accurate at estimating the weight of a 2-pound rock but will grossly underestimate the weight of an 80-pound rock. . . .

They’re having people lift 80-pound rocks? That’s pretty heavy! I wonder what the experimental protocol for that is. (I guess they could ask people to estimate the weight of the rock by just looking at it, but that would be tough for lots of reasons.)

But I digress. To return to the soup experiment, Wansink also provides this story about one of the few people who had to be excluded from the data:

Cool story, huh? Not quite consistent with the published paper, which simply said that 54 participants were recruited for the study, but at least some recognition that moving the soup bowl could create a problem.


Did the experiment ever happen? I just don’t know! I see good arguments on both sides.

I can tell you one thing, though. Whether or not Wansink’s apparatus ever made its way out of the lab, it seems that the “bottomless soup bowl” has been used in at least one real experiment. I found this paper from 2012, Episodic Memory and Appetite Regulation in Humans, by Jeffrey Brunstrom et al., which explains:

Soup was added or removed from a transparent soup bowl using a peristaltic pump (see Figure 1). The soup bowl was presented in front of the volunteers and it was fixed to a table. A tall screen was positioned at the back of the table. This separated the participant from both the experimenter and a second table, supporting the pump and a soup reservoir. Throughout the experiment, the volunteers were unable to see beyond the screen.

The bottom of the soup bowl was connected to a length of temperature-insulated food-grade tubing. This connection was hidden from the participants using a tablecloth. The tubing fed through a hole in the table (immediately under the bowl) and connected to the pump and then to a reservoir of soup via a hole in the screen. The experimenter was able to manipulate the direction and rate of flow using an adjustable motor controller that was attached to the pump. The pre-heated soup was ‘creamed tomato soup’ (supplied by Sainsbury’s Supermarkets Ltd., London; 38 kcal/100 g).


Participants were then taken to a testing booth where a bowl of soup was waiting. They were instructed to avoid touching the bowl and to eat until the volume of soup remaining matched a line on the side of the bowl. The line ensured that eating terminated with 100 ml of soup remaining, thereby obscuring the bottom of the bowl.

So it does seem like the bottomless soup bowl experiment is possible, if done carefully. The above-linked article by Brunstrom et al. seems completely real. If it’s a fake, it’s fooled me! If it’s real, and Wansink et al. (2005) was fake, then this is a fascinating case of a real-life replication of a nonexistent study. Kind of like if someone were to breed a unicorn.

“The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly.”

Ricardo Vieira writes:

I recently came upon this study from Princeton published in PNAS:

Implicit model of other people’s visual attention as an invisible, force-carrying beam projecting from the eyes

In which the authors asked people to demonstrate how much you have to tilt an object before it falls. They show that when a human head is looking at the object in the direction that it is tilting, people implicitly rate the tipping point as being lower than when a person is looking in the opposite direction (as if the eyes either pushed the object down or prevented it from falling). They further show that no such difference emerges when the human head is blindfolded. They repeated the experiment a few times with different populations (online and local) and slight modifications.

In a subsequent survey, they found that actually 5% of the population seems to believe in some form of eye-beams (or extramission if you want to be technical).

I have a few issues with the article. For starters, they do not compare directly the non-blindfolded and blindfolded conditions, although they emphasize several times that the difference in the first is significant and in the second is not. This point was actually brought up in the blog Neuroskeptic. The author of the blog writes:

This study seems fairly solid, although it seems a little fortuitous that the small effect found by the n=157 Experiment 1 was replicated in the much smaller (and hence surely underpowered) follow-up experiments 2 and 3C. I also think the stats are affected by the old erroneous analysis of interactions error (i.e. failure to test the difference between conditions directly) although I’m not sure if this makes much difference here.

In the discussion that ensued, one of the study authors responds to the two points raised. I feel the first point is not that relevant, as the first experiment was done on mturk and the subsequent ones in a controlled lab, and the estimated standard errors were pretty similar across the board. Now on to the second point, the author writes:

The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly. First, it should be noted that each of the bars shown in the figure is already a difference between two means (mean angular tilt toward the face vs. mean angular tilt away from the face), not itself a raw mean. What we report, in each case, is a statistical test on a difference between means. If I interpret your argument correctly, it suggests that the critical comparison for us is not this tilt difference itself, but the difference of tilt differences. In our study, however, I would argue that this is not the case, for a couple of reasons:

In experiment 1 (a similar logic applies to exp 2), we explicitly spelled out two hypotheses. The first is that, when the eyes are open, there should be a significant difference between tilts toward the face and tilts away from the face. A significant difference here would be consistent with a perceived force emanating from the eyes. Hence, we performed a specific, within-subjects comparison between means to test that specific hypothesis. Doing away with that specific comparison would remove the critical statistical test. Our main prediction would remain unexamined. Note that we carefully organized the text to lay out this hypothesis and report the statistics that confirm the prediction. The second hypothesis is that, when the eyes are closed, there should be no significant difference between tilts toward the face and tilts away from the face (null hypothesis). We performed this specific comparison as well. Indeed, we found no statistical evidence of a tilt effect when the eyes were closed. Thus, each hypothesis was put to statistical test. One could test a third hypothesis: any tilt difference effect is bigger when the eyes are open than when the eyes are closed. I think this is the difference of tilt differences asked for. However, this is not a hypothesis we put forward. We were very careful not to frame the paper in that way. The reason is that this hypothesis (this difference of differences) could be fulfilled in many ways. One could imagine a data set in which, when the eyes are open, the tilt effect is not by itself significant, but shows a small positivity; and when the eyes are closed, the tilt effect shows a small negativity. The combination could yield a significant difference of differences. The proposed test would then provide a false positive, showing a significant effect while the data actually do not support our hypotheses.

Of course, one could ask: why not include both comparisons, reporting on the tests we did as well as the difference of differences? There are at least two reasons. First, if we added more tests, such as the difference of differences, along with the tests we already reported, then we would be double-dipping, or overlapping statistical tests on the same data. The tests then become partially redundant and do not represent independent confirmation of anything. Second, as easy as it may sound, the difference-of-differences is not even calculatable in a consistent manner across all four experiments (e.g., in the control experiment 4), and so it does not provide a standardized way to evaluate all the results.

For all of these reasons, we believe the specific statistical methods reported in the manuscript are the simplest and the most valid. I totally understand that our statistics may seem to be affected by the erroneous analysis of interactions error, at first glance. But on deeper consideration, analyzing the difference-of-differences turns out to be somewhat problematical and also not calculatable for some of our data sets.

Is this reasonable?

My other issues relate to the actual effect. First the size of the difference is not clear (average difference is around 0.67 degrees, which are never described in terms of visual angle). I tried to draw two lines separated by 0.67 degrees, and I couldn’t tell the difference unless they were superimposed, but I am not sure I got the scale correct. Second, they do not state in the article how much rotation is caused by each key-press (is this average difference equivalent to one key-press, half, two?). Finally, the participants do not see the full object rendered during the experiment, but just one vertical line. The authors argue that otherwise people would use heuristics such as move the top corner over the opposite bottom corner. This necessity seems to refute their hypothesis (if the eye-beam bias only works on lines, then it seems of little relevance to the 3d world).

Okay, perhaps what really bothers me is the last paragraph of the article:

We speculate that an automatic, implicit model of vision as a beam exiting the eyes might help to explain a wide range of cultural myths and associations. For example, in StarWars, a Jedi master can move an object by staring at it and concentrating the mind. The movie franchise works with audiences because it resonates with natural biases. Superman has beams that can emanate from his eyes and burn holes. We refer to the light of love and the light of recognition in someone’s eyes, and we refer to death as the moment when light leaves the eyes. We refer to the feeling of someone else’s gaze boring into us. Our culture is suffused with metaphors, stories, and associations about eye beams. The present data suggest that these cultural associations may be more than a simple mistake. Eye beams may remain embedded in the culture, 1,000 y after Ibn al-Haytham established the correct laws of optics (12), because they resonate with a deeper, automatic model constructed by our social machinery. The myth of extramission may tell us something about who we are as social animals.

Before getting to the details, let me share my first reaction, which is appreciation that Arvid Guterstam, one of the authors of the published paper, engaged directly with external criticism, rather than ignoring the criticism, dodging it, or attacking the messenger.

Second, let me emphasize the distinction between individuals and averages. In the above-linked post, Neuroskeptic writes:

Do you believe that people’s eyes emit an invisible beam of force?

According to a rather fun paper in PNAS, you probably do, on some level, believe that.

And indeed, the abstract of the article states: “when people judge the mechanical forces acting on an object, their judgments are biased by another person gazing at the object.” But this finding (to the extent that it’s real, in the sense of being something that would show up in a large study of the general population under realistic conditions) is a finding about averages. It could be that everyone behaves this way, or that most people behave this way, or that only some people behave this way: any of these can be consistent with an average difference.
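To make the individuals-versus-averages point concrete, here is a small sketch with made-up numbers (the effect sizes and the population of 100 are hypothetical, not taken from the paper). Two very different populations give exactly the same average effect.

```python
# Hypothetical individual-level effects in arbitrary units, for illustration only.
everyone_small = [1.0] * 100          # every person shows the same small effect
few_large = [10.0] * 10 + [0.0] * 90  # 10% show a large effect, the rest show none

mean_everyone = sum(everyone_small) / len(everyone_small)
mean_few = sum(few_large) / len(few_large)

# Share of people with any effect at all in the second population.
share_affected = sum(e > 0 for e in few_large) / len(few_large)

print("average effect, everyone responds a little:", mean_everyone)
print("average effect, a few respond a lot:       ", mean_few)
print("share affected in the second population:   ", share_affected)
```

An average difference by itself cannot distinguish these two scenarios; that would take repeated measurements on the same people, or some model of how the effect varies across them.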

Also Neuroskeptic’s summary takes a little poetic license, in that the study does not claim that most people believe that eyes emit any force; the claim is that people on average make certain judgments as if eyes emit that force.

This last bit is no big deal but I bring it up because there’s a big difference between people believing in the eye-beam force and implicitly reacting as if there was such a force. The latter can be some sort of cognitive processing bias, analogous in some ways to familiar visual and cognitive illusions that persist even if they are explained to you.

Now on to Vieira’s original question: did the original authors do the right thing in comparing significant to not significant? No, what they did was mistaken, for the usual reasons.

The author’s explanation quoted above is wrong, I believe in an instructive way. The author talks a lot about hypotheses and a bit about the framing of the data, but that’s not so relevant to the question of what can we learn from the data. Procedural discussions such as “double-dipping” also miss the point: Again, what we should want to know is what can be learned from these data (plus whatever assumptions go into the analysis), not how many times the authors “dipped” or whatever.

The fundamental fallacy I see in the authors’ original analysis, and in their follow-up explanation, is deterministic reasoning, in particular the idea that whether a comparison is “statistically significant” is equivalent to an effect being real.

Consider this snippet from Guterstam’s comment:

The second hypothesis is that, when the eyes are closed, there should be no significant difference between tilts toward the face and tilts away from the face (null hypothesis).

This is an error. A hypothesis should not be about statistical significance (or, in this case, no significant difference) in the data; it should be about the underlying or population pattern.

And this:

One could imagine a data set in which, when the eyes are open, the tilt effect is not by itself significant, but shows a small positivity; and when the eyes are closed, the tilt effect shows a small negativity. The combination could yield a significant difference of differences. The proposed test would then provide a false positive, showing a significant effect while the data actually do not support our hypotheses.

Again, the problem here is the blurring of two different things: (a) underlying effects and (b) statistically significant patterns in the data.
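A quick numerical sketch of why comparing "significant" to "not significant" goes wrong. The estimates and standard errors below are hypothetical, not from the paper, and for simplicity I assume the two estimates are independent and use a normal approximation. One comparison clears the conventional 0.05 threshold and the other does not, yet the difference between them is nowhere near it.

```python
import math
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided p-value from a normal approximation."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical estimates and standard errors for two conditions.
est_a, se_a = 25.0, 10.0  # "statistically significant" on its own
est_b, se_b = 10.0, 10.0  # "not significant" on its own

# The comparison that actually addresses whether the conditions differ.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # assumes independent estimates

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)
p_diff = two_sided_p(diff, se_diff)

print(f"condition A: p = {p_a:.3f}")
print(f"condition B: p = {p_b:.3f}")
print(f"A minus B:   estimate = {diff:.1f}, se = {se_diff:.1f}, p = {p_diff:.3f}")
```

With within-subject data the standard error of the difference would also depend on the correlation between conditions, which this sketch leaves out.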

A big problem

The error of comparing statistical significance to non-significance is a little thing.

A bigger mistake is the deterministic attitude by which effects are considered there or not, the whole “false positive / false negative” thing. Lots of people, I expect most statisticians, don’t see this as a mistake, but it is one.

But an even bigger problem comes in this sentence from the author of the paper in question:

The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly.

He’s “quite sure”—but he’s wrong. This is a big, big, big problem. People are so so so sure of themselves.

Look. This guy could well be an excellent scientist. He has a Ph.D. He’s a neuroscientist. He knows a lot of stuff I don’t know. But maybe he’s not a statistics expert. That’s ok—not everyone should be a statistics expert. Division of labor! But a key part of doing good work is to have a sense of what you don’t know.

Maybe don’t be so quite sure next time! It’s ok to get some things wrong. I get things wrong all the time. Indeed, one of the main reasons for publishing your work is to get it out there, so that readers can uncover your mistakes. As I said above, I very much appreciate that the author of this article responded constructively to criticism. I think it’s too bad he was so sure of himself on the statistics, but even that is a small thing compared to his openness to discussion.

I agree with my correspondent

Finally, I agree with Vieira that the last paragraph of the article (“We speculate that an automatic, implicit model of vision as a beam exiting the eyes might help to explain a wide range of cultural myths and associations. For example, in StarWars, a Jedi master can move an object by staring at it and concentrating the mind. The movie franchise works with audiences because it resonates with natural biases. Superman has beams that can emanate from his eyes and burn holes. We refer to the light of love and the light of recognition in someone’s eyes, and we refer to death as the moment when light leaves the eyes. We refer to the feeling of someone else’s gaze boring into us. Our culture is suffused with metaphors, stories, and associations about eye beams. The present data suggest that these cultural associations may be more than a simple mistake. Eye beams may remain embedded in the culture, 1,000 y after Ibn al-Haytham established the correct laws of optics (12), because they resonate with a deeper, automatic model constructed by our social machinery. The myth of extramission may tell us something about who we are as social animals.”) is waaaay over the top. I mean, sure, who knows, but, yeah, this is story time outta control!

P.S. One amusing feature of this episode is that the above-linked comment thread has some commenters who seem to actually believe that eye-beams are real:

If “eye beam” is the proper term then I have no difficulty in registering my belief in them. Any habitué of the subway is familiar with the mysterious effect where looking at another’s face, who may be reading a book or be absorbed in his phone, maybe 20 or 30 feet away, will cause him suddenly to swivel his glance toward the onlooker. Let any who doubt experiment.

Just ask hunters or bird watchers if they exist. They know never to look directly at the animals head/eyes or they will be spooked.

I have had my arse saved by ‘sensing’ the gaze of others. This ‘effect’ is real. Completely subjective…yes. That I am here and able to write this comment…is a fact.

No surprise, I guess. There are lots of supernatural beliefs floating around, and it makes sense that they should show up all over, including on blog comment threads.

“I feel like the really solid information therein comes from non or negative correlations”

Steve Roth writes:

I’d love to hear your thoughts on this approach (heavily inspired by Arindrajit Dube’s work, linked therein):

This relates to our discussion from 2014:

My biggest takeaway from this latest: I feel like the really solid information therein comes from non or negative correlations:

• It comes before
• But it doesn’t correlate with ensuing (or it correlates negatively)

It’s pretty darned certain it isn’t caused by.

If smoking didn’t correlate with ensuing lung cancer (or correlated negatively), we’d say with pretty strong certainty that smoking doesn’t cause cancer, right?

By contrast, positive correlation only tells us that something (out of an infinity of explanations) might be causing the apparent effect of A on B. Non or negative correlation strongly disproves a hypothesis.

I’m less confident saying: if we don’t look at multiple positive and negative time lags for time series correlations, we don’t really learn anything from them?

More generally, this is basic Popper/science/falsification. The depressing takeaway: all we can really do with correlation analysis is disprove an infinite set of hypotheses, one at a time? Hoping that eventually we’ll gain confidence in the non-disproved causal hypotheses? Slow work!

It also suggests that file-drawer bias is far more pernicious than is generally allowed. The institutional incentives actually suppress the most useful, convincing findings? Disproofs?

(This all toward my somewhat obsessive economic interests: does wealth concentration/inequality cause slower economic growth one year, five years, twenty years later? The data’s still sparse…)

Roth summarizes:

“Dispositive” findings are literally non-positive. They dispose of hypotheses.

My reply:

1. The general point reminds me of my dictum that statistical hypothesis testing works the opposite way that people think it does. The usual thinking is that if a hyp test rejects, you’ve learned something, but if the test does not reject, you can’t say anything. I’d say it’s the opposite: if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise.

2. That said, what you write can’t be literally true. Zero or nonzero correlations don’t stay zero or nonzero after you control for other variables. For example, if smoking didn’t correlate with lung cancer in observational data, sure, that would be a surprise, but in any case you’d have to look at other differences between the exposed and unexposed groups.

3. As a side remark, just reacting to something at the end of your email, I continue to think that file drawer is overrated, given the huge number of researcher degrees of freedom, even in many preregistered studies (for example here). Researchers have no need to bury non-findings in the file drawer; instead they can extract findings of interest from just about any dataset.
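Point 2 above can be illustrated with a toy dataset. Everything here is hypothetical: z is a binary background variable, and the coefficients are picked so that x and y have exactly zero correlation in the pooled data even though, within each level of z, y goes up one-for-one with x.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: x depends on z plus some variation e that is unrelated to z.
rows = [(z, z + e) for z in (-1.0, 1.0) for e in (-1.5, -0.5, 0.5, 1.5)]
zs = [z for z, _ in rows]
xs = [x for _, x in rows]
# y rises by 1 per unit of x, but z pushes y down; 2.25 is chosen to cancel exactly.
ys = [x - 2.25 * z for z, x in rows]

# Ignoring z: no association between x and y at all.
pooled = slope(xs, ys)

# Controlling for z by looking within each level of z.
within = {
    z0: slope(
        [x for z, x in zip(zs, xs) if z == z0],
        [y for z, y in zip(zs, ys) if z == z0],
    )
    for z0 in (-1.0, 1.0)
}

print("pooled slope of y on x:", pooled)
print("slope within each level of z:", within)
```

So a zero correlation in the raw data does not by itself dispose of a causal hypothesis; it depends on what else differs between the groups being compared.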

What can be learned from this study?

James Coyne writes:

A recent article co-authored by a leading mindfulness researcher claims to address the problems that plague meditation research, namely, underpowered studies; lack of meaningful control groups; and an exclusive reliance on subjective self-report measures, rather than measures of the biological substrate that could establish possible mechanisms.

The article claims adequate sample size, includes two active meditation groups and three control groups and relies on seemingly sophisticated strategy for statistical analysis. What could possibly go wrong?

I think the study is underpowered to detect meaningful differences between active treatment and control groups. The authors haven’t thought out precisely how to use the presence of multiple control groups. They rely on statistical significance as the criterion for the value of the meditation groups. But when it comes to a reckoning, they avoid the inevitably nonsignificant results that would occur in comparisons of changes over time in active versus control groups. Instead they substitute within-group analyses and peer at whether the results are significant for active treatments, but not control groups.

The article does not present power analyses but simply states that “a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007).”

There are five groups, representing two active treatments and three control groups. That means that all the relevant action depends on group by time interaction effects in pairs of active treatment and control groups, with 27 participants in each cell.

I have seen a lot of clinical trials in psychological interventions, but never one with two active treatments and three control groups. In the abstract it may seem interesting, but I have no idea what research questions would be answered by this constellation. I can’t even imagine planned comparisons that would follow up on an overall treatment (5) by time interaction effect.

The analytic strategy was to examine whether there is an overall group by time interaction effect and then examine within-group, pre/post differences for particular variables. When these within-group differences are statistically significant for an active treatment group, but not for the control groups, it is considered a confirmation of the hypothesis that meditation is effective with respect to certain variables.

When there are within-differences for both psychological and biological variables, it is inferred that the evidence is consistent with the biological statement he psychological changes.

There are then mediational analyses that follow a standard procedure: construction of a zero-order correlation matrix; calculation of residual change scores for each individual with creation of dummy variables for four of the groups contrasted against the mutual control group. Simple mediation effects were then calculated for each psychological self-report variable with group assignment as the predictor variable and physiological variable as the moderator.

I think these mediational analyses are a wasted effort because of the small number of subjects exposed to each intervention.

At this point I would usually read the article, perhaps make some calculations, read some related things, figure out my general conclusions, and then write everything up.

This time I decided to do something different and respond in real time.

So I’ll give my response, labeling each step.

1. First impressions

The article in question is Soothing Your Heart and Feeling Connected: A New Experimental Paradigm to Study the Benefits of Self-Compassion, by Hans Kirschner, Willem Kuyken, Kim Wright, Henrietta Roberts, Claire Brejcha, and Anke Karl, and it begins:

Self-compassion and its cultivation in psychological interventions are associated with improved mental health and well- being. However, the underlying processes for this are not well understood. We randomly assigned 135 participants to study the effect of two short-term self-compassion exercises on self-reported-state mood and psychophysiological responses compared to three control conditions of negative (rumination), neutral, and positive (excitement) valence. Increased self-reported-state self-compassion, affiliative affect, and decreased self-criticism were found after both self-compassion exercises and the positive-excitement condition. However, a psychophysiological response pattern of reduced arousal (reduced heart rate and skin conductance) and increased parasympathetic activation (increased heart rate variability) were unique to the self-compassion conditions. This pattern is associated with effective emotion regulation in times of adversity. As predicted, rumination triggered the opposite pattern across self-report and physiological responses. Furthermore, we found partial evidence that physiological arousal reduction and parasympathetic activation precede the experience of feeling safe and connected.

My correspondent’s concern was that the sample size was too small . . . let’s look at that part of the paper:

We recruited a total of 135 university students in the United Kingdom (27 per experimental condition . . .)

OK, so yes I’m concerned. 27 seems small, especially for a between-person design.

But is N really too small? It depends on effect size and variation.

Let’s look at the data.

Here are the basic data summaries:

I think these are averages: each dot is the average of 27 people.

The top four graphs are hard to interpret: I see there’s more variation after than before, but beyond that I’m not clear what to make of this.

So I’ll focus on the bottom three graphs, which have more data. The patterns seem pretty clear, and I expect there is a high correlation across time. I’d like to see the separate lines for each person. That last graph, of skin conductance level, is particularly striking in that the lines go up and then down in synchrony.

What’s the story here? Skin conductance seems like a clear enough outcome, even if not of direct interest it’s something that can be measured. The treatments, recall, were “two short-term self-compassion exercises” and “three control conditions of negative (rumination), neutral, and positive (excitement) valence.” I’m surprised to see such clear patterns from these treatments. I say this from a position of ignorance; just based on general impressions I would not have known to expect such consistency.

2. Data analysis

OK, now we seem to be going beyond first impressions . . .

So what data would I like to see to understand these results better? I like the graphs above, and now I want something more that focuses on treatment effects and differences between groups.

To start with, how about we summarize each person’s outcome by a single number. I’ll focus on the last three outcomes (e, f, g) shown above. Looking at the graphs, maybe we could summarize each by the average measurement during times 6 through 11. So, for each outcome, I want a scatterplot. Let y_i be person i’s average outcome during times 6 through 11, and x_i is the outcome at baseline. For each outcome, let’s plot y_i vs x_i. That’s a graph with 135 dots, you could use 5 colors, one for each treatment. Or maybe 5 different graphs, I’m not sure. There are three outcomes, so that’s 3 graphs or a 3 x 5 grid.

I’d also suggest averaging the three outcomes for each person so now there’s one total score. Standardize each score and reverse-code as appropriate (I guess that in this case we’d flip the sign of outcome f when adding up these three). This would be the clear summary we’d need.
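Here is a minimal sketch of the two summaries just described, in Python using only the standard library. Everything named here is an assumption made for illustration: the function names, the convention that each person's series is a list whose first element is time 1, and the choice of which outcomes get reverse-coded. None of it comes from the paper's data or code.

```python
from statistics import mean, pstdev


def post_average(series, first=6, last=11):
    """Average of one person's measurements over times first..last.

    Assumes series[0] holds time 1, so times 6 through 11 are
    series[5:11]. This indexing is an assumption of the sketch.
    """
    return mean(series[first - 1:last])


def zscores(values):
    """Standardize a list of per-person values to mean 0, sd 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def composite_score(outcomes, signs):
    """Average the standardized outcomes for each person.

    outcomes: dict mapping outcome name -> list of per-person values
    signs:    dict mapping outcome name -> +1, or -1 to reverse-code
    """
    z = {name: zscores(vals) for name, vals in outcomes.items()}
    n_people = len(next(iter(outcomes.values())))
    return [
        mean(signs[name] * z[name][i] for name in outcomes)
        for i in range(n_people)
    ]
```

With real data you would apply `post_average` to each person's series for each of the three outcomes, pass the results to `composite_score` with a sign of -1 for whichever outcome runs in the opposite direction, and then plot the composite against the corresponding baseline score, colored by treatment group.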

I have the luxury of not needing to make a summary judgment on the conclusions, so I’ll just say that I’d like to see some scatterplots before going forward.

3. Other impressions

The paper gives a lot of numerical summaries of this sort:

The Group × Time ANOVA revealed no significant main effect of group, F(4, 130) = 1.03, p > .05, ηp2 = .03. However, the Time × Group interaction yielded significance, F(4, 130) = 24.46, p < .001, ηp2 = .43. Post hoc analyses revealed that there was a significant pre-to-post increase in positive affiliative affect in the CBS condition, F(1, 26) = 10.53, p = .003, ηp2 = .28, 95% CI = [2.00, 8.93], the LKM-S condition, F(1, 26) = 26.79, p < .001, ηp2 = .51, 95% CI = [5.43, 12.59] and, albeit smaller, for the positive condition, F(1, 26) = 6.12, p = .020, ηp2 = .19, 95% CI = [0.69, 7.46]. In the rumination condition there was a significant decrease in positive affiliative affect after the manipulation, F(1, 26) = 38.90, p < .001, ηp2 = .60, 95% CI = [–18.79, –9.48], whereas no pre-to-post manipulation difference emerged for the control condition, F(1, 26) = .49, p = .486, ηp2 = .01, 95% CI = [–4.77, 2.33]. Interestingly, an ANCOVA (see Supplemental Material) revealed that after induction, only individuals in the LKM-S condition reported significantly higher positive affiliative affect than those in the neutral condition, and individuals in the rumination condition reported significantly lower positive affiliative affect.

This looks like word salad—or, should I say, number salad—and full of forking paths. Just a mess, as it’s some subset of all the many comparisons that could be performed. I know this sort of thing is standard data-analytic practice in many fields of research, so it’s not like this paper stands out in a bad way; still, I don’t find these summaries to be at all helpful. I’d rather do a multilevel model.

And then there’s this:

No way. I’m not even gonna bother with this.

The paper concludes with some speculations:

Short-term self-compassion exercises may exert their beneficial effect by temporarily activating a low-arousal parasympathetic positive affective system that has been associated with stress reduction, social affiliation, and effective emotion regulation

Short-term self-compassion exercises may exert their beneficial effect by temporarily increasing positive self and reducing negative self-bias, thus potentially addressing cognitive vulnerabilities for mental disorders

I appreciate that the authors clearly labeled these as speculations, possibilities, etc., and the paper’s final sentences were also tentative:

We conclude that self-compassion reduces negative self-bias and activates a content and calm state of mind with a disposition for kindness, care, social connectedness, and the ability to self-soothe when stressed. Our paradigm might serve as a basis for future research in analogue and patient studies addressing several important outstanding questions.

4. Was the sample size too small?

The authors write:

Although the sample size in this study was based on a priori power calculation for medium effect sizes in mixed measures ANOVAs and the recruitment target was met, a larger sample size may have been desirable. Overall, a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007). However, some of the effects were small-to-medium rather than medium and failed to reach significance, and thus a replication in a larger sample is warranted to check the robustness of our effects.

This raises some red flags to me, as it’s been my impression that real-life effects in psychology experiments are typically much smaller than what are called “medium effect sizes” in the literature. Also I think the above paragraph reveals some misunderstanding about effect sizes in that the authors are essentially doing post-hoc power analysis, not recognizing the high variability in effect size estimates; for more background on this point, see here and here.
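To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It makes strong simplifying assumptions that are mine, not the paper's: a plain comparison of two groups of 27 on a single outcome scaled to have standard deviation 1, a normal approximation, and no use of the baseline or repeated measurements (which, if highly correlated over time, could reduce the noise considerably). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_standardized_diff(n_per_group):
    """Standard error of a difference in two group means, in sd units.

    Assumes equal group sizes and unit standard deviation in each group.
    """
    return sqrt(2 / n_per_group)


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided test for a true effect d."""
    z = NormalDist()
    se = se_standardized_diff(n_per_group)
    crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the estimate lands beyond either critical value.
    return z.cdf(d / se - crit) + z.cdf(-d / se - crit)


if __name__ == "__main__":
    print("se with 27 per group:", round(se_standardized_diff(27), 2))
    for d in (0.2, 0.5, 0.8):
        print("true effect", d, "-> power", round(approx_power(d, 27), 2))
```

Under these assumptions the standard error of a between-group difference is about 0.27 standard deviations, so an estimate has to be more than half a standard deviation to reach conventional significance. That is the sense in which estimates that do clear the bar will tend to be overestimates if the true effects are smaller than "medium."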

The other point I want to return to is the between-person design. Without any understanding of this particular subfield, I’d recommend a within-person study in the future, where you try multiple treatments on each person. If you’re worried about poisoning the well, you could do different treatments on different days.

Speaking more generally, I’d like to move the question away from sample size and toward questions of measurement. Beyond the suggestion to perform multiple treatments on each person, I’ll return to my correspondent’s questions at the top of this post, which I can’t really evaluate myself, not knowing enough about this area.

Bayesian Computation conference in January 2020

X writes to remind us of the Bayesian computation conference:

– BayesComp 2020 occurs on 7-10 January 2020 in Gainesville, Florida, USA
– Registration is open with regular rates till October 14, 2019
– Deadline for submission of poster proposals is December 15, 2019
– Deadline for travel support applications is September 20, 2019
– Sessions are posted on
– There are four free tutorials on January 7, 2020, on Stan, NIMBLE, SAS, and AutoStat

SAS, huh?

Amending Conquest’s Law to account for selection bias

Robert Conquest was a historian who published critical studies of the Soviet Union and whose famous “First Law” is, “Everybody is reactionary on subjects he knows about.” I did some searching on the internet, and the most authoritative source seems to be this quote from Conquest’s friend Kingsley Amis:

Further search led to this elaboration from philosopher Roger Scruton:

. . .

I agree with Scruton that we shouldn’t take the term “reactionary” (dictionary definition, “opposing political or social progress or reform”) too literally. Even Conquest, presumably, would not have objected to the law forbidding the employment of children as chimney sweeps.

The point of Conquest’s Law is that it’s easy to propose big changes in areas distant from you, but on the subjects you know about, you will respect tradition more, as you have more of an understanding of why it’s there. This makes sense, although I can also see the alternative argument that certain traditions might seem to make sense from a distance but are clearly absurd when looked at from close up. I guess it depends on the tradition.

In the realm of economics, for example, Engels, Keynes, and various others had a lot of direct experience of capitalism but it didn’t stop them from promoting revolution and reform. That said, Conquest’s Law makes sense and is clearly true in many cases, even if not always.

What motivated me to write this post, though, was not these sorts of rare exceptions—after all, most people who are successful in business are surely conservative, not radical, in their economic views—but rather an issue of selection bias.

Conquest was a successful academic and hung out with upper-class people, Oxbridge graduates, various people who were closer to the top than the bottom of the social ladder. From that perspective it’s perhaps no surprise that they were “reactionary” in their professional environments, as they were well ensconced there. This is not to deny the sincerity and relevance of such views, any more than we would want to deny the sincerity and relevance of radical views held by people with less exalted social positions. I’m sure the typical Ivy League professor such as myself is much more content and “reactionary” regarding the university system than would be a debt-laden student or harried adjunct. I knew some people who worked for minimum wage at McDonalds, and I think their take on the institution was a bit less reactionary than that of the higher-ups. This doesn’t mean that people with radical views want to tear the whole thing down (after all, people teach classes, work at McDonalds, etc., out of their own free will), nor that reactionaries want no change. My only point here is that the results of a survey, even an informal survey, of attitudes will depend on who you think of asking.

It’s interesting how statistical principles can help us better understand even purely qualitative statements.

A similar issue arose with baseball analyst Bill James. As I wrote a few years ago:

In 2001, James wrote:

Are athletes special people? In general, no, but occasionally, yes. Johnny Pesky at 75 was trim, youthful, optimistic, and practically exploding with energy. You rarely meet anybody like that who isn’t an ex-athlete—and that makes athletes seem special.

I’ve met 75-year-olds like that, and none of them was an ex-athlete. That’s probably because I don’t know a lot of ex-athletes. But Bill James . . . he knows a lot of athletes. He went to the bathroom with Tim Raines once! The most I can say is that I saw Rickey Henderson steal a couple bases in a game against the Orioles.

Cognitive psychologists talk about the base-rate fallacy, which is the mistake of estimating probabilities without accounting for underlying frequencies. Bill James knows a lot of ex-athletes, so it’s no surprise that the youthful, optimistic, 75-year-olds he meets are likely to be ex-athletes. The rest of us don’t know many ex-athletes, so it’s no surprise that most of the youthful, optimistic, 75-year-olds we meet are not ex-athletes. The mistake James made in the above quote was to write “You” when he really meant “I.” I’m not disputing his claim that athletes are disproportionately likely to become lively 75-year-olds; what I’m disagreeing with is his statement that almost all such people are ex-athletes. Yeah, I know, I’m being picky. But the point is important, I think, because of the window it offers into the larger issue of people being trapped in their own environments (the “availability heuristic,” in the jargon of cognitive psychology). Athletes loom large in Bill James’s world—I wouldn’t want it any other way—and sometimes he forgets that the rest of us live in a different world.

Another way to put it: Selection bias. Using a non-representative sample to draw inappropriate inferences about the population.

This does not make Conquest’s or James’s observations valueless. We just have to interpret them carefully given the data, to get something like:

Conquest: People near the top of a hierarchy typically like it there.

James: I [James] know lots of energetic elderly athletes. Most of the elderly non-athletes I know are not energetic.

Why does my academic lab keep growing?

Andrew, Breck, and I are struggling with the Stan group funding at Columbia just like most small groups in academia. The short story is that to apply for enough grants to give us a decent chance of making payroll in the following year, we have to apply for so many that our expected amount of funding goes up. So our group keeps growing, putting even more pressure on us in the future to write more grants to make payroll. It’s a better kind of problem to have than firing people, but the snowball effect means a lot of work beyond what we’d like to be doing.

Why does my academic lab keep growing?

Here’s a simple analysis. For the sake of argument, let’s say your lab has a $1.5M annual budget. And to keep things simple, let’s suppose all grants are $0.5M. So you need three per year to keep the lab afloat. Let’s say you have a well-oiled grant machine with a 40% success rate on applications.

Now what happens if you apply for 8 grants? There’s roughly a 30% chance you get fewer than the 3 grants you need, a 30% chance you get exactly the 3 grants you need, and a 40% chance you get more grants than you need.

If you’re like us, a 30% chance of not making payroll is more than you’d like, so let’s say you apply for 10 grants. Now there’s only a 20% chance you won’t make payroll (still not great odds!), a 20% chance you get exactly 3 grants, and a whopping 60% chance you wind up with 4 or more grants.

The more conservative you are about making payroll, the bigger this problem is.
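The rough percentages above can be checked with a binomial calculation. This is a minimal sketch in Python that takes the simple analysis at face value: every application succeeds independently with the same 40% probability, every grant is the same size, and you need exactly three to make payroll. The function names are made up for this example.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grant_outcomes(n_apps, p_success=0.4, needed=3):
    """Return (P(too few), P(exactly enough), P(more than needed))."""
    too_few = sum(binom_pmf(k, n_apps, p_success) for k in range(needed))
    exact = binom_pmf(needed, n_apps, p_success)
    return too_few, exact, 1 - too_few - exact


if __name__ == "__main__":
    for n_apps in (8, 10):
        too_few, exact, surplus = grant_outcomes(n_apps)
        expected = n_apps * 0.4  # expected number of funded grants
        print(n_apps, round(too_few, 2), round(exact, 2),
              round(surplus, 2), expected)
    # 8 applications:  about 0.32, 0.28, 0.41, expecting 3.2 grants
    # 10 applications: about 0.17, 0.21, 0.62, expecting 4.0 grants
```

The expected number of grants is just the number of applications times the success rate, so pushing the shortfall probability down from about 32% to about 17% moves the expected haul from 3.2 grants to 4, which is the growth pressure described above.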

Wait and See?

It’s not quite as bad as that analysis leads one to believe, because once a lab’s rolling, it’s usually working in two-year chunks, not one-year chunks. But that takes a while to build up that critical mass.

It would be great if you could apply and wait and see before applying again, but it’s not so easy. Most government grants have fixed deadlines, typically once or at most twice per year. The ones like NIH that have two submission periods/year have a tendency to not fund first applications. So if you don’t apply in a cycle, it’s usually at least another year before you can apply again. Sometimes special one-time-only opportunities with partners or funding agencies come up. We also run into problems like government shutdowns. I still have two NSF grants under review that have been backed up forever (we’ve submitted and heard back on other grants from NSF in the meantime).

The situation with Stan at Columbia

We’ve received enough grants to keep us going. But we have a bunch more in process, some of which we’re cautiously optimistic about. And we’ve already received about half a grant more than we anticipated, so we’re going to have to hire even if we don’t get the ones in process.

So if you know any postdocs or others who might want to work on the Stan language in OCaml and C++, let me know. A more formal job ad will be out soon.

Replication police methodological terrorism stasi nudge shoot the messenger wtf

Cute quote:

(The link comes from Stuart Ritchie.) Sunstein later clarified:

I’ll take Sunstein’s word that he no longer thinks it’s funny to attack people who work for open science and say that they’re just like people who spread disinformation. I have no idea what Sunstein thinks the “grain of truth” is, but I guess that’s his problem.

Last word on this particular analogy comes from Nick Brown:

The bigger question

The bigger question is: What the hell is going on here? I assume that Sunstein doesn’t think that “good people doing good and important work” would be Stasi in another life. Also, I don’t know who are “the replication police.” After all, it’s Cass Sunstein and Brian Wansink, not Nick Brown, Anna Dreber, Uri Simonsohn, etc., who’ve been appointed to policymaking positions within the U.S. government.

What this looks like to me is a sort of alliance of celebrities. The so-called “replication police” aren’t police at all—unlike the Stasi, they have no legal authority or military power. Perhaps even more relevant, the replication movement is all about openness, whereas the defenders of shaky science are often shifty about their data, their analyses, and their review processes. If you want a better political analogy, how about this:

The open-science movement is like the free press. It’s not perfect, but when it works it can be one of the few checks against powerful people and institutions.

I couldn’t fit in Stasi or terrorists here, but that’s part of the point: Brown, Dreber, Simonsohn, etc., are not violent terrorists, and they’re not spreading disinformation. Rather, they’re telling and disseminating truths that are unpleasant to some well-connected people.

Following the above-linked thread led me to this excerpt that Darren Dahly noticed from Sunstein’s book Nudge:

Jeez. Citing Wansink . . . ok, sure, back in the day, nobody knew that those publications were so flawed. But to describe Wansink’s experiments as “masterpieces” . . . what’s with that? I guess I understand, kind of. It’s the fellowship of the celebrities. Academic bestselling authors gotta stick together, right?

Several problems with science reporting, all in one place

I’d like to focus on one particular passage from Sunstein’s reporting on Wansink:

Wansink asked the recipients of the big bucket whether they might have eaten more because of the size of their bucket. Most denied the possibility, saying, “Things like that don’t trick me.” But they were wrong.

This quote illustrates several problems with science reporting:

1. Personalization; scientist-as-hero. It’s all Wansink, Wansink, Wansink. As if he did the whole study himself. As we now know, Wansink was the publicity man, not the detail man. I don’t know if these studies had anyone attending to detail, at least when it came to data collection and analysis. But, again, the larger point is that the scientist-as-hero narrative has problems.

2. Neglect of variation. Even if the study were reported and analyzed correctly, it could still be that the subset of people who said they were not influenced by the size of the bucket were not influenced. You can’t know, based on the data collected in this between-person study. We’ve discussed this general point before: it’s a statistical error to assume that an average pattern applies to everyone, or even to most people.

3. The claim that people are easily fooled. Gerd Gigerenzer has written about this a lot: There’s a lot of work being done by psychologists, economists, etc., sending the message that people are stupid and easily led astray by irrelevant stimuli. The implication is that democratic theory is wrong, that votes are determined by shark attacks, college football games, and menstrual cycles, so maybe we, the voters, can’t be reasoned with directly, we just have to be . . . nudged.

It’s frustrating to me how a commentator such as Sunstein is so ready to believe that participants in that popcorn experiment were “wrong” and then at the same time so quick to attack advocates for open science. If the open science movement had been around fifteen years ago, maybe Sunstein and lots of others wouldn’t have been conned. Not being conned is a good thing, no?

P.S. I checked Sunstein’s twitter feed to see if there was more on this Stasi thing. I couldn’t find anything, but I did notice this link to a news article he wrote, evaluating the president’s performance based on the stock market (“In terms of the Dow, 2018 was also pretty awful, with a 5.6 percent decline — the worst since 2008.”) Is that for real??

P.P.S. Look. We all make mistakes. I’m sure Sunstein is well-intentioned, just as I’m sure that the people who call us “terrorists” etc. are well-intentioned, etc. It’s just . . . openness is a good thing! To look at people who work for openness and analogize them to spies whose entire existence is based on secrecy and lies . . . that’s really some screwed-up thinking. When you’re turned around that far, it’s time to reassess, not just issue semi-apologies indicating that you think there’s a “grain of truth” to your attack. We’re all on the same side here, right?

P.P.P.S. Let me further clarify.

Bringing up Sunstein’s 2008 endorsement of Wansink is not a “gotcha.”

Back then, I probably believed all those sorts of claims too. As I’ve written in great detail, the past decade has seen a general rise in sophistication regarding published social science research, and there’s lots of stuff I believed back then, that I wouldn’t trust anymore. Sunstein fell for the hot hand fallacy fallacy too, but then again so did I!

Here’s the point. From one standpoint, Brian Wansink and Cass Sunstein are similar: They’re both well-funded, NPR-beloved Ivy League professors who’ve written best-selling books. They go on TV. They influence government policy. They’re public intellectuals!

But from another perspective, Wansink and Sunstein are completely different. Sunstein cares about evidence, Wansink shows no evidence of caring about evidence. When Sunstein learns he made a mistake, he corrects it. When Wansink learns he made a mistake, he muddies the waters.

I think the differences between Sunstein and Wansink are more important than the similarities. I wish Sunstein would see this too. I wish he’d see that the scientists and journalists who want to open things up, to share data, to reveal their own mistakes as well as those of others, are on his side. And the sloppy researchers, those who resist open data, open methods, and open discussion, are not.

To put it another way: I’m disturbed that an influential figure such as Sunstein thinks that the junk science produced by Brian Wansink and other purveyors of unreplicable research are “masterpieces,” while he thinks it’s “funny” with “a grain of truth” to label careful, thoughtful analysts such as Brown, Dreber, and Simonsohn as “Stasi.” Dude’s picking the wrong side on this one.

As always, I think the best solution is not for researchers to just report on some preregistered claim, but rather for them to display the entire multiverse of possible relevant results.

I happened to receive these two emails in the same day.

Russ Lyons pointed to this news article by Jocelyn Kaiser, “Major medical journals don’t follow their own rules for reporting results from clinical trials,” and Kevin Lewis pointed to this research article by Kevin Murphy and Herman Aguinis, “HARKing: How Badly Can Cherry-Picking and Question Trolling Produce Bias in Published Results?”

Both articles made good points. I just wanted to change the focus slightly, to move away from the researchers’ agency and to recognize the problem of passive selection, which is again why I like to speak of forking paths rather than p-hacking.

As always, I think the best solution is not for researchers to just report on some preregistered claim, but rather for them to display the entire multiverse of possible relevant results.

“Beyond ‘Treatment Versus Control’: How Bayesian Analysis Makes Factorial Experiments Feasible in Education Research”

Daniel Kassler, Ira Nichols-Barrer, and Mariel Finucane write:

Researchers often wish to test a large set of related interventions or approaches to implementation. A factorial experiment accomplishes this by examining not only basic treatment–control comparisons but also the effects of multiple implementation “factors” such as different dosages or implementation strategies and the interactions between these factor levels. However, traditional methods of statistical inference may require prohibitively large sample sizes to perform complex factorial experiments.

We present a Bayesian approach to factorial design. Through the use of hierarchical priors and partial pooling, we show how Bayesian analysis substantially increases the precision of estimates in complex experiments with many factors and factor levels, while controlling the risk of false positives from multiple comparisons.

Using an experiment we performed for the U.S. Department of Education as a motivating example, we perform power calculations for both classical and Bayesian methods. We repeatedly simulate factorial experiments with a variety of sample sizes and numbers of treatment arms to estimate the minimum detectable effect (MDE) for each combination.

The Bayesian approach yields substantially lower MDEs when compared with classical methods for complex factorial experiments. For example, to test 72 treatment arms (five factors with two or three levels each), a classical experiment requires nearly twice the sample size as a Bayesian experiment to obtain a given MDE.

They conclude:

Bayesian methods are a valuable tool for researchers interested in studying complex interventions. They make factorial experiments with many treatment arms vastly more feasible.

I love it. This is stuff that I’ve been talking about for a long time but have never actually done. These people really did it. Progress!

Here are some examples of real-world statistical analyses that don’t use p-values and significance testing.

Joe Nadeau writes:

I’ve followed the issues about p-values, signif. testing et al. both on blogs and in the literature. I appreciate the points raised, and the pointers to alternative approaches. All very interesting, provocative.

My question is whether you and your colleagues can point to real world examples of these alternative approaches. It’s somewhat easy to point to mistakes in the literature. It’s harder, and more instructive, to learn from good analyses of empirical studies.

My reply:

I have lots of examples of alternative approaches; see the applied papers here.

And here are two particular examples:

The Millennium Villages Project: a retrospective, observational, endline evaluation

Analysis of Local Decisions Using Hierarchical Modeling, Applied to Home Radon Measurement and Remediation

Attorney General of the United States less racist than Nobel prize winning biologist

This sounds pretty bad:

The FBI was better off when “you all only hired Irishmen,” [former Attorney General] Sessions said in one diatribe about the bureau’s workforce. “They were drunks but they could be trusted. . . .”

But compare to this from Mister Helix:

[The] historic curse of the Irish . . . is not alcohol, it’s not stupidity. . . it’s ignorance. . . . some anti-Semitism is justified. Just like some anti-Irish feeling is justified . . .

And “who would want to adopt an Irish kid?”

Watson elaborated:

You can be real dumb or you can seem dumb because you don’t know anything — that’s all I’m saying. The Irish seemed dumb because they didn’t know anything.

He seems to be tying himself into knots, trying to reconcile old-style anti-Irish racism with modern-day racism in which there’s a single white race. You see, he wants to say Irish are inferior, but he can’t say their genes are worse, so he puts it down to “ignorance.”

Overall, I’d have to say Sessions is less of a bigot: Sure, he brings in a classic stereotype, but in a positive way!

Lots of us say stupid and obnoxious things in private. One of the difficulties of being a public figure is that even your casual conversation can be monitored. It must be tough to be in that position, and I can see how at some point you might just give up and let it all loose, Sessions or Watson style, and just go full-out racist.

For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

Statistical models are placeholders. We lay down a model, fit it to data, use the fitted model to make inferences about quantities of interest (qois), check to see if the model’s implications are consistent with data and substantive information, and then go back to the model and alter, fix, update, augment, etc.

Given that models are placeholders, we’re interested in the dependence of inferences on model assumptions. In particular, with Bayesian inference we’re often concerned about the prior.

With that in mind, a while ago I came up with this recommendation.

For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

The idea here is that if the prior distribution is informative in this way, it can make sense to think harder about it, rather than just accepting the placeholder.
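Here is a minimal sketch of how that check might look in code. The function name, the dictionary layout, and the example numbers are all hypothetical; in practice the posterior draws would come from whatever fitting software is in use, and the prior sds would be read off the model specification.

```python
from statistics import stdev


def flag_informative_priors(prior_sd, posterior_draws, threshold=0.1):
    """Return names of parameters whose posterior sd exceeds threshold * prior sd."""
    flagged = []
    for name, draws in posterior_draws.items():
        ratio = stdev(draws) / prior_sd[name]
        if ratio > threshold:
            flagged.append(name)
            print(f"{name}: posterior sd / prior sd = {ratio:.2f}. "
                  "The prior distribution for this parameter is informative.")
    return flagged


# Hypothetical example: made-up prior sds and a handful of made-up posterior draws.
prior_sd = {"alpha": 10.0, "beta": 1.0}
posterior_draws = {
    "alpha": [2.1, 1.9, 2.0, 2.2, 1.8],  # posterior sd is small relative to the prior sd
    "beta": [0.3, -0.2, 0.6, 0.1, 0.9],  # posterior sd is a large fraction of the prior sd
}
flagged = flag_informative_priors(prior_sd, posterior_draws)
```

In this toy example only `beta` is flagged, since its posterior sd is a substantial fraction of its prior sd.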

I’ve been interested in using this idea and formalizing it, and then the other day I got an email from Virginia Gori, who wrote:

I recently read your contribution to the Stan wiki page on priors choice recommendations, suggesting to ensure that the ratio of the standard deviations of the posterior and the prior(s) is at least 0.1 to assess how informative priors are.

I found it very useful, and would like to use it in a publication. Searching online, I could only find this criterion in the Stan manual. I wonder if there’s a peer reviewed publication on this I should reference.

I have no peer-reviewed publication, or even any clear justification of the idea, nor have I seen it in the literature. But it could be there.

So this post serves several functions:

– It’s something that Gori can point to as a reference, if the wiki isn’t enough.

– It’s a call for people (You! Blog readers and commenters!) to point us to any relevant literature, including ideally some already-written paper by somebody else proposing the above idea.

– It’s a call for people (You! Blog readers and commenters!) to suggest some ideas for how to write up the above idea in a sensible way so we can have an Arxiv paper on the topic.

Conditional probability and police shootings

A political scientist writes:

You might have already seen this, but in case not: PNAS published a paper [Officer characteristics and racial disparities in fatal officer-involved shootings, by David Johnson, Trevor Tress, Nicole Burkel, Carley Taylor, and Joseph Cesario] recently finding no evidence of racial bias in police shootings:

Jonathan Mummolo and Dean Knox noted that the data cannot actually lead to any substantive conclusions one way or another, because the authors invert the conditional probability of interest (actually, the problem is a little more complicated, involving assumptions about base rates). They wrote a letter to PNAS pointing this out, but unfortunately PNAS decided not to publish it.

Maybe blogworthy? (If so, maybe immediately rather than on normal lag given prominence of study?)

OK, here it is.
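The inversion at issue can be illustrated with Bayes' rule. The sketch below uses entirely made-up numbers and generic group labels; nothing in it comes from the PNAS paper or from Knox and Mummolo's letter. It only shows that each group's share among events, P(group | event), says little about the rate at which each group experiences the event, P(event | group), unless the base rates of encounters are known.

```python
def share_among_events(p_event_given_group, base_rate):
    """Bayes' rule: P(group | event) from P(event | group) and P(group)."""
    joint = {g: p_event_given_group[g] * base_rate[g] for g in base_rate}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


# Hypothetical per-encounter event rates: group B's rate is twice group A's.
p_event_given_group = {"A": 0.001, "B": 0.002}

# Two hypothetical scenarios for each group's share of all encounters.
for base_rate in ({"A": 0.5, "B": 0.5}, {"A": 0.8, "B": 0.2}):
    print(base_rate, share_among_events(p_event_given_group, base_rate))
```

With equal encounter shares, group B makes up two thirds of the events; with an 80/20 split, group A does, even though the per-encounter rates never changed.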

Multilevel Bayesian analyses of the growth mindset experiment

Jared Murray, one of the coauthors of the Growth Mindset study we discussed yesterday, writes:

Here are some pointers to details about the multilevel Bayesian modeling we did in the Nature paper, and some notes about ongoing & future work.

We did a Bayesian analysis not dissimilar to the one you wished for! In section 8 of the supplemental material to the Nature paper, you’ll find some information about the Bayesian multilevel model we fit, starting on page 46 with the model statement and some information about priors below (variable definitions are just above). If you squint at the nonparametric regression functions and imagine them as linear, this is a pretty vanilla Bayesian multilevel model with school varying intercepts and slopes (on the treatment indicator). (For the Nature analysis all our potential treatment effect moderators are at the school level.) But the nonparametric prior distribution on those functions is actually imposing the kind of partial pooling you wanted to see, and in the end our Bayesian analysis produces substantively similar findings as the “classical” analysis, including strong evidence of positive average treatment effects and the same patterns of treatment effect heterogeneity.

The model & prior we use is a multilevel adaptation of the modeling approach we (Richard Hahn, Carlos Carvalho, and I) described in our paper “Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.” In that paper we focused on observational studies and the pernicious effects of even completely observed confounding. But the parameterization we use there is useful in general, including RCTs like the Mindset study. In particular:

1) Explicitly parameterizing the model in terms of the conditional average treatment effect function (lambda in the Nature materials, tau in our arxiv preprint) is important so we can include in the model many variables measured at baseline (to reduce residual variance) while also restricting our attention to a smaller subset of theoretically-motivated potential treatment effect moderators.

2) Perhaps more importantly, in this parameterization we are able to put a prior on the nonparametric treatment effect function (tau/lambda) directly. This way we can control the nature and degree of regularization/shrinkage/partial pooling. For our model which uses a BART prior on the treatment effect function this amounts to careful priors on how deep the trees grow and how far the leaf parameters vary from zero (and to a lesser extent the number of trees). As you suggest, our prior shrinks all the treatment effects toward zero, and also shrinks the nonparametric conditional average treatment effect function tau/lambda toward something that’s close to additive. If that function were exactly additive we’d have only two-way covariate by treatment interactions which seems like a sensible target to shrink towards. (As an aside that might be interesting to you and your readers, this kind of shrinkage is an advantage of BART priors over many alternatives like commonly used Gaussian process priors).

These are important points of divergence of our work from the multitude of “black box” methods for estimating heterogeneous treatment effects non/semiparametrically, including Jennifer’s (wonderful!) work on BART for causal inference.

In terms of what we presented in the Nature paper we were a little constrained by the pre-registration plan, which was fixed before some of us joined the team. In turn that prereg plan was constrained by convention; unfortunately, it would probably have been difficult or impossible at the time to fund the study and publish this paper in a similar venue without a prereg plan that primarily focused on the classical analysis and some NHST. [Indeed in my advice to this research team a couple years ago, I advised them to start with the classical analysis and then move to the multilevel model. – AG.] In terms of the Bayesian analysis we did present, we were limited by space considerations in the main document and a desire to avoid undermining later papers by burying new stats in supplemental materials.

We’re working on another paper that foregrounds the potential of Bayesian modeling for these kinds of problems and illustrates how it could enhance and simplify the design and analysis of a study like the NSLM. I think our approach will address many of your critiques: Rather than trying to test multiple competing hypotheses/models, we estimate a rich model of conditional average treatment effects with carefully specified, weakly informative prior distributions. Instead of “strewing the text with p-values”, we focus on different ways to summarize the posterior distribution of the treatment effect function (i.e. the covariate by treatment interactions). We do this via subgroup finding in our arxiv paper above (we kept it simple there, but those subgroup estimates are in fact the Bayes estimates of subgroups under a reasonable loss function). Of course given any set of interesting subgroups we can obtain the joint posterior distribution of subgroup average treatment effects directly once we have posterior samples, which we do in the Nature paper. The subgroup finding exercise is an instance of a more general approach to summarizing the posterior distribution over complex functions by projecting each draw onto a simpler proxy or summary, an idea we (Spencer Woody, Carlos and I) explore in a predictive context in another preprint, “Model interpretation through lower-dimensional posterior summarization.”

If you want to get an idea of what this looks like when it all comes together, here are slides from a couple of recent talks I’ve given (one at SREE aimed primarily at ed researchers, and the other at the Bayesian Nonparametrics meeting last June).

In both cases the analysis I presented diverges from the analysis in the Nature paper (the outcome in these talks is just math GPA, and I looked at the entire population of students rather than lower achieving students as in the Nature paper). So while we find similar patterns of treatment effect heterogeneity as in the Nature paper, the actual treatment effects aren’t directly comparable because the outcomes and populations are different. Anyway, these should give you a sense for the kinds of analyses we’re currently doing and hoping to normalize going forward. Hopefully the Nature paper helps that process along by showing a Bayesian analysis alongside a more conventional one.

It’s great to see statisticians and applied researchers working together in this way.
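For readers who want the structure of Murray's points 1 and 2 in symbols, here is a schematic of that kind of parameterization. The notation is an assumption for illustration, following his verbal description rather than copying the model statement from the supplement: $y_{ij}$ is the outcome for student $i$ in school $j$, $z_{ij}$ is the treatment indicator, $x_{ij}$ collects the baseline variables, and $w_j$ is the smaller set of school-level moderators.

```latex
y_{ij} = \alpha_j + \mu(x_{ij}) + \bigl[\phi_j + \tau(w_j)\bigr]\, z_{ij} + \epsilon_{ij},
\qquad \epsilon_{ij} \sim \mathrm{N}(0, \sigma^2)
```

Here $\alpha_j$ and $\phi_j$ stand for the school-varying intercepts and slopes, $\mu$ absorbs the baseline variables, and the prior placed directly on $\tau$ (lambda in the Nature materials) is where the shrinkage toward zero and toward additivity would enter.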

“Study finds ‘Growth Mindset’ intervention taking less than an hour raises grades for ninth graders”

I received this press release in the mail:

Study finds ‘Growth Mindset’ intervention taking less than an hour raises grades for ninth graders

Intervention is first to show national applicability, breaks new methodological ground

– Study finds low-cost, online growth mindset program taking less than an hour can improve ninth graders’ academic achievement
– The program can be used for free in high schools around U.S. and Canada
– Researchers developed rigorous new study design that can help identify who could benefit most from intervention and under which social contexts

A groundbreaking study of more than 12,000 ninth grade U.S. students has revealed how a brief, low-cost, online program that takes less than an hour to complete can help students develop a growth mindset and improve their academic achievement. A growth mindset is the belief that a person’s intellectual abilities are not fixed and can be further developed.

Published in the journal Nature on August 7, the nationally representative study showed that both lower- and higher-achieving students benefited from the program. Lower-achieving students had significantly higher grades in ninth grade, on average, and both lower- and higher-achieving students were more likely to enroll in more challenging math courses their sophomore year. The program increased achievement as much as, and in some cases more than, previously evaluated, larger-scale education interventions costing far more and taking far longer to complete. . . .

The National Study of Learning Mindsets is as notable for its methodology to investigate the differences, or heterogeneity, in treatment effects . . . the first time an experimental study in education or social psychology has used a random, nationally representative sample—rather than a convenience sample . . .

Past studies have shown mixed effects for growth mindset interventions, with some showing small effects and others showing larger ones.

“These mixed findings result from both differences in the types of interventions, as well as from not using nationally representative samples in ways that rule out other competing hypotheses,” [statistician Elizabeth] Tipton said. . . .

The researchers hypothesized that the effects of the mindset growth intervention would be stronger for some types of schools and students than others and designed a rigorous study that could test for such differences. Though the overall effect might be small when looking at all schools, particular types of schools, such as those performing in the bottom 75% of academic achievement, showed larger effects from the intervention.

More here.

I’m often skeptical about studies that appear in the tabloids and get promoted via press release, and I guess I’m skeptical here too—but I know a lot of the people involved in this one, and I think they know what they’re doing. Also I think I helped out in the design of this study, so it’s not like I’m a neutral observer here.

One thing that does bother me is all the p-values in the paper and, in general, the reliance on classical analysis. Given that the goal of this research is to recognize variation in treatment effects, I think it should be reasonable to expect lots of the important aspects of the model to not be estimated very precisely from data (remember 16: estimating an interaction can require about 16 times the sample size needed to estimate a main effect). So I’m thinking that, instead of strewing the text with p-values, there should be a better way to summarize inferences for interactions. Along similar lines, I’m guessing they could do better using Bayesian multilevel analysis to partially pool estimated interactions toward zero, rather than simple data comparisons which will be noisy. I recognize that many people consider classical analysis to be safer or more conservative, but statistical significance thresholding can just add noise; I think it’s partial pooling that will give results that are more stable and more likely to stand up under replication. This is not to say that I think the conclusions in the article are wrong; also, just at the level of the statistics, I think by far the most important issues are those identified by Tipton in the above-linked press release. I just think there’s more that can be done. Later on in the article they do include multilevel models, and so maybe it’s just that I’d like to see those analyses, including non-statistically-significant results, more fully incorporated into the discussion.
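As a rough illustration of what partially pooling interactions toward zero does, here is a minimal sketch of normal-normal shrinkage. It assumes each raw interaction estimate is normally distributed around its true value with a known standard error, and that the true interactions come from a normal distribution centered at zero with a fixed scale `tau`. The estimates, standard errors, and `tau` are invented for illustration, and a real multilevel model would estimate `tau` from the data rather than fixing it.

```python
def partially_pool(estimates, std_errors, tau):
    """Posterior means and sds under a N(0, tau^2) prior and normal likelihoods."""
    pooled = []
    for est, se in zip(estimates, std_errors):
        weight = tau**2 / (tau**2 + se**2)  # fraction of the raw estimate retained
        post_mean = weight * est            # shrunk toward the prior mean of zero
        post_sd = (weight * se**2) ** 0.5   # posterior sd is smaller than the raw se
        pooled.append((post_mean, post_sd))
    return pooled


# Hypothetical raw interaction estimates and their standard errors.
estimates = [0.30, -0.05, 0.12]
std_errors = [0.15, 0.05, 0.10]
pooled = partially_pool(estimates, std_errors, tau=0.10)
for raw, se, (mean, sd) in zip(estimates, std_errors, pooled):
    print(f"raw {raw:+.2f} (se {se:.2f}) -> pooled {mean:+.3f} (sd {sd:.3f})")
```

The noisiest estimate (0.30 with standard error 0.15) is pulled most strongly toward zero, while the most precise one barely moves, which is the stabilizing behavior that partial pooling is meant to provide.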

It appears the data and code are available here, so other people can do their own analyses, perhaps using multilevel modeling and graphical displays of grids of comparisons (but not grids of p-values; see discussion here) to get a clearer picture of what can be learned from the data.

In any case, this topic is potentially very important (an effective intervention lasting an hour), so I’m glad that top statisticians and education researchers are working on it. Here’s how Yeager et al. conclude:

The combined importance of belief change and school environments in our study underscores the need for interdisciplinary research to understand the numerous influences on adolescents’ developmental trajectories.

P.S. More here from Jared Murray, one of the authors of this study.

Hey, look! The R graph gallery is back.

We’ve recommended the R graph gallery before, but then it got taken down.

But now it’s back! I wouldn’t use it on its own as a teaching tool, in that it has a lot of graphs that I would not recommend (see here), but it’s a great resource, so thanks so much to Yan Holtz for putting this together. He has a Python graph gallery too at the same site.

You are invited to join Replication Markets

Anna Dreber writes:

Replication Markets (RM) invites you to help us predict outcomes of 3,000 social and behavioral science experiments over the next year. We actively seek scholars with different voices and perspectives to create a wise and diverse crowd, and hope you will join us.

We invite you – your students, and any other interested parties – to join our crowdsourced prediction platform. By mid-2020 we will rate the replicability of claims from more than 60 academic journals. The claims were selected by an independent team that will also randomly choose about 200 for testing (replication).

• RM’s forecasters bet on the chance that a claim will replicate and may adjust their assessment after reading the original paper and discussing results with other players. Previous replication studies have demonstrated prediction accuracy of about 75% with these methods.

• RM’s findings will contribute to the wider body of scientific knowledge with a high-quality dataset of claim reliabilities, comparisons of several crowd aggregation methods, and insights about predicting replication. Anonymized data from RM will be open-sourced to train artificial intelligence models and speed future ratings of research claims.

• RM’s citizen scientists predict experimental results in a play-money market with real payouts totaling over $100K*. Payouts will be distributed among the most accurate of its anticipated 500 forecasters. There is no cost to play the Replication Markets.

Our project needs forecasters like you with knowledge, insight, and expertise in fields across the social and behavioral sciences. Please share this invitation with colleagues, students, and others who might be interested in participating.

Here’s the link to their homepage. And here’s how to sign up.

I know about Anna from this study from 2015 where she and her colleagues tried and failed to replicate a much publicized experiment from psychology (“The samples were collected in privacy, using passive drool procedures, and frozen immediately”), and then from a later study that she and some other colleagues did, using prediction markets to estimate the reproducibility of scientific research.

P.S. I do have some concerns regarding statements such as, “we will rate the replicability of claims from more than 60 academic journals.” I have no problem with the 60 journals; my concern is with the practice of declaring a replication a “success” or “failure.” And, yes, I know I just did this in the paragraph above! It’s a problem. We want to get definitive results, but definitive results are not always possible. A key issue here is the distinction between truth and evidence. We can say confidently that a particular study gives no good evidence for its claims, but that doesn’t mean those claims are false. Etc.

Are supercentenarians mostly superfrauds?

Ethan Steinberg points to a new article by Saul Justin Newman with the wonderfully descriptive title, “Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans,” which begins:

The observation of individuals attaining remarkable ages, and their concentration into geographic sub-regions or ‘blue zones’, has generated considerable scientific interest. Proposed drivers of remarkable longevity include high vegetable intake, strong social connections, and genetic markers. Here, we reveal new predictors of remarkable longevity and ‘supercentenarian’ status. In the United States, supercentenarian status is predicted by the absence of vital registration. The state-specific introduction of birth certificates is associated with a 69-82% fall in the number of supercentenarian records. In Italy, which has more uniform vital registration, remarkable longevity is instead predicted by low per capita incomes and a short life expectancy. Finally, the designated ‘blue zones’ of Sardinia, Okinawa, and Ikaria corresponded to regions with low incomes, low literacy, high crime rate and short life expectancy relative to their national average.

In summary:

As such, relative poverty and short lifespan constitute unexpected predictors of centenarian and supercentenarian status, and support a primary role of fraud and error in generating remarkable human age records.

Supercentenarians are defined as “individuals attaining 110 years of age.”

I’ve skimmed the article but not examined the data or the analysis—we can leave that to the experts—but, if what Newman did is correct, it’s a great story about the importance of measurement in learning about the world.