Multilevel Bayesian analyses of the growth mindset experiment

Jared Murray, one of the coauthors of the Growth Mindset study we discussed yesterday, writes:

Here are some pointers to details about the multilevel Bayesian modeling we did in the Nature paper, and some notes about ongoing & future work.

We did a Bayesian analysis not dissimilar to the one you wished for! In section 8 of the supplemental material to the Nature paper, you’ll find some information about the Bayesian multilevel model we fit, starting on page 46 with the model statement and some information about priors below (variable definitions are just above). If you squint at the nonparametric regression functions and imagine them as linear, this is a pretty vanilla Bayesian multilevel model with school varying intercepts and slopes (on the treatment indicator). (For the Nature analysis all our potential treatment effect moderators are at the school level.) But the nonparametric prior distribution on those functions is actually imposing the kind of partial pooling you wanted to see, and in the end our Bayesian analysis produces substantively similar findings as the “classical” analysis, including strong evidence of positive average treatment effects and the same patterns of treatment effect heterogeneity.
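To make the “squint and imagine them as linear” reading concrete, one linearized version might look like the following. The notation is an assumption chosen for illustration and is not copied from the supplement: $y_{ij}$ is the outcome for student $i$ in school $j$, $z_{ij}$ is the treatment indicator, $x_{ij}$ are baseline covariates, and $w_j$ are school-level moderators. Treating the school intercepts and slopes as independent is a further simplification.

```latex
\begin{align*}
y_{ij} &= \alpha_j + x_{ij}^\top \beta
          + \left(\tau_0 + w_j^\top \delta + \phi_j\right) z_{ij}
          + \varepsilon_{ij},
  & \varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^2), \\
\alpha_j &\sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2),
  & \phi_j &\sim \mathcal{N}(0, \sigma_\phi^2).
\end{align*}
```

Here $\alpha_j$ plays the role of the school varying intercept and $\tau_0 + w_j^\top \delta + \phi_j$ the school varying slope on treatment; the nonparametric version would replace the linear terms $x_{ij}^\top \beta$ and $w_j^\top \delta$ with flexible functions.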

The model & prior we use is a multilevel adaptation of the modeling approach we (Richard Hahn, Carlos Carvalho, and I) described in our paper “Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.” In that paper we focused on observational studies and the pernicious effects of even completely observed confounding. But the parameterization we use there is useful in general, including RCTs like the Mindset study. In particular:

1) Explicitly parameterizing the model in terms of the conditional average treatment effect function (lambda in the Nature materials, tau in our arxiv preprint) is important so we can include in the model many variables measured at baseline (to reduce residual variance) while also restricting our attention to a smaller subset of theoretically-motivated potential treatment effect moderators.

2) Perhaps more importantly, in this parameterization we are able to put a prior on the nonparametric treatment effect function (tau/lambda) directly. This way we can control the nature and degree of regularization/shrinkage/partial pooling. For our model which uses a BART prior on the treatment effect function this amounts to careful priors on how deep the trees grow and how far the leaf parameters vary from zero (and to a lesser extent the number of trees). As you suggest, our prior shrinks all the treatment effects toward zero, and also shrinks the nonparametric conditional average treatment effect function tau/lambda toward something that’s close to additive. If that function were exactly additive we’d have only two-way covariate by treatment interactions which seems like a sensible target to shrink towards. (As an aside that might be interesting to you and your readers, this kind of shrinkage is an advantage of BART priors over many alternatives like commonly used Gaussian process priors).

These are important points of divergence of our work from the multitude of “black box” methods for estimating heterogeneous treatment effects non/semiparametrically, including Jennifer’s (wonderful!) work on BART for causal inference.
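The parameterization described in points 1) and 2) can be written schematically as below. This is a simplified sketch rather than the exact model statement: it omits the school-level random effects, and the symbols $\mu$, $g_k$, and $\tau_0$ are illustrative assumptions. The moderators $w_i$ are a subset of the baseline covariates $x_i$.

```latex
\begin{align*}
y_i &= \mu(x_i) + \tau(w_i)\, z_i + \varepsilon_i, \\
\tau(w) &= \tau_0 + \sum_{k} g_k(w_k)
  \quad \text{(the exactly additive case)}.
\end{align*}
```

In the additive case the treatment enters only through $\tau_0 z_i$ and the terms $g_k(w_{ik})\, z_i$, each of which involves a single moderator and the treatment, which is why only two-way covariate by treatment interactions remain.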

In terms of what we presented in the Nature paper we were a little constrained by the pre-registration plan, which was fixed before some of us joined the team. In turn that prereg plan was constrained by convention: unfortunately, it would probably have been difficult or impossible at the time to fund the study and publish this paper in a similar venue without a prereg plan that primarily focused on the classical analysis and some NHST. [Indeed in my advice to this research team a couple years ago, I advised them to start with the classical analysis and then move to the multilevel model. -AG.] In terms of the Bayesian analysis we did present, we were limited by space considerations in the main document and a desire to avoid undermining later papers by burying new stats in supplemental materials.

We’re working on another paper that foregrounds the potential of Bayesian modeling for these kinds of problems and illustrates how it could enhance and simplify the design and analysis of a study like the NSLM. I think our approach will address many of your critiques: Rather than trying to test multiple competing hypotheses/models, we estimate a rich model of conditional average treatment effects with carefully specified, weakly informative prior distributions. Instead of “strewing the text with p-values”, we focus on different ways to summarize the posterior distribution of the treatment effect function (i.e. the covariate by treatment interactions). We do this via subgroup finding in our arxiv paper above (we kept it simple there, but those subgroup estimates are in fact the Bayes estimates of subgroups under a reasonable loss function). Of course given any set of interesting subgroups we can obtain the joint posterior distribution of subgroup average treatment effects directly once we have posterior samples, which we do in the Nature paper. The subgroup finding exercise is an instance of a more general approach to summarizing the posterior distribution over complex functions by projecting each draw onto a simpler proxy or summary, an idea we (Spencer Woody, Carlos and I) explore in a predictive context in another preprint, “Model interpretation through lower-dimensional posterior summarization.”
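The step of obtaining the joint posterior of subgroup average treatment effects from posterior samples can be sketched as follows. This is a minimal illustration, not the code used for the paper: the array `tau_draws`, the binary label `group`, and all numbers are simulated stand-ins, and `subgroup_ate_draws` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model output: S posterior draws of the
# conditional average treatment effect for each of n units
# (rows = draws, columns = units). All values here are simulated.
S, n = 2000, 500
group = rng.integers(0, 2, size=n)            # made-up binary subgroup label
true_cate = np.where(group == 1, 0.10, 0.03)  # made-up effect sizes
shared = rng.normal(0.0, 0.02, size=(S, 1))   # per-draw shift shared by all units
tau_draws = true_cate + shared + rng.normal(0.0, 0.05, size=(S, n))


def subgroup_ate_draws(tau_draws, masks):
    """Average each posterior draw within each subgroup.

    Returns an array of shape (S, n_groups); row s holds the subgroup
    average treatment effects implied by posterior draw s, so the rows
    together are draws from the joint posterior of those averages.
    """
    return np.column_stack([tau_draws[:, m].mean(axis=1) for m in masks])


masks = [group == 0, group == 1]
ate = subgroup_ate_draws(tau_draws, masks)

post_mean = ate.mean(axis=0)                          # posterior means
intervals = np.percentile(ate, [2.5, 97.5], axis=0)   # 95% intervals, shape (2, n_groups)

# Because the draws are joint, contrasts between subgroups come for free.
diff = ate[:, 1] - ate[:, 0]
prob_larger = (diff > 0).mean()  # posterior probability that group 1's effect is larger
```

Averaging within disjoint subgroups is the least-squares projection of each draw onto the subgroup indicators, so this is one simple case of summarizing the posterior by projecting each draw onto a lower-dimensional proxy.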

If you want to get an idea of what this looks like when it all comes together, here are slides from a couple of recent talks I’ve given (one at SREE aimed primarily at ed researchers, and the other at the Bayesian Nonparametrics meeting last June).

In both cases the analysis I presented diverges from the analysis in the Nature paper (the outcome in these talks is just math GPA, and I looked at the entire population of students rather than lower achieving students as in the Nature paper). So while we find similar patterns of treatment effect heterogeneity as in the Nature paper, the actual treatment effects aren’t directly comparable because the outcomes and populations are different. Anyway, these should give you a sense for the kinds of analyses we’re currently doing and hoping to normalize going forward. Hopefully the Nature paper helps that process along by showing a Bayesian analysis alongside a more conventional one.

It’s great to see statisticians and applied researchers working together in this way.

1. Anonymous says:

I reason, and would like to sincerely note, that i think all the fancy statistics in the world do not solve problems due to a possibly flawed design of the study.

I like to refer to the discussion of the study (linked to above in the post) for some comments concerning that stuff.

• Anoneuoid says:

I didn’t check but mathematical thinking is pretty common. Many people think if you transform, adjust, interpolate, and average enough dirty data it can turn garbage into gold. See that recent paper where they average 600 million meaningless numbers to somehow get a meaningful number…

I can see the attraction because what this can do is give you a self-consistent view of what happened that seems to make sense. However, it is not necessarily accurate, so what they need to do is derive a surprising prediction from this model and then check it against new data.

• Anoneuoid says:

Autocorrect typo:
* mathematical -> mathemagical

• Anonymous says:

Quote from above: “Many people think if you transform, adjust, interpolate, and average enough dirty data it can turn garbage into gold. See that recent paper where they average 600 million meaningless numbers to somehow get a meaningful number…”

This reminded me of Meehl (1967) “Theory-testing in Psychology and Physics: A methodological paradox”

“Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.”

• Anoneuoid says:

As far as I’m concerned that paper is the most important one written since the 1950s, should be required reading and receive all sorts of prizes. But still, I don’t think a GIGO problem is what he was talking about.

He was talking about a logical problem. There is a many to one mapping of research hypotheses to the statistical “alternative hypothesis” used in the standard NHST (that there is some relationship between two variables). So, we don’t really learn anything from rejecting the default “null hypothesis”. It could mean any number of things are going on, some interesting some mundane (equipment malfunction, etc).

But people use rejection of the null hypothesis incorrectly to affirm the consequent by accepting their favorite research hypothesis (and ignoring, or only giving token lip service to, alternative explanations).

The “potent-but-sterile intellectual rake” strings together a series of these to generate a bunch of misinformation about their topic of interest.

• Anonymous says:

1) Quote from above: “As far as I’m concerned that paper is the most important one written since the 1950s, should be required reading and receive all sorts of prizes”

I agree that it should be required reading! I am however NOT a fan of prizes or awards in science.

I think Meehl’s paper is at times difficult to fully comprehend for me personally. I think this might be partly due to me being really bad in statistics, and the logic of hypothesis testing stuff is hard for me to comprehend. I also always feel Meehl at times writes more like a novel writer or poet than a scientist. This usually makes it harder for me to then try and extract, and/or determine, the crucial information.

2) Quote from above: “But still, I don’t think a GIGO problem is what he was talking about. He was talking about a logical problem”

I purposefully used the word “reminded” to try and be careful in not wanting to make a direct comparison with something like for instance the “Garbage In Garbage Out” (GIGO) thing that i think you are referring to.

The point i was trying to make, and how i interpret Meehl’s quote above, is that i think there are many other things to consider concerning doing “good” science besides statistics. And that a focus on (“fancy”) statistics (“exactitude of modern statistical hypothesis testing”) may take attention away from other, possibly, more important things (like the design of the study).

I think my interpretation of (the quote by) Meehl (1967) might be fair, as the following is written in the abstract of Meehl’s 1967 paper: “This problem is worsened by certain unhealthy tendencies prevalent among psychologists, such as a premium placed on experimental “cuteness” and a free reliance upon ad hoc explanations to avoid refutation.”

Also, the text just before the quote on pages 113 and 114 of Meehl’s paper seems to me to explicitly mention additional things that are problematic next to the (logic of) hypothesis stuff. In my interpretation this involves the actual design of the experiment which was one of the things i thought might be problematic in the “growth mindset” study, and what i wanted to make clear in my 1st comment.

• Anoneuoid says:

“the logic of hypothesis testing stuff is hard for me to comprehend.”

You have to be careful because there is something called “hypothesis testing” (devised by Jerzy Neyman and Egon Pearson), and also something called “significance testing” (devised by Ronald Fisher). Almost no one uses those methods, so while they may have problems it is really not worth being concerned about.

However, there is an extremely popular method called NHST, or “null hypothesis significance testing” (apparently devised by stats 101 textbook writers in the 1940s such as EF Lindquist), which is a nonsensical mixture of significance and hypothesis testing.

No one should be expected to comprehend NHST, because it makes no sense. Gerd Gigerenzer’s paper “Mindless Statistics” is a decent intro to this “hybrid”: https://www.sciencedirect.com/science/article/abs/pii/S1053535704000927

• Anonymous says:

Quote from above: “Gerd Gigerenzer’s paper “Mindless Statistics” is a decent intro to this “hybrid””

Thank you for the link to Gigerenzer’s paper, which i think would be a nice addition to the required reading list!

I read it (in between cooking, and eating, dinner), it already made things more clear to me, but should read it again to fully grasp it all.

Since i don’t want to be involved with doing actual science myself anymore, i will refrain from doing so because i would like to spend my time differently.

If i was still (attempting to) doing research, i would plan to read it a 2nd time though.

• Anonymous says:

Quote from above: “Thank you for the link to Gigerenzer’s paper, which i think would be a nice addition to the required reading list!”

I thought about it about 2 days ago, but i am commenting on it now: i wonder if it could be interesting, fun, useful, etc. when this blog would have a “Required Reading” or “You should definitely have read this one”- day. Reading papers like those of Meehl and Gigerenzer mentioned above can possibly help remind, or make clear, to people that papers (and scientists) like this exist.

On this “Required Reading”-day, certain papers that may not be known to (young) (social) scientists could be given some extra attention. This could, for instance, be done on the 1st Sunday of the month or something like that. I think that frequency would not be (-come) annoying, and on Sunday people may perhaps have a little more time to read an entire paper and think and comment about something a bit different, and it would be sort of something that people would know about after a while.

It could be especially suitable for discussing the “classics of the classics” type of papers, and perhaps also the “should be a classics of the classics” papers. The blogpost could briefly mention the content of the paper, or use a specific quote to make people curious, or simply copy-paste the abstract, or simply be attached to a fun or interesting story, etc. (basically how things go on this blog already).

And, suggestions for a “Required Reading” paper could be made in a specific blogpost that could be written for this specific purpose. If this would subsequently be mentioned with every “Required Reading” blogpost, people can go there and suggest something. This will also possibly help concerning people not talking about “Required Reading” paper X in a specific blogpost about “Required Reading” paper Y.

Perhaps you could end up with a nice series, like:

1) “Required Reading” January 2020: P. E. Meehl (1967) “Theory-testing in Psychology and Physics: A methodological paradox”

2) “Required Reading” February 2020: G. Gigerenzer (2004) “Mindless Statistics”

Etc.

• Anoneuoid says:

Well, this isn’t my blog but basically you are suggesting a journal club.

Btw, if you like Paul Meehl, you can watch some of his lectures here: http://meehl.umn.edu/talks/philosophical-psychology-1989

• jd says:

This Meehl paper was amazing. I hadn’t heard of it. I searched this blog and found other posts about it.
Sure is interesting that everyone is still in a dither about this stuff, and it was all discussed in 1967…
It does seem a bit like that comment in Nature could have started with one big headline that read, “Hey! Read that Meehl paper from 52 years ago!”
Also, looks like I should have been pointed toward this stuff in school.