
Male bisexuality gets Big PNAS Energy

Do flowers exist at night?—John Mulaney and the Sack Lunch Bunch

I have very little to say here, except to let you all know that the venerable PNAS has today published a paper (edited by Steven Pinker) letting us know that male bisexuality exists.

Here it is: Robust evidence for bisexual orientation among men (The paper has authors, which you can click through for, but I prefer to imagine it sprouting fully-formed from the mind of Zeus)

Big shout out to all my bi bros who have been making this happen for years and years without the privilege of existing.

A few choice quotes:

Some competing interests…

Competing interest statement: J.S. is president of the American Institute of Bisexuality (AIB), which has funded some of the research contributing to our data. Obtaining funding from the AIB has sometimes allowed it to have input into the design prior to funding approval. However, J.S. has had no role in data analysis or manuscript writing until the present article, when he has contributed to writing.

Size matters when studying if male bisexuality exists:

All existing studies have been of small to modest size; the largest had 114 participants.

Something something fallacy of decontextualized measurement:

Notably, across these studies, bisexual-identified men self-reported subjective arousal to both male and female stimuli, even in samples where their genital arousal did not reflect such a pattern.

I’ll take structural heterosexism and misogyny in a patriarchal society manifesting itself in different ways across gender for 500, Alex:

The question of whether bisexual arousal patterns exist has been less controversial about women than men

Honestly there are some stats in the paper but who really cares. The data is combined across 6 studies that cover two different types of stimulus (but it’s not clear how this was taken into account). The data also has about 270 straight(-ish) people, 200 gay(-ish) people, and 100 bi people, so, you know, balance.

The analysis itself is some sort of linear regression with a breakpoint such that the regression was a U shape (the arousal was higher/lower depending on where you are on the Kinsey scale). They used two different breakpoints (but didn’t seem to adjust for the multiple testing) rather than using an automatically detected breakpoint.
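For intuition, here is a generic broken-stick regression sketch (simulated data; this is not the paper’s actual model, and the Kinsey-style scores and noise level are invented for illustration). It profiles the fit over several candidate breakpoints rather than fixing one in advance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated Kinsey-style scores (0-6) and a U-shaped outcome with noise.
kinsey = rng.integers(0, 7, size=300).astype(float)
y = np.abs(kinsey - 3.0) + rng.normal(0, 0.5, size=300)

def fit_broken_stick(x, y, k):
    """Least-squares fit of y = a + b*x + c*max(0, x - k)."""
    X = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - k)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

# Profile the sum of squared errors over candidate breakpoints,
# instead of testing a couple of pre-chosen breakpoints separately.
candidates = [1.0, 2.0, 3.0, 4.0, 5.0]
sse = {k: fit_broken_stick(kinsey, y, k)[1] for k in candidates}
best_k = min(sse, key=sse.get)
print(best_k)  # should recover a breakpoint near the true kink at 3
```

Profiling over the breakpoint like this (or treating it as a parameter to estimate) sidesteps the multiple-testing issue that comes from fitting two hand-picked breakpoints and reporting both.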

The data is shared if anyone wants to play with it and check some things out. It’s on OSF. Nothing was pre-registered so go hogwild!

But I’m personally not that interested in spending any actual effort to “discover bisexuality”. (Who expected Colonize Bisexuality would be July 2020’s clarion call!)

Before you go, please join me one more time in singing the male bisexual national anthem Do Flowers Exist At Night.

PS. I think this needs a quote from a paper that Andrew, Greggor and I wrote:

We can identify all these problems with what might be called the fallacy of decontextualized measurement, the idea that science proceeds by crisp distinctions modeled after asocial phenomena, such as unambiguous medical diagnoses (the presence or absence of streptococcus or the color change of a litmus paper). Seeking an on–off decision, normalizing a base rate to 50 percent, and, most problematically, stripping a phenomenon of its social context all give the feel of scientific objectivity while creating serious problems for generalizing findings to the world outside the lab or algorithm.

or, to put it more succinctly:

Imagine saying “the indium/gallium strain gauge I hooked up to this man’s penis didn’t show consistent arousal as I Clockwork Orange’d him with gender-spanning porn, so male bisexuality does not exist”.

This is a much bigger problem than a data analysis that is making a bunch of arbitrary decisions and not accounting for the heterogeneity in the data. It is a question of whether these experiments are appropriate methods to demonstrate the existence of bisexual men.

There’s obviously a different question of why, in the year of our lord two thousand and twenty, researchers are using the language of discovery in this context. And why they’d go straight for the old penis measuring method rather than just asking.

P.P.S. from Andrew: There seems to be some confusion in the comments about Dan’s point, so I elaborated here:

I agree that it can make sense to perform physiological tests of sexual responses. If you publish a study on that, you might draw some laughs (back in the 1970s there were the “Golden Fleece Awards”), but that comes with the territory. Just because a form of measurement is funny to some people doesn’t mean it shouldn’t be done.

I can’t speak for Dan, but I think his problem is not that this paper does some sort of dick-o-meter thing, but rather the points you make in the fourth paragraph of your comment. Also I have some problems with the statistics in the paper, which are related to Dan’s concerns about “language of discovery.”

Consider this statement from the paper:

Evidence for bisexual orientation requires that Minimum Arousal have an inverted U-shaped distribution and that Absolute Arousal Difference be U-shaped.

That makes no sense to me. Evidence for bisexual orientation requires only that there are some people who are sexually attracted to both men and women. No inverted-U-shaped distribution is required.

The above-quoted sentence is an example of scientism or the fallacy of measurement: taking a real and much-discussed phenomenon (bisexuality) and equating it with some particular measurement or statistical test.

P.P.P.S. (Dan again – sorry, I had other things to do today): Wow. The comments are certainly a journey, aren’t they! A few things:

  • Here is a paper (referenced in the above paper) that basically says the defensible version of what this paper shows: that patterns of arousal vary between monosexual and bisexual men, but there is a lot of variance among bisexual men. It does not claim to validate or provide evidence for the existence of bisexuality. If these particular patterns are of interest to you then that’s fine. It’s what’s being measured.
  • Andrew used the word “scientism”, which is an excellent way to describe the leap from the measurement scenario to claims about the validity of bisexuality (two of the authors have gone both ways on this). The paper is about patterns of arousal. The idea that a single common pattern of arousal can be used to categorize a sexual orientation is reductive to the point of absurdity.
  • Because apparently some people will not rest until I’ve been very clear about whether or not self-identification is reliable, let me just say the following: If you’d asked me my orientation when I was 13 I would’ve said straight. If you’d inferred it by measuring arousal, my orientation was “buses going along bumpy roads”. There are indications that as time goes on, this happens less often (the straight thing. The bus thing is eternal). So while there is noise in self-identification, it’s no more than there is in most other demographic variables that we commonly use.
  • (To put that point somewhat differently, referring to these penis measurements as objective is not the same thing as them being objective. They are also noisy, contextual, and complicated by social and cultural forces.)

Can the science community help journalists avoid science hype? It won’t be easy.

tl;dr: Selection bias.

The public letter

Michael Eisen and Rob Tibshirani write:

Researchers have responded to the challenge of the coronavirus with a commitment to speed and cooperation, featuring the rapid sharing of preliminary findings through “preprints,” scientific manuscripts that have not yet undergone formal peer review. . . .

But the open dissemination of early versions of papers has created a challenge: how to ensure that policymakers and the public do not act too hastily on early studies that are soon shown to have serious errors. . . .

That is why we and a group of over 100 scientists are calling for American scientists and journalists to join forces to create a rapid-review service for preprints of broad public interest. It would corral a diverse contingent of scientists ready to comment on new preprints and to be responsive to reporters on deadline. . . .

My concerns

I think this proposed service could be a good idea. I have only three concerns:

1. The focus on peer review. Given all the problems we’ve seen with peer-reviewed papers, I don’t think preprints create a new challenge. Indeed, had peer review been some sort of requirement for attention, I’m pretty sure that the authors of that Santa Clara paper, with their connections, could’ve rushed it through an expedited peer review at PNAS or JAMA or Lancet or some other tabloid-style journal.

To put it another way, peer review is not generally done by “experts”; it’s done by “peers,” who often have the exact same blind spots as the authors of the papers being reviewed.

Remember Surgisphere? Remember Pizzagate? Remember himmicanes, air rage, ESP, ages ending in 9, beauty and sex ratio, etc etc etc?

2. This new service has to somehow stay independent of the power structure of academic science. For example, you better steer clear of the National Academy of Sciences, no joke, as they seem pretty invested in upholding the status of their members.

3. My biggest concern has to do with the stories that journalists like to tell. Or, maybe I should say, stories that audiences like to hear.

One story people like is the scientist as hero. Another is the science scandal, preferably with some fake data.

But what about the story of scientists who are trying their best but are slightly over their head, no fake data but they’re going too far with their claims? This is a story that can be hard to tell.

For example, consider those Stanford medical researchers. They did a reasonable study but then they botched the statistics and hyped their claims. But their claims might be correct! As I and others have written a few thousand times by now, the Stanford team’s data are consistent with their story of the world—along with many other stories. The punchline is not that their claims about coronavirus are wrong; it’s that their study does not provide the evidence that they have claimed (and continue to claim). It’s the distinction between evidence and truth—and that’s a subtle distinction!

Another example came up a few years ago, when two economists published a paper claiming that death rates among middle-aged non-Hispanic whites were increasing. It turned out they were wrong: death rates had been increasing, then flat, during the time of their study. And, more relevantly, death rates had been steadily increasing among women in that demographic category but not men. The economists in their analysis had forgotten to do age adjustment, and it just happened that the baby boom passed through their age window during the period under study, causing the average age of their category to increase by just enough to show an artifactual rise in death rate.
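To see how the artifact works, here is a toy numerical sketch (all rates and ages are invented; this is not the economists’ data). Hold every age-specific death rate constant over time and let only the age mix within the 45–54 bracket drift older:

```python
import numpy as np

# Hypothetical age-specific death rates for ages 45-54, flat in time:
# mortality rises with age, but no single age's rate changes.
ages = np.arange(45, 55)
rate_by_age = 0.002 * 1.08 ** (ages - 45)

def crude_rate(mean_age):
    # Population weights concentrated around the bracket's mean age.
    w = np.exp(-0.5 * ((ages - mean_age) / 3.0) ** 2)
    w /= w.sum()
    return float(w @ rate_by_age)

early = crude_rate(49.0)   # earlier years: the bracket skews younger
late  = crude_rate(50.5)   # later years: the boomers push the mean age up
print(late > early)        # True: a "rising" crude rate with no change
                           # in any age-specific rate
```

Age adjustment (standardizing the weights to a fixed reference population) removes exactly this compositional drift, which is why skipping it produced the artifactual trend.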

Anyway, I had a hard time talking with reporters about this study when it came out. I made it clear on the blog that the economists had messed up by not age adjusting—but, at the same time, their key point, which was a comparison of the U.S. to other countries, still seemed to hold up.

I recall talking with a respected journalist from a major news outlet who just didn’t know what to do with this. He had three story templates in mind:

1. Brilliant Nobel-prize-winning economist makes major discovery, or

2. Bigshot Nobel-prize-winning economist gets it all wrong, or

3. Food fight in academia.

I wouldn’t give him any of the three stories, for the following reasons:

1. The published paper really was flawed, especially given that it was often taken to imply that there was an increasing mortality rate among middle-aged white men, which really wasn’t the case. This myth continues to be believed by major economists (see here, for example), I guess because it’s such a great story.

2. The paper had this big mistake but the main conclusion, comparing to other countries, seemed to hold up. So I refused to tell the reporter that the paper was wrong.

3. I didn’t want a food fight. I wanted to say that the authors of the paper made some good points, but there was this claim about increasing death rates that wasn’t quite right.

I wouldn’t play ball and create a fight, so the journalist went with storyline 1, of the scientist-as-hero.

It can be hard to report on a study that has flaws but is not an absolute train wreck of a scandal. Surgisphere—that’s easy to write about. The latest bit of messed-up modeling—not so much.

So I support Eisen and Tibshirani’s efforts. But I don’t think it’ll be easy, especially given that there are news outlets that will print misinformation put out by reporters who have an interest in creating science heroes. Yeah, I’m looking at you, “MIT’s science magazine.”

Selection bias

We’ve talked about this before; see here and here. Here’s the logic:

Suppose you’re a journalist and you hear about some wild claim made by some scientist somewhere. If you talk with some outside experts who convince you that the article is flawed, you’ll decide not to write about it. But somewhere else there is a reporter who swallows the press release, hook, line, and sinker. This other reporter would of course run a big story. Hence the selection bias that the stories that do get published are likely to repeat the hype. Which in turn gives a motivation for researchers and public relations people to do the hype in the first place.
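The logic above can be put in a toy simulation (every rate here is invented purely for illustration): skeptical reporters mostly spike the story rather than write a debunk, so published coverage over-represents the hype even when most reporters are skeptics.

```python
import numpy as np

rng = np.random.default_rng(1)

n_claims, n_reporters = 1000, 10
p_credulous = 0.2   # chance a reporter swallows the press release
p_debunk = 0.02     # chance a skeptical reporter bothers to write a critical piece

# For each hyped claim, each reporter either runs the hype,
# writes a debunk, or (most often) just drops the story.
credulous = rng.random((n_claims, n_reporters)) < p_credulous
debunks = (~credulous) & (rng.random((n_claims, n_reporters)) < p_debunk)

hype_stories = credulous.sum()
critical_stories = debunks.sum()
share_hype = hype_stories / (hype_stories + critical_stories)
print(round(float(share_hype), 2))  # most published stories repeat the hype
```

The selection happens entirely at the publish-or-drop step: no reporter needs to be dishonest for the published record to skew heavily toward hype.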

P.S. Steve shares the above photo of Hopscotch, who seems skeptical about some of the claims being presented to him.

Coronavirus corrections, data sources, and issues.

This post is by Phil Price, not Andrew.

I’ve got a backlog of COVID-related stuff I’ve been meaning to post. I had intended to do a separate post about each of these, complete with citations and documentation, but the weeks are flying by and I’ve got to admit that that’s not going to happen. So you get this instead.

1. Alert the media: I made a mistake! Alex Gamma pointed out (a month and a half ago!) that I made a mistake in my plots of Years of Life Lost to coronavirus: I switched the labels of men and women. Alex wonders if the fact that this went unnoticed by me, or the dozens of commenters, is a reflection of people being used to the idea that women have it harder than men in just about everything, so seeing women supposedly being hit harder by COVID didn’t draw scrutiny. I don’t think that’s it — for one thing, we’re used to the fact that women live longer than men, so I think Alex’s proposal doesn’t fit here — but anyway I want to correct the record: there are more deaths, and more years of life lost, among men than among women.

2. Also in the “years of life lost” department, Konrad pointed out that in early May The Economist displayed some data showing the number of victims by age group, along with number of long-term health conditions, and years of life lost. There’s a lot of information in that graphic and I really appreciate the work that went into it. I wonder if there is some better way to display that information.

3. If you want to take a look at issues like the ones discussed above: Daniel Lakeland points out that number of COVID-19 deaths by sex, age group, and state is available from the US Department of Health. They’ve made some odd and slightly irritating choices in that datafile, e.g. the age groups aren’t all numeric (not even the first part of the string): there’s an “Under 1 year”. Why not 0-1, following the same pattern as the other age groups? Just adds one more pre-processing step if you want to do something like map these to actuarial tables. Speaking of which: expected years of life remaining, as a function of age and sex, is available from the Social Security Administration.

4. One issue I hope someone will take a look at — this means you! — is whether and how the distribution of deaths (and thus years of life lost) has changed with time. Daniel Lakeland suggested that we might expect this to change as the pandemic progresses, as vulnerable populations are better protected. One might expect that we will see fewer deaths per case, but with a lower percentage of deaths being those of the very old. Is this in fact happening?
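As a footnote to item 3, the pre-processing annoyance with the age-group labels can be handled in a couple of lines. The labels below are hypothetical examples in the style of the file described, where only “Under 1 year” breaks the numeric pattern:

```python
import re

def age_group_start(label):
    """Map an age-group label to the numeric start of its range."""
    if label.lower().startswith("under"):
        return 0  # "Under 1 year" is the one label with no leading number
    m = re.match(r"(\d+)", label)
    if m is None:
        raise ValueError(f"unparseable age group: {label!r}")
    return int(m.group(1))

labels = ["Under 1 year", "1-4 years", "5-14 years", "85 years and over"]
print([age_group_start(s) for s in labels])  # [0, 1, 5, 85]
```

With the start ages in hand, mapping each group onto the actuarial tables is a straightforward join.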

This post is by Phil.

My talk this Wed 7:30pm (NY time) / Thurs 9:30am (Australian time) at the Victorian Centre for Biostatistics

The “Victorian Centre for Biostatistics,” huh? I guess maybe I should speak about Francis Galton or something.

Actually, though, I’ll be giving this talk:

Bayesian workflow as demonstrated with a coronavirus example

We recently fit a series of models to account for uncertainty and variation in coronavirus tests (see here). We will talk about the background of this problem and our analysis, and then we will expand into a general discussion of Bayesian workflow.

Would we be better off if randomized clinical trials had never been born?

This came up in discussion the other day. In statistics and medicine, we’re generally told to rely when possible on the statistical significance (or lack of statistical significance) of results from randomized trials. But, as we know, statistical significance has all sorts of problems, most notably that it ignores questions of cost and benefit, and it doesn’t play well with uncertainty. Hence my post, evidence-based medicine eats itself.

In comments, Nick wrote:

Ok, so it is not easy, but small incremental gains can get you a long way.

The amelioration of symptoms and prognosis of almost every common disease has improved since I [Nick] started clinical medicine in 1987; progress built on very many RCTs, none of them perfect but together forming a tapestry of overlapping evidential strands that can be read.

This made me wonder: Would this benefit have occurred without randomized clinical trials (RCTs), just by clinicians and researchers trying different things and publishing their qualitative findings? I have no idea (by which I really mean I have no idea, not that I’m saying that RCTs have no value).

There are famous examples of mistakes being made from poorly-adjusted observational studies (see for example here), and where this bias disappears in a randomized controlled trial. But my question is not, Can randomized clinical trials work well in particular examples? or, Are there examples where nonrandomized comparisons can be misleading? or, Can randomized trials be analyzed better? My question is, If randomized clinical trials were never done at all, would we be worse off than we are now, in terms of medical outcomes?

I think it’s possible that the answer to this question is No, that the damage done by statistical-significance thinking is greater than the benefits of controlled studies.

I have no idea how to estimate where we would be under that counterfactual, but perhaps it’s a useful thought experiment.

P.S. Just to clarify, I’m not saying that the alternative to randomized clinical trials would be pure guesswork or reasoning by anecdote. I assume that there would still be controlled comparisons of different treatments performed in comparable conditions, just without the randomization and the randomization-based inference.

Also, yes, I recognize that randomization can often make sense. I’m not saying that randomization is always a bad idea. I’m just wondering if the idea of randomization had never come up, whether on balance we’d be better off, in part because researchers and decision makers would need to wrestle directly with the issues of variation, uncertainty, and comparability of treatment and control groups, without thinking they had a magic wand that could just give them the answer.

“Sorry, there is no peer review to display for this article.” Huh? Whassup, BMJ?

OK, this is weird. Yesterday we reported on an article with questionable statistical analysis published in the British Medical Journal. This one’s different from some other examples we’ve discussed recently (Surgisphere and Stanford) in that the author list of this recent article includes several statisticians.

One way to get a handle on this situation is to look at the reviews of the article. Fortunately, as Keith O’Rourke points out, the journal has an open peer review policy:

For research papers The BMJ has fully open peer review. This means that accepted research papers submitted from September 2014 onwards usually have their prepublication history posted alongside them on

This prepublication history comprises all previous versions of the manuscript, the study protocol (submitting the protocol is mandatory for all clinical trials and encouraged for all other studies at The BMJ), the report from the manuscript committee meeting, the reviewers’ comments, and the authors’ responses to all the comments from reviewers and editors.

That’s great! There are some conditions:

In rare instances we determine after careful consideration that we should not make certain portions of the prepublication record publicly available. For example, in cases of stigmatised illnesses we seek to protect the confidentiality of reviewers who have these illnesses. In other instances there may be legal or regulatory considerations that make it inadvisable or impermissible to make available certain parts of the prepublication record.

In this case, though, what we have are statistical analyses of public data, so there should be nothing stopping us from seeing the entire record.

But then there’s this:

Sorry, there is no peer review to display for this article


The closest thing to a review is this journal editorial, “Lockdown-type measures look effective against covid-19,” which reports some concerns about the data (“subject to variable quality, accuracy, and inconsistent testing practices”) but also makes the statement, “This study is as good as it could be given the data available,” which seems highly debatable given our discussion from yesterday.

Why is there no peer review for the original article?

As regular readers know, I think peer review is overrated. But if a journal is supposed to do open peer review, we should get to see it, no?

P.S. BMJ has posted the peer reviews! Here they are. The reviews have lots of questions about data quality:

We think the biggest limitation of the data is that it does not take into account the heterogeneity of the response within a country . . . we wondered about the data from some countries in particular. Most countries show a lot of variability with the dots all over the place. . . .

But I saw no serious concerns raised with the statistical modeling, though. I think this may have partly been a problem of trust, as one reviewer writes:

The background of the research team is strong, consisting of experts in epidemiology, public health and statistics from distinguished tertiary institutions.

And another reviewer writes:

I think the authors did such a good job with the flawed data they have to work with . . .

Also, none of the reviewers commented on that Canada graph.

Overall, this document shows the strengths and weaknesses of traditional peer review. The strength is that the peer reviewers raise lots of specific concerns and are interested in the reasoning in the article, conditional on its basic approach being valid. The weakness is that the reviewers trust their peers—the authors of the article—and don’t ever consider the possibility that all the modeling being done there is a hot steaming mess.

Please socially distance me from this regression model!

A biostatistician writes:

The BMJ just published a paper using regression discontinuity to estimate the effect of social distancing. But they have terrible models. As I am from Canada, I had particular interest in the model for Canada, which is on their supplemental material, page 84 [reproduced above].

I could not believe this was published. Here they are interested in change in slope, but for some reason they have a change in intercept (a jump) parameter, which I find difficult to justify. They have plenty of bad models in my estimation.

For completeness, here is the main paper, but you don’t even need to look at it . . .

Please, I would like not to have my name mentioned.

I agree with my correspondent that this graph does not look like a very good advertisement for their method!

Why did the British Medical Journal publish this paper? It’s an important topic, maybe none of the reviewers actually read the paper carefully, maybe it’s actually a wonderful piece of science and my correspondent and I just don’t realize this . . . the usual explanations!

I’m guessing that medical and public health journals feel a lot of pressure to publish papers on coronavirus, and there’s also the same sort of fomo that led to the Journal of Personality and Social Psychology publishing that ESP article in 2011. Never underestimate the power of fomo.

P.S. The article has some peer reviews. See P.S. here.


Christian Hennig writes:

Most statisticians are aware that probability models interpreted in a frequentist manner are not really true in objective reality, but only idealisations. I [Hennig] argue that this is often ignored when actually applying frequentist methods and interpreting the results, and that keeping up the awareness for the essential difference between reality and models can lead to a more appropriate use and interpretation of frequentist models and methods, called frequentism-as-model. This is elaborated showing connections to existing work, appreciating the special role of i.i.d. models and subject matter knowledge, giving an account of how and under what conditions models that are not true can be useful, giving detailed interpretations of tests and confidence intervals, confronting their implicit compatibility logic with the inverse probability logic of Bayesian inference, reinterpreting the role of model assumptions, appreciating robustness, and the role of “interpretative equivalence” of models. Epistemic (often referred to as Bayesian) probability shares the issue that its models are only idealisations and not really true for modelling reasoning about uncertainty, meaning that it does not have an essential advantage over frequentism, as is often claimed. Bayesian statistics can be combined with frequentism-as-model, leading to what Gelman and Hennig (2017) call “falsificationist Bayes”.

I’m interested in this topic (no surprise given the reference to our joint paper, “Beyond subjective and objective in statistics”).

I’ve long argued that Bayesian statistics is frequentist, in the sense that the prior distribution represents the distribution of parameter values among all problems for which you might apply a particular statistical model. Or, as I put it here, in the context of statistics being “the science of defaults”:

We can understand the true prior by thinking of the set of all problems to which your model might be fit. This is a frequentist interpretation and is based on the idea that statistics is the science of defaults. The true prior is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.

Here we are thinking of the statistician as a sort of Turing machine that has assumptions built in, takes data, and performs inference. The only decision this statistician makes is which model to fit to which data (or, for any particular model, which data to fit it to).

We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.
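One way to make the “true prior” idea concrete is a small simulation, with all numbers invented: imagine the same default model being fit across many problems, with the parameter varying from problem to problem according to some cross-problem distribution. A prior matched to that distribution beats the raw per-problem estimate on average:

```python
import numpy as np

rng = np.random.default_rng(2)

# The "true prior" is the cross-problem distribution of theta.
true_prior_mean, true_prior_sd = 1.0, 0.5
n_problems, n_obs = 2000, 20

theta = rng.normal(true_prior_mean, true_prior_sd, n_problems)  # one theta per problem
data_means = rng.normal(theta, 1.0 / np.sqrt(n_obs))            # each problem's data summary
                                                                # (unit data variance assumed)

# Conjugate normal posterior mean under the matched prior = shrinkage estimate.
v_post = 1 / (1 / true_prior_sd**2 + n_obs)
post_mean = v_post * (true_prior_mean / true_prior_sd**2 + n_obs * data_means)

rmse_pooled = np.sqrt(np.mean((post_mean - theta) ** 2))
rmse_raw = np.sqrt(np.mean((data_means - theta) ** 2))
print(rmse_pooled < rmse_raw)  # True: the matched prior wins on average across problems
```

This is the frequentist reading of the prior: its justification is its long-run performance over the population of problems the model gets applied to, not any claim about a single dataset.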

I like what Christian has to say in his article. I’m not quite sure what to do with it right now, but I think it will be useful going forward when I next want to write about the philosophy of statistics.

Frequentist thinking is important in statistics, for at least four reasons:

1. Many classical frequentist methods continue to be used by practitioners.

2. Much of existing and new statistical theory is frequentist; this is important because new methods are often developed and understood in a frequentist context.

3. Bayesian methods are frequentist too; see above discussion.

4. Frequentist ideas of compatibility remain relevant in many examples. It can be useful to know that a certain simple model is compatible with the data.

So I’m sure we’ll be talking more about all this.

Dispelling confusion about MRP (multilevel regression and poststratification) for survey analysis

A colleague pointed me to this post from political analyst Nate Silver:

At the risk of restarting the MRP [multilevel regression and poststratification] wars: For the last 3 models I’ve designed (midterms, primaries, now revisiting stuff for the general) trying to impute how a state will vote based on its demographics & polls of voters in other states is only a mediocrely accurate method.

It’s a decent stand-in when you have few polls and weak priors. Our models do use a little bit of it. But generally speaking, looking directly at the polls in a state is quite a bit more accurate. And often, so are simpler “fundamentals” methods (e.g. national polls + PVI).

Part of the issue is that seemingly demographically similar voters in different states may not actually vote all that similarly. e.g. a 46-year-old Hispanic man in California probably has different views on average than a 46-year-old Hispanic man in Idaho or Florida…

That’s partly because there are likely some unobserved characteristics (maybe the voter in Florida is Cuban-American and the one in California is Mexican-American) but also because states have different political cultures and are subject to different levels of campaign activity.

MRP does partial pooling. If you are interested in estimating public opinion in the 50 states, you’ll want to use all available information, including state polls and also national polls. To do this you’ll need a model of time trends in national and state opinions. That’s what Elliott, Merlin, and I do here. We don’t use MRP (except to the extent that the individual polls use MRP to get their topline estimates), but if we had the raw data from all the polls, and we had the time to devote to the project, we would do MRP on the raw data.

Because MRP does partial pooling of different sources of information, it should not do any worse than any one piece of this information.

So when Nate writes that “looking directly at the polls in a state is quite a bit more accurate” than MRP, I’m pretty sure that he’s just doing a crappy version of MRP.

A crappy version of any method can do really badly, I’ll grant you that.

The most useful step at this point would be for Nate to share his MRP code and then maybe someone could take a look and see what he’s doing wrong. Statistics is hard. I’ve made lots of mistakes myself (for example, here’s the story of an embarrassing example in polling analysis from a few years back).

There indeed are examples where MRP won’t help much, if the separate estimates from each state have low bias and low uncertainty. But in that case MRP will just do very little partial pooling; it should not perform worse than the separate estimates.
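For intuition, here is a minimal partial-pooling sketch (toy numbers, simple inverse-variance shrinkage toward a national mean; this is not Nate’s model and falls far short of full MRP, but it shows the mechanism):

```python
import numpy as np

# Hypothetical state-level poll estimates and their standard errors.
state_est = np.array([0.52, 0.48, 0.55, 0.44])
state_se  = np.array([0.05, 0.02, 0.08, 0.03])   # big SE = few polls in that state
national_mean, between_state_sd = 0.50, 0.03

# Precision-weighted shrinkage: precisely-polled states keep their own
# estimate; sparsely-polled states get pulled toward the national mean.
w = (1 / state_se**2) / (1 / state_se**2 + 1 / between_state_sd**2)
pooled = w * state_est + (1 - w) * national_mean
print(np.round(pooled, 3))  # each estimate pulled toward 0.50,
                            # more strongly where the state SE is large
```

The key property is visible in the weights: when a state’s own estimate is precise, w is near 1 and pooling barely changes it, which is why partial pooling should not do worse than the separate estimates.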

Similarly for the fundamentals-based models that Nate mentions. Our forecast partially pools toward the fundamentals-based models. More generally, if the fundamentals-based models predict well, that should just help MRP. But you do have to include that information in the model: MRP, like any statistical method, is only as good as the information that goes into it.

Cooperation, not competition

I feel like the problem is that Nate is seeing different methods as competing rather than cooperating. From his perspective, it’s MRP or state polls or fundamentals. But it doesn’t have to be one or the other. We can use all the information!

To put it another way: choosing a statistical method or a data source is not like an election where you have to pick the Democrat or the Republican. You can combine information and partially pool. The details of how to do this can get complicated, and it’s easy enough to get it wrong—that’s been my experience. But if you’re careful, you can put the information together and get more out of it than you’d learn from any single source. That’s why people go to the trouble of doing MRP in the first place.

I’m not saying that Nate should go use MRP

There are many roads to Rome. What’s important about a statistical method is not what it does with the data, so much as what data it uses. Nate’s done lots of good work over the years, and if he can manage to use all the relevant information in a reasonable way, then he can get good answers, whether or not he’s using a formal Bayesian adjustment procedure such as MRP.

Indeed, I’d rather that Nate use the method he’s comfortable with and do it well, than use some crappy version of MRP that gives bad answers. I do think there are advantages of using MRP—at some point, doing multiple adjustments by hand is like juggling plates, and you start to have issues with impedance matching—but it also can be hard to start from scratch. So I can accept Nate’s argument that MRP, as he has tried to implement it, has problems. The point of this post is just to clear up misunderstandings that might arise. If you do MRP right, you should do better than any of the individual data sources.

People are different. Your model should account for that.

Nate correctly writes that, “a 46-year-old Hispanic man in California probably has different views on average than a 46-year-old Hispanic man in Idaho or Florida.” Yup. Part of this is that the average political positions in California, Idaho, and Florida differ. In a basic MRP model with no interactions, the assumption would not be that middle-aged Hispanics vote the same on average in each state. Rather, the assumption would be that the average difference (on the logistic scale) between Hispanics and non-Hispanic whites would be the same in each state, after adjusting for other demographics such as age, education, and sex that might be included in the model. That said, in real life the average difference in attitudes, comparing Hispanics to non-Hispanic whites with the same demographics in the same state, will vary by state, and if you think this is important you can (indeed should) include it in your MRP model.
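To illustrate the no-interaction assumption numerically, here's a toy calculation with made-up logit-scale effects. The Hispanic vs. non-Hispanic-white gap is constant on the logit scale across states, yet the groups still vote differently in each state, and the probability-scale gap varies too:

```python
import math

def invlogit(x):
    """Inverse logit: maps the logit scale to a probability."""
    return 1 / (1 + math.exp(-x))

# Hypothetical logit-scale state baselines and one shared ethnicity
# effect -- all numbers made up purely for illustration.
state_effect = {"California": 0.8, "Idaho": -1.0, "Florida": 0.1}
hispanic_effect = 0.5  # same logit-scale shift in every state (no interaction)

for state, a in state_effect.items():
    gap = invlogit(a + hispanic_effect) - invlogit(a)
    # The logit-scale difference is 0.5 everywhere, but the
    # probability-scale difference varies with the state baseline.
    print(state, round(gap, 3))
```

If you think the logit-scale gap itself varies by state, that's exactly the state-by-ethnicity interaction you would add to the model.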

The relevant point here is that you should be able to directly feed your substantive understanding of voters into the construction of your MRP model. And that’s how it should be.

P.S. Zad sends in the above photo showing us what’s really inside the black box.

The War on Data: Now we pay the price

A few years ago, Mark Palko and I wrote an article, The War on Data, where we wrote:

The value of shared data reaches its logical extreme in high-quality, publicly available databases such as those maintained by the U.S. Census Bureau. These sources do not just support an extraordinary amount of research; they help individuals and institutions make better decisions and give us a set of agreed-upon facts that help keep our discussion honest and productive. For all these reasons, recent threats to publicly available data are cause for concern.

I’d forgotten about this until I came across a series of recent posts from Palko reprising the themes. This is all particularly relevant now, when we have the oddity of super-precise reports of the stock market and unemployment filings but no comprehensive sampling for coronavirus testing.

Yasmeen Abutaleb, Josh Dawsey, Ellen Nakashima, and Greg Miller tell the story in this news article:

The United States will likely go down as the country that was supposedly best prepared to fight a pandemic but ended up catastrophically overmatched by the novel coronavirus, sustaining heavier casualties than any other nation.

It did not have to happen this way. Though not perfectly prepared, the United States had more expertise, resources, plans and epidemiological experience than dozens of countries that ultimately fared far better in fending off the virus. . . .

The failure has echoes of the period leading up to 9/11: Warnings were sounded, including at the highest levels of government, but the president was deaf to them until the enemy had already struck.

The Trump administration received its first formal notification of the outbreak of the coronavirus in China on Jan. 3. Within days, U.S. spy agencies were signaling the seriousness of the threat . . .

The most consequential failure involved a breakdown in efforts to develop a diagnostic test that could be mass produced and distributed across the United States, enabling agencies to map early outbreaks of the disease, and impose quarantine measures to contain them. . . .

It does seem like a failure to recognize the value of good public data.

Association Between Universal Curve Fitting in a Health Care Journal and Journal Acceptance Among Health Care Researchers

Matt Folz points us to this recent JAMA article that features this amazing graph:

Beautiful. Just beautiful. I say this ironically.

What can be our goals, and what is too much to hope for, regarding robust statistical procedures?

Gael Varoquaux writes:

Even for science and medical applications, I am becoming weary of fine statistical modeling efforts, and believe that we should standardize on a handful of powerful and robust methods.

First, analytic variability is a killer, e.g. in “standard” analysis for brain mapping, for machine learning in brain imaging, or more generally in “hypothesis driven” statistical testing.

We need weakly-parametric models that can fit data as raw as possible, without relying on non-testable assumptions.

Machine learning provides these, and tree-based models require little data transformation.

We need non-parametric model selection and testing, that do not break if the model is wrong.

Cross-validation and permutation importance provide these, once we have chosen input (exogenous) and output (endogenous) variables.

If there are fewer than a thousand data points, all but the simplest statistical questions can and will be gamed (sometimes unconsciously), partly for lack of model selection. Here’s an example in neuroimaging.

I [Varoquaux] no longer trust such endeavors, including mine.

For thousands of data points and moderate dimensionality (99% of cases), gradient-boosted trees provide the necessary regression model.

They are robust to data distribution and support missing values (even outside MAR settings).

For thousands of data points and large dimensionality, linear models (ridge) are needed.

But applying them without thousands of data points (as I tried for many years) is hazardous. Get more data, change the question (eg analyze across cohorts).

Most questions are not about “prediction”. But machine learning is about estimating functions that approximate conditional expectations / probability. We need to get better at integrating it in our scientific inference pipelines.

My reply:

There are problems where automatic methods will work well, and problems where they don’t work so well. For example, logistic regression is great, but you wouldn’t want to use logistic regression to model Pr(correct answer) given ability, for a multiple choice test question where you have a 1/4 chance of getting the correct answer just by guessing. Here it would make more sense to use a model such as Pr(y=1) = 0.25 + 0.75*invlogit(a + bx). Of course you could generalize and then say, perhaps correctly, that nobody should ever do logistic regression; we should always fit the model Pr(y=1) = delta_1 + (1 – delta_1 – delta_2)*invlogit(a + bx). The trouble is that we don’t usually fit such models!
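For concreteness, here is the guessing-floor model from the paragraph above as a small numerical sketch (the function names are mine):

```python
import math

def invlogit(x):
    return 1 / (1 + math.exp(-x))

def p_correct(ability, a, b, guess=0.25):
    """Pr(correct answer) with a guessing floor: the probability never
    drops below the chance rate of a 4-option multiple-choice item.
    This is the 0.25 + 0.75 * invlogit(a + b*x) model from the text."""
    return guess + (1 - guess) * invlogit(a + b * ability)

# Plain logistic regression wrongly lets Pr(correct) approach 0 for
# low-ability test-takers; the floored model bottoms out at 0.25.
print(round(invlogit(-10), 3))                 # 0.0
print(round(p_correct(-10, a=0.0, b=1.0), 3))  # 0.25
```

The generalized version with two delta parameters works the same way, with a floor of delta_1 and a ceiling of 1 − delta_2 on the fitted probabilities.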

So I guess the point is that we should keep pushing to make our models more general. What this often means in practice is that we should be regularizing our fits. One big reason we don’t always fit general models is that it’s hard to estimate a lot of parameters using least squares or maximum likelihood or whatever.

I agree with your statement that “we should standardize on a handful of powerful and robust methods.” Defaults are not only useful; they are also in practice necessary. This also suggests that we need default methods for assessing the performance of these methods (fit to existing data and predictive power on new data). If users are given only a handful of defaults, then these users—if they are serious about doing their science, engineering, policy analysis, etc.—will need to do lots of checking and evaluation.

I disagree with your statement that we can avoid “relying on non-testable assumptions.” It’s turtles all the way down, dude. Cross-validation is fine for what it is, but we’re almost always using models to extrapolate, not to just keep on replicating our corpus.

Finally, it’s great to have thousands, or millions, or zillions of data points. But in the meantime we need to learn and make decisions from the information we have.


When I get magazines in the mail, I put them in a pile so that later I can read them in order. I’m a few months behind on the London Review of Books so I just happened to read this article by August Kleinzahler which informs us that Donald Trump is invincible.

I have no problem with literary figures giving their takes on politics, but this is just a rehash of the major news media. It’s not an original literary take; it’s your grandpa who’s been reading a bunch of newspapers and watching lots of TV news and then regurgitating it back to you. This would be like a political magazine running a literature column by some pundit who did nothing more than read a bunch of book reviews and then churn out more of the same old conventional wisdom.

Or they could just go all-in and run articles by pundits on twitter who say things like, “Telling people to wear masks might be the single thing Trump could do to most improve his re-election prospects.”

P.S. Erik sends in the above picture of Ocelot who has gotten comfortable and is patiently waiting for me to take a look at the paper on default priors that we’re working on. I promise that I’ll get to it, right after this bit of bloggy procrastination.

Probabilities for action and resistance in Blades in the Dark

Later this week, I’m going to be GM-ing my first session of Blades in the Dark, a role-playing game designed by John Harper. We’ve already assembled a crew of scoundrels in Session 0 and set the first score. Unlike most of the other games I’ve run, I’ve never played Blades in the Dark; I’ve only seen it on YouTube (my fave so far is Jared Logan’s Steam of Blood x Glass Cannon play Blades in the Dark!).

Action roll

In Blades, when a player attempts an action, they roll a number of six-sided dice and take the highest result. The number of dice rolled is equal to their action rating (a number between 0 and 4 inclusive) plus modifiers (0 to 2 dice). The details aren’t important for the probability calculations. If the total of the action rating and modifiers is 0 dice, the player rolls two dice and takes the worst. This is sort of like disadvantage and (super-)advantage in Dungeons & Dragons 5e.

A result of 1-3 is a failure with a consequence, a result of 4-5 is a success with a consequence, and a result of 6 is an unmitigated success without a consequence. If there are two or more 6s in the result, it’s a success with a benefit (aka a “critical” success).

The GM doesn’t roll. In a combat situation, you can think of the player roll as encapsulating a turn of the player attacking and the opponent(s) counter-attacking. On a result of 4-6, the player hits; on a result of 1-5, the opponent hits back or the situation becomes more desperate in some other way, like the character being disarmed or losing their footing. On a critical result (two or more 6s in the roll), the player succeeds with a benefit, perhaps cornering the opponent away from their flunkies.
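To make the action-roll mechanics concrete, here is a minimal simulation sketch (in Python rather than the R used later in this post; the function name and outcome labels are my own):

```python
import random
from collections import Counter

def action_roll(pool, rng):
    """Blades action roll: roll `pool` d6 and keep the highest;
    with 0 dice, roll two and keep the lowest.
    Two or more 6s (only possible with 2+ dice) is a critical."""
    if pool == 0:
        high = min(rng.randint(1, 6) for _ in range(2))
        crit = False
    else:
        dice = [rng.randint(1, 6) for _ in range(pool)]
        high = max(dice)
        crit = dice.count(6) >= 2
    if crit:
        return "66"
    if high == 6:
        return "6"
    return "4-5" if high >= 4 else "1-3"

rng = random.Random(1234)
n = 100_000
counts = Counter(action_roll(2, rng) for _ in range(n))
for outcome in ("1-3", "4-5", "6", "66"):
    # exact values for a 2-die pool are 9/36, 16/36, 10/36, 1/36
    print(outcome, round(counts[outcome] / n, 2))
```

Cranking the number of draws up to 10M reproduces the tables below to two decimal places.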

Resistance roll

When a player suffers a consequence, they can resist it. To do so, they gather a pool of dice for the resistance roll and spend an amount of stress equal to six minus the highest result. As with action rolls, if they have zero dice in the pool, they roll two dice and take the worst. If the player rolls a 6, the character takes no stress. If they roll a 1, the character takes 5 stress (which would very likely take them out of the action). If the player has multiple dice and rolls two or more 6s, they actually reduce their stress by 1.

For resistance rolls, the value between 1 and 6 matters, not just whether it’s in 1-3, in 4-5, equal to 6, or if there are two 6s.
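As a quick illustration of the resistance mechanics just described, here is a small simulation sketch (Python; the function name is mine):

```python
import random

def resistance_stress(pool, rng):
    """Stress cost of a Blades resistance roll: 6 minus the highest die.
    Two or more 6s instead clears 1 stress (returned as -1);
    with 0 dice, roll two and keep the lowest."""
    if pool == 0:
        return 6 - min(rng.randint(1, 6) for _ in range(2))
    dice = [rng.randint(1, 6) for _ in range(pool)]
    if dice.count(6) >= 2:
        return -1
    return 6 - max(dice)

rng = random.Random(42)
n = 100_000
rolls = [resistance_stress(4, rng) for _ in range(n)]
# chance of taking 1 stress or less with 4 dice (exact value is about .80)
print(round(sum(s <= 1 for s in rolls) / n, 2))
```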

Resistance roll results are order statistics (the maximum) of pools of six-sided dice; action rolls just bin those results, plus a little sugar on top for criticals. We could do this the hard way (combinatorics) or the easy way (simulation). That decision was easy.

Here’s a plot of the results for action rolls, with dice pool size on the x-axis and line plots of results 1-3 (failure with a complication), 4-5 (success with complication), 6 (success), and 66 (critical success with benefit). This is based on 10M simulations.

You can find a similar plot from Jasper Flick on AnyDice, in the short note Blades in the Dark.

I find the graph pretty hard to scan, so here’s a table in ASCII format, which also includes the resistance roll probabilities. The 66 result (at least two 6 rolls in the dice pool) is a possibility for both a resistance roll and an action roll. Both decimal places should be correct given the 10M simulations.

       RESISTANCE                      ACTION           BOTH

DICE    1    2    3    4    5    6     1-3  4-5    6      66
----  ----------------------------     -------------    ----
 0d   .31  .25  .19  .14  .08  .03     .75  .22  .03     .00

 1d   .17  .17  .17  .17  .17  .17     .50  .33  .17     .00
 2d   .03  .08  .14  .19  .25  .28     .25  .44  .28     .03

 3d   .01  .03  .09  .17  .29  .35     .13  .45  .35     .07
 4d   .00  .01  .05  .14  .29  .39     .06  .42  .39     .13

 5d   .00  .00  .03  .10  .27  .40     .03  .37  .40     .20
 6d   .00  .00  .01  .07  .25  .40     .02  .32  .40     .26

 7d   .00  .00  .01  .05  .22  .39     .01  .27  .39     .33
 8d   .00  .00  .00  .03  .19  .38     .00  .23  .38     .39

One could go for more precision with more simulations, or resort to working them all out combinatorially.

The hard way

The hard way is a bunch of combinatorics. These aren’t too bad because of the way the dice are organized. For the highest value of N dice, the probability that the maximum is less than or equal to k is the probability that a single die is less than or equal to k, raised to the N-th power, that is, (k/6)^N. It’s just that there are a lot of cells in the table, and then the differences would be required. Too error prone for me. Criticals can be handled Sherlock Holmes style by subtracting the probability of a non-critical from one. A non-critical either has no sixes (5^N possibilities with N dice) or exactly one six ((N choose 1) * 5^(N – 1) possibilities). That’s not so bad. But there are a lot of entries in the table. So let’s just simulate.
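That said, the hard way is compact in code if you let exact rational arithmetic do the bookkeeping. A sketch (Python; the function name is mine), using P(max ≤ k) = (k/6)^N and P(crit) = 1 − (5^N + N·5^(N−1))/6^N for pools of one or more dice:

```python
from fractions import Fraction

def exact_action_probs(n):
    """Exact action-roll outcome probabilities for a pool of n >= 1 dice.
    P(max <= k) = (k/6)^n; a non-critical has no sixes (5^n outcomes)
    or exactly one six (n * 5^(n-1) outcomes), so
    P(crit) = 1 - (5^n + n * 5^(n-1)) / 6^n."""
    p_max_le = lambda k: Fraction(k, 6) ** n
    crit = 1 - Fraction(5 ** n + n * 5 ** (n - 1), 6 ** n)
    return {
        "1-3": p_max_le(3),
        "4-5": p_max_le(5) - p_max_le(3),
        "6": p_max_le(6) - p_max_le(5) - crit,  # highest die is 6, no crit
        "66": crit,
    }

# with 2 dice: 9/36, 16/36, 10/36, 1/36
print({k: str(v) for k, v in exact_action_probs(2).items()})
```

The 0-dice case (roll two, take the lowest) would need its own small calculation, which is part of why simulation is the path of least resistance here.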

Edit: Cumulative Probability Tables

I really wanted the cumulative probability tables of a result or better (I suppose I could’ve also done it as result or worse). I posted these first on the Blades in the Dark forum. It uses Discourse, just like Stan’s forum.

Action Rolls

Here’s the cumulative probabilities for action rolls.

And here’s the table of cumulative probabilities for action rolls, with 66 representing a critical, 6 a full success, and 4-5 a partial success:

        probability of result or better
 dice   4-5+     6+     66
    0  0.250  0.028  0.000
    1  0.500  0.167  0.000
    2  0.750  0.306  0.028
    3  0.875  0.421  0.074
    4  0.938  0.518  0.132
    5  0.969  0.598  0.196
    6  0.984  0.665  0.263
    7  0.992  0.721  0.330
    8  0.996  0.767  0.395

Resistance Rolls

And here are the basic probabilities for resistance rolls.

Here’s the table for stress probabilities based on dice pool size


             Probability of Stress
Dice    5    4    3    2    1    0   -1
   0  .31  .25  .19  .14  .08  .03  .00 
   1  .17  .17  .17  .17  .17  .17  .00
   2  .03  .08  .14  .19  .25  .28  .03
   3  .00  .03  .09  .17  .28  .35  .07
   4  .00  .01  .05  .13  .28  .39  .13
   5  .00  .00  .03  .10  .27  .40  .20
   6  .00  .00  .01  .07  .25  .40  .26
   7  .00  .00  .01  .05  .22  .39  .33
   8  .00  .00  .00  .03  .19  .37  .40

Here’s the plot for the cumulative probabilities for resistance rolls.

Here’s the table of cumulative resistance rolls.


             Probability of Stress or Less
Dice       5      4     3     2    1    0     -1
   0    1.00    .69   .44   .25   .11  .03   .00 
   1    1.00    .83   .67   .50   .33  .17   .00
   2    1.00    .97   .89   .75   .56  .31   .03
   3    1.00   1.00   .96   .87   .70  .42   .07
   4    1.00   1.00   .99   .94   .80  .52   .13
   5    1.00   1.00  1.00   .97   .87  .60   .20
   6    1.00   1.00  1.00   .98   .91  .67   .26
   7    1.00   1.00  1.00   .99   .94  .72   .33
   8    1.00   1.00  1.00  1.00   .96  .77   .40

For example, with 4 dice (the typical upper bound for resistance rolls), there’s an 80% chance that the character takes 1, 0, or -1 stress, and 52% chance they take 0 or -1 stress. With 0 dice, there’s a better than 50-50 chance of taking 4 or more stress because the probability of 3 or less stress is only 44%.

Finally, here’s the R code for the resistance and cumulative resistance.

library(reshape)   # older reshape package: melt() labels matrix dims X1, X2
library(ggplot2)

# row = dice, col = c(1:6, 66)
# resist[d, r]: with d dice, P(highest die = r) for r in 1:5;
# column 6 is P(exactly one 6); column 7 is P(two or more 6s)
resist <- matrix(0, nrow = 8, ncol = 7)
resist[1, 1:6] <- 1/6
for (d in 2:8) {
  for (result in 1:5) {
    resist[d, result] <-
      sum(resist[d - 1, 1:result]) * 1/6 +
      resist[d - 1, result] * (result - 1) / 6
  }
  resist[d, 6] <- sum(resist[d - 1, 1:5]) * 1/6 +
                  resist[d - 1, 6] * 5/6
  resist[d, 7] <- resist[d - 1, 7] + resist[d - 1, 6] * 1/6
}

cumulative_resist <- resist  # just for sizing
for (d in 1:8) {
  for (result in 1:7) {
    cumulative_resist[d, result] <- sum(resist[d, result:7])
  }
}

# zero-dice rolls: two dice, take the lowest; P(min = k) = (13 - 2k)/36
zero_dice_probs <- c(11, 9, 7, 5, 3, 1, 0) / 36
zero_dice_cumulative_probs <- zero_dice_probs
for (n in 1:7)
  zero_dice_cumulative_probs[n] <- sum(zero_dice_probs[n:7])

z <- melt(cumulative_resist)  # X1 = dice, X2 = result, value = prob
stress <- 6 - z$X2
df <- data.frame(dice = z$X1, stress = as.factor(stress), prob = z$value)
df <- rbind(df, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7), prob = zero_dice_cumulative_probs))

cumulative_plot <- ggplot(df, aes(x = dice, y = prob,
                   colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress or less") +
  scale_x_continuous(breaks = 0:8)
ggsave('cumulative-resistance.jpg', plot = cumulative_plot, width = 5, height = 4)

z2 <- melt(resist)  # X1 = dice, X2 = result, value = prob
stress2 <- 6 - z2$X2
df2 <- data.frame(dice = z2$X1, stress = as.factor(stress2), prob = z2$value)
df2 <- rbind(df2, data.frame(dice = rep(0, 7), stress = as.factor(6 - 1:7),
                             prob = zero_dice_probs))

plot <- ggplot(df2, aes(x = dice, y = prob,
               colour = stress, group = stress)) +
  geom_line() + geom_point() +
  xlab("dice for resistance roll") +
  ylab("prob of stress") +
  scale_x_continuous(breaks = 0:8)
ggsave('resistance.jpg', plot = plot, width = 5, height = 4)

What does it take to be omniscient?

Palko points us to this comment from Josh Marshall:

To put it baldly, if it’s a topic and area of study you know nothing about and after a few weeks of cramming you decide that basically everyone who’s studied the question is wrong, there’s a very small chance you’ve rapidly come upon a great insight and a very great likelihood you’re an ignorant and self-regarding asshole. Needless to say, those are odds Dershowitz is happy to take. Dershowitz has now ‘read all the relevant historical material’ and has it covered.

I responded: “Drinking that Harvard kool-aid.”

To which Palko replied: “I suppose if you had a Harvard professor who was an economist and a doctor, he’d be omniscient.” In the meantime, we’ll have to go with the Harvard professors we have available. Good news is their replication rate “is quite high—indeed, it is statistically indistinguishable from 100%.”

Conflicting public attitudes on redistribution

Sociologist David Weakliem wrote recently:

A Quinnipiac poll from April 2019:

“Do you support or oppose raising the tax rate to 70% on an individual’s income that is over $10 million dollars?” 36% support, 59% oppose

A CNN poll from February 2019:

“Would you favor or oppose raising the personal income tax rate for those with very high incomes, so that income of ten million dollars or more would be taxed at a rate of 70%?” 41% favor, 52% oppose

A CBS News Poll from September 2009:

“If the Obama Administration proposed a tax of 50 percent or higher on the incomes of the very wealthiest millionaires, would you support it, or not?” 51% yes, 45% no

Even people who are towards the bottom of the economic ladder aren’t very enthusiastic. The Quinnipiac results were not broken down by income, but only 31% of whites without a college degree, 51% of blacks, and 47% of Hispanics supported a 70% tax.

This relates to an issue I [Weakliem] have written about before. It’s sometimes said that most people are to the left on economic issues. This suggests that the only way conservatives can win elections is by diverting their attention to “culture war” issues, or race, or some other area where the right has an advantage. But the idea that the public is to the left on economic issues is wrong—in addition to the lack of support for high tax rates, there’s not much support for inheritance taxes. This doesn’t mean that the public is conservative on economic issues—for example, most people are in favor of maintaining or increasing Social Security benefits, increasing the minimum wage, and increasing taxes on corporations. Public opinion on economic issues doesn’t really fit on a left/right scale . . .

A few days later, he followed up:

I [Weakliem] realize I left something out of my last post, which said that Americans were not in favor of high taxes on the rich. The Paul Krugman column that I mentioned said “A . . . large majority has consistently said that upper-income Americans pay too little, not too much, in taxes.” He is right—since 1992, the Gallup poll has asked if upper income people “are paying their FAIR share in federal taxes, paying too MUCH or paying too LITTLE?” In the latest survey (2019), 9% said too much, 27% said fair share, and 62% said too little. The share saying too little has never gone below 55%. But as my post pointed out, when you ask how much high-income people should pay, most people don’t suggest high rates. In addition to the questions I mentioned last time, here’s a Gallup/USA Today poll from 2011: “Now thinking about the wealthiest one percent of Americans, what percentage of their income do you think they should pay to the federal government in income taxes each year?” Among those who gave an answer (28% didn’t), the mean was about 24%, and only 10% said 40% or more.

Weakliem asks:

How do you reconcile these results?

His answer:

Most people seem to think that people with high incomes are taxed at lower rates than most middle-income people. A 2003 survey asked “In the United States, which group do you think pays the highest percentage of their income in total federal taxes: high-income people, middle-income people, or lower-income people, or don’t you know enough to say?” 25% said high-income people, 51% said middle-income people, and 11% said low income people (13% said they didn’t know). Even among people with college degrees and people earning $75,000 or more (the highest income class distinguished in the survey), most people thought that middle income people paid the highest percentage. Other surveys show that most people know that in principle marginal tax rates increase with income, so presumably they think that high-income people are able to get out of taxes by finding loopholes.

So when people say that high income people should pay more, they are just saying that they want them to pay at the same rate that middle-class people do, or maybe a slightly higher rate. In reality, they already do pay at a somewhat higher rate. Most people haven’t thought about the issue all that much, so you can’t make precise statements about public opinion. But in a rough sense, Americans are getting about as much redistribution as we want.

I dunno, I think it’s more complicated than that. I feel that many of the contradictions in public opinion arise from conflicting implications of “fairness.” On one hand, it seems fair if all are taxed at an equal rate; on the other hand, it seems fair that rich people pay more. Another complication is that taxes don’t exist in a vacuum; they’re the flip side of spending. On one hand, people like most government programs: survey respondents typically want to spend less on the military and on foreign aid but to maintain or increase spending on just about everything else. On the other hand, tax money goes to the government, and people mostly don’t trust the government.

So it’s tricky. Attitudes don’t exist in isolation.

In any case, Weakliem should have his own NYT column (along with Jay Livingston). Keep Krugman and Brooks; just reduce their frequencies and alternate them with Weakliem and Livingston.

Further debate over mindset interventions


Following up on this post, “Study finds ‘Growth Mindset’ intervention taking less than an hour raises grades for ninth graders,” commenter D points us to this post by Russell Warne that’s critical of research on growth mindset.

Here’s Warne:

Do you believe that how hard you work to learn something is more important than how smart you are? Do you think that intelligence is not set in stone, but that you can make yourself much smarter? If so, congratulations! You have a growth mindset.

Proposed by Stanford psychologist Carol S. Dweck, mindset theory states that there are two perspectives people can have on their abilities. Either they have a growth mindset–where they believe their intelligence and their abilities are malleable–or they have a fixed mindset. People with a fixed mindset believe that their abilities are either impossible to change or highly resistant to change.

According to the theory, people with a growth mindset are more resilient in the face of adversity, persist longer in tasks, and learn more in educational programs. People with a fixed mindset deny themselves these benefits.

I think he’s overstating things a bit. First, I think mindsets are more continuous than discrete: nobody can realistically believe, at one extreme, that hard work can’t help you learn, or, at the other, that all people are equally capable of learning anything if they just work hard. I mean, sure, maybe you can find inspirational quotes or whatever, but no one could realistically hold either extreme position. Similarly, it’s not clear what is meant by hard work being “more important” than smarts, given that these two attributes would be measured on different scales.

But, sure, I guess that’s the basic picture.

Warne summarizes the research:

On the one side are the studies that seriously call into question mindset theory and the effectiveness of its interventions. Li and Bates (2019) report a failed replication of Mueller and Dweck’s (1998) landmark study on how praise impacts student effort. Glerum et al. (in press) tried the same technique on older students in vocational education and found zero effect. . . .

The meta-analysis from Sisk et al. (2018) is pretty damning. They found that the average effect size for mindset interventions was only d = .08. (In layman’s terms, this would move the average child from the 50th to the 53rd percentile, which is extremely trivial.) Sisk et al. (2018) also found that the average correlation between growth mindset and academic performance is a tiny r = .10. . . .

On the other hand, there are three randomized controlled studies that suggest that growth mindset can have a positive impact on student achievement. Paunesku et al. (2015) found that teaching a growth mindset raised the grade point averages of at-risk students by 0.15 points. (No overall impact for all students is reported.) Yeager et al. (2016) found that at-risk students’ GPAs improved d = .10, but students with GPAs above the median had improvements of only d = .03. . . .

So, mixed evidence. But I think that we can all agree that various well-publicized claims of huge benefits of growth mindset are ridiculous overestimates, for the same reason that we don’t believe that early childhood intervention increases adult earnings by 42%, etc etc etc. On the other hand, smaller effects in some particular subsets of the population . . . that’s more plausible.
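As a check on the percentile arithmetic in Warne's summary: under a normal model, a standardized effect d moves the average treated student to the Φ(d) percentile of the control distribution. A quick sketch (the function name is mine):

```python
from statistics import NormalDist

def percentile_shift(d):
    """Percentile of the control distribution reached by the average
    treated person, for standardized effect size d (normal model)."""
    return 100 * NormalDist().cdf(d)

print(round(percentile_shift(0.08), 1))  # 53.2: the "50th to 53rd" claim
print(round(percentile_shift(0.57), 1))  # 71.6: the contested .57 benchmark
```

The 50th-to-53rd-percentile conversion checks out; whether an effect of that size is "trivial" is the substantive dispute, which Yeager takes up below.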

Then Warne lays down the hammer:

For a few months, I puzzled over the contradictory literature. The studies are almost evenly balanced in terms of quality and their results.

Then I discovered the one characteristic that the studies that support mindset theory share and that all the studies that contradict the theory lack: Carol Dweck. Dweck is a coauthor on all three studies that show that teaching a growth mindset can improve students’ school performance. She is not a coauthor on any of the studies that cast serious doubt on mindset theory.

So, there you go! Growth mindsets can improve academic performance—if you have Carol Dweck in charge of your intervention. She’s the vital ingredient that makes a growth mindset effective.

I don’t think Warne actually believes that Dweck can make growth mindset work. I think he’s being ironic, and that what he’s really saying is that the research published by Dweck and her collaborators is not to be trusted.


I sent the above to David Yeager, first author of the recent growth-mindset study, and he replied:

I [Yeager] don’t see why there has to be a conflict between mindset and IQ; there is plenty of variance to go around. But that aside, I think the post reflects a few outdated ways of thinking that devoted readers of your papers and your blog would easily spot.

The first is a “vote-counting” approach to significance testing, which I think you’ve been pretty clear is a problem. The post cites Rienzo et al. as showing “no impact” for growth mindset and our Nature paper as showing “an impact.” But the student intervention in Rienzo showed an ATE of .1 to .18 standard deviations (pg. 4). That’s anywhere from 2 to 3.5X the ATE from the student intervention in our pre-registered Nature paper (which was .05 SD). But Rienzo’s effects aren’t significant because it’s a cluster-randomized trial, while ours are because we did a student-level randomized trial. The minimum detectable effect for Rienzo was .4 to .5 SD, and I’ve never done a mindset study with anywhere near that effect size! It’s an under-powered study.

In a paper last year, McShane argued pretty persuasively that we need to stop calling something a failed replication when it has the same or larger effect as previous studies, but wider confidence intervals. The post you sent didn’t seem to get that message.

Second, the post uses outdated thinking about standardized effect sizes for interventions. The .1 to .18 in Rienzo are huge effects for adolescent RCTs. When you look at the I3 evaluations, which have the whole file drawer and pre-registered analyses, you can get an honest distribution of effects, and almost nothing exceeds .18 (Matt Kraft did this analysis). The median for adolescent interventions is .03. If the .18 is trustworthy, that’s massive, not counterevidence for the theory.

Likewise, the post says that an ATE of .08, which is what Sisk et al. estimated, is “extremely trivial.” But epidemiologists know really well (e.g. Rose’s prevention paradox) that a seemingly small average effect could mask important subgroup effects, and as long as those subgroup effects were reliable and not noise, then depending on the cost and scalability of the intervention, an ATE of .08 could be very important. And seemingly small effects can have big policy implications when they move people across critical thresholds. Consider that the ATE in our Nature paper was .05, and the effect in the pre-registered group of lower-achievers was .11. That corresponded to an overall 3 percentage point decrease in failing to make adequate progress in 9th grade, and a 3 point increase in taking advanced math the next year, both key policy outcomes. This is pretty good considering that we already showed the intervention could be scaled across the U.S. by third parties, and could be generalized to 3 million students per year in the U.S. I should note that Paunesku et al. 2015 and Yeager et al. 2016 also reported the D/F reduction in big samples, and a new paper from Norway replicated the advanced math result. So these are replicable, meaningful policy-relevant effects from a light-touch intervention, even if they seem small in terms of raw standard deviations.

Unfortunately, unrealistic thinking about effect sizes is common in psychology, and it is kept alive by the misapplication of effect size benchmarks, like you see in Sisk et al. Sisk et al. stated that the “average effect for a typical educational intervention on academic performance is .57” (pg. 569), but Macnamara is citing John Hattie’s meta-analysis. As Slavin put it, “John Hattie is wrong.” And in the very paper that Macnamara cites for the .57 SD “typical effect,” Hattie says that those are immediate, short-term effects; when he subsets on longer-term effects on academic outcomes, which the mindset interventions focus on, it “declined to an average of .10” (pg. 112). But Sisk/Macnamara cherry-pick the .57. I don’t see how Sisk et al. reporting .08 for the ATE and more than twice that for at-risk or low-SES groups is “damning.” An .08 ATE seems pretty good, considering the cost and scalability of the intervention and the robust subgroup effects.

The third outdated way of thinking is that it is focused on main effects, not heterogeneous effects. In a new paper that Beth Tipton and I wrote [see yesterday’s post], we call it a “hetero-naive” way of thinking.

One way this post is hetero-naive is by assuming that effects from convenience samples, averaged in a meta-analysis, give you “the effect” of something. I don’t see any reason to assume that meta-analysis of haphazard samples converges on a meaningful population parameter of any kind. It might turn out that way by chance sometimes, but that’s not a good default assumption. For instance, Jon Krosnick and I show the non-correspondence between meta-analyses of haphazard samples and replications in representative samples in the paper I sent you last year.

The post’s flawed assumption really pops out when this blog post author cites a meta-analysis of zero-order correlations between mindset and achievement. I don’t see any reason why we care about the average of a haphazard sample of correlational studies when we can look at truly generalizable samples. The 2018 PISA administered the mindset measure to random samples from 78 OECD nations, with ~600,000 respondents, and they find mindset predicts student achievement in all but three. With international generalizability, who cares what Sisk et al. found when their meta-analysis averaged a few dozen charter school kids with a few dozen undergrads and a bunch of students on a MOOC?

Or consider that this post doesn’t pay attention to intervention fidelity as an explanation for null results, even though that’s the very first thing that experts in education focus on (see this special issue). I heard that, in the case of the Foliano study, up to 30% of the control group schools already were using growth mindset and even attended the treatment trainings, and about half of the treatment group didn’t attend many of the trainings. On top of that, the study was a cluster-randomized trial and had an MDE larger than the effect our Nature paper found, which means they were unlikely to find effects even with perfect fidelity.

I don’t mean to trivialize the problems of treatment fidelity; they are real and they are hard to solve, especially in school-level random assignment. But those problems have nothing to do with growth mindset theory and everything to do with the challenges of educational RCTs. It’s not Carol Dweck’s fault that it’s hard to randomize teachers to PD.

Further, the post is turning a blind eye to another important source of heterogeneity: changes in the actual intervention. We have successfully delivered interventions to students, in two pre-registered trials: Yeager et al., 2016, and Yeager et al., 2019. But we don’t know very much at all yet about changing teachers or parents. And the manipulation with no effects in Rienzo was the teacher component, and the Foliano study also tried to change teachers and schools. These are good-faith studies but they’re ahead of the science. Here’s my essay on this. I think it’s important for scientists to dig in and study why it’s so hard to create growth mindset environments, ones that allow the intervention to take root. I don’t see much value in throwing our hands up and abandoning lots of promising ideas just because we haven’t figured out the next steps yet.

In light of this, it seems odd to conclude that Carol Dweck’s involvement is the special ingredient to a successful study, which I can only assume is done to discredit her research.

First, it isn’t true. Outes et al. did a study with the World Bank and found big effects (.1 to .2 SD), without Carol, and there’s the group of behavioral economists in Norway who replicated the intervention (I gave them our materials and they independently did the study).

Second, if I was a skeptic who wondered about independence (and I am a skeptic), I would ask for precisely the study we published in Nature: pre-registered analysis plan, independent data collection and processing, independent verification of the conclusions by MDRC, re-analysis using a multilevel Bayesian model (BCF) that avoids the problems with null hypothesis testing, and so on. But we already published that study, so it seems weird to be questioning the work now as if we haven’t already answered the basic questions of whether growth mindset effects are replicable by independent experimenters and evaluators.

The more sophisticated set of questions focuses on whether we know how to spread and scale the idea up in schools, whether we know how to train others to create effective growth mindset content, etc. And the answer to that is that we need to do lots of science, and quickly, to figure it out. And we’ll have to solve perennial problems with teacher PD and school culture change—problems that affect all educational interventions, not just mindset. I suspect that will be hard work that requires a big team and a lot of humility.

Oh, and the post mentions Li and Bates, but that’s just not a mindset intervention study. It’s a laboratory study of praise and its effects on attributions and motivation. It’s not that different from the many studies that Susan Gelman and others have done on essentialism and its effects on motivation. Those studies aren’t about long-term effects on grades or test scores so I don’t understand why this blog post mentions them. A funny heterogeneity-related footnote to Li and Bates is that they chose to do their study in one of the only places where mindset didn’t predict achievement in the PISA — rural China — while the original study was done in the U.S., where mindset robustly predicts achievement.
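The power arithmetic in Yeager’s reply (why a school-level randomized trial can have a minimum detectable effect several times larger than a student-level trial with the same number of students) can be sketched in a few lines. This is a back-of-the-envelope sketch using the standard design-effect approximation, with made-up inputs (the intraclass correlation and cluster counts are illustrative, not from any of the studies discussed):

```python
import math

def mde(n_clusters, cluster_size, icc):
    """Rough minimum detectable effect (in sd units, 80% power, alpha = .05)
    for a two-arm trial, with a simple design-effect correction for
    cluster randomization and no covariate adjustment."""
    n = n_clusters * cluster_size
    deff = 1 + (cluster_size - 1) * icc   # design effect for clustering
    z = 1.96 + 0.84                       # z_{alpha/2} + z_{power}
    return z * math.sqrt(4 * deff / n)

# ~2,000 students randomized individually vs. 40 schools of 50 students:
print(round(mde(2000, 1, 0.0), 3))   # individual randomization
print(round(mde(40, 50, 0.15), 3))   # cluster randomization, made-up icc
```

With the same 2,000 students, randomizing 40 intact schools instead of individuals roughly triples the minimum detectable effect, which is the gap between a study powered to find .1 SD and one powered only for .4 SD.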

Children’s Variety Seeking in Food Choices

Margaret Echelbarger et al. write:

Across three studies, we examine the variety selections of 329 children (4–9 years of age) and 81 adults in the food domain. In studies 1 and 2, we find that, like adults, children prefer to diversify their selections given no established preference for one item over another. In study 3, we find that children (4–9 years) diversify their selections more and choose more healthy options when choosing items simultaneously (all on one day) versus sequentially (across several days). Together, our results provide novel insight into the potential for variety to serve as a tool to promote greater well-being in childhood.

If variation in effects is so damn important and so damn obvious, why do we hear so little about it?

Earlier today we posted, “To Change the World, Behavioral Intervention Research Will Need to Get Serious About Heterogeneity,” and commenters correctly noted that this point applies not just in behavioral research but also in economics, public health, and other areas.

I wanted to follow this up with a question:

If variation in effects is so damn important and so damn obvious, why do we hear so little about it?

Here’s my quick response:

It’s difficult to estimate variability in treatment effects (recall the magic number 16), and in statistics we’re often trained to think that if something can’t be measured or estimated precisely, it can be safely ignored.
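Here’s where the 16 comes from: with a treatment and a subgroup indicator each coded +/- 0.5, the standard error of the interaction is twice that of the main effect, so matching its precision takes 4 times the sample size; if the interaction is also plausibly only half as large as the main effect, that’s another factor of 4, hence 16. A minimal simulation sketch (the coefficient values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.choice([-0.5, 0.5], size=n)  # treatment, coded +/- 0.5
x = rng.choice([-0.5, 0.5], size=n)  # subgroup, coded +/- 0.5
y = 0.5 * z + 0.25 * z * x + rng.normal(size=n)

# ordinary least squares with intercept, main effects, and interaction
X = np.column_stack([np.ones(n), z, x, z * x])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# the z*x column has half the sd of z, so its se is about twice as large
print(se[1], se[3])
```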

When I talk about embracing variation and accepting uncertainty, this is one reason why.
P.S. Thanks to Diana for the above photo of Sisi, who is really good at curve fitting and choice of priors.

“To Change the World, Behavioral Intervention Research Will Need to Get Serious About Heterogeneity”

Beth Tipton, Chris Bryan, and David Yeager write:

The increasing influence of behavioral science in policy has been a hallmark of the past decade, but so has a crisis of confidence in the replicability of behavioral science findings. In this essay, we describe a nascent paradigm shift in behavioral intervention research—a heterogeneity revolution—that we believe these two historical trends have already set in motion. The emerging paradigm recognizes that the unscientific samples that currently dominate behavioral intervention research cannot produce reliable estimates of an intervention’s real-world impact. Similarly, unqualified references to an intervention’s “true effect” are rarely warranted. Rather, the variation in effect estimates across studies that defines the current replication crisis is to be expected, even in the absence of false positives, as long as heterogeneous effects are studied without a systematic approach to sampling.

I agree! I’ve been ranting about this for a long time—hey, here’s a post from 2005, not long after we started this blog, and here’s another from 2009 . . . I guess there’s a division of labor on this one: I rant and Tipton et al. do something about it.

From one standpoint, the idea of varying treatment effects is obvious. But, when you look at what people do, this sort of variation is typically ignored. When I had my PhD training in the 1980s, we were taught all about causal inference. We learned randomization inference, we learned Bayesian inference, but it was always a model with constant treatment effect. Statistics textbooks—including my own!—always start with the model of constant treatment effect, treating interactions as an optional extra.

And the problem’s not just with statisticians. Behavioral scientists have also been stunningly unreflective regarding the relevance of varying treatment effects to their experimental study. For example, here’s an email I received a few years ago from a prominent psychology researcher: not someone I know personally, but a prominent, very well connected professor at a leading East Coast private university that’s not Cornell. In response to a criticism I gave regarding a paper that relied entirely on data from a self-selected sample of 100 women from the Internet, and 24 undergraduates, the prominent professor wrote:

Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you [know] why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans.

The paper in question did not attempt to generalize to “all humans,” just to women of childbearing age. But the title and abstract of the paper simply refer to “women” with no qualifications, and there is no doubt in my mind that the authors (and anyone else who found this study to be worth noting) are interested in some generalization to a larger population.

The point is that this leading psychology researcher who wrote me that email was so deep into the constant-treatment-effect mindset that he didn’t just think that particular study was OK, he also thought it was “absurd” to be concerned about the non-representativeness of a sample in a psychology experiment.

So that was a long digression. The point is that the message sent by Tipton, Bryan, and Yeager, while commonsensical and clear, is not so widely appreciated. For whatever reason, it’s taken people a while to come to this point.

Why? For one thing, interactions are hard to estimate. Remember 16. So, for a long time we’ve had this attitude that, since interactions are hard—sometimes essentially impossible—to identify from data, we might as well just pretend they don’t exist. It’s a kind of Pascal’s wager or bet-on-sparsity principle.

More recently, though, I’ve been thinking we need to swallow our pride and routinely model these interactions, structuring our models so that the interactions we estimate make sense. Some of this structuring can be done using informative priors, some of it can be done using careful choices of functional forms and transformations (as in my effects-of-survey-incentives paper with Lauren). But, even if we can’t accurately estimate these interactions or even reliably identify their signs, it can be a mistake to just exclude them, which is equivalent to assuming they’re zero.
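As a minimal sketch of what an informative prior buys you here, consider the conjugate normal-normal case: a noisy interaction estimate combined with a skeptical prior centered at zero. The numbers below are made up for illustration:

```python
def shrink(est, se, prior_mean=0.0, prior_sd=0.1):
    """Posterior mean and sd for a normal estimate under a normal prior:
    a precision-weighted average, so a skeptical prior pulls noisy
    interaction estimates most of the way toward zero."""
    prec_data = 1 / se**2
    prec_prior = 1 / prior_sd**2
    w = prec_data / (prec_data + prec_prior)
    post_mean = w * est + (1 - w) * prior_mean
    post_sd = (1 / (prec_data + prec_prior)) ** 0.5
    return post_mean, post_sd

# a noisy interaction estimate of 0.30 with se 0.20 shrinks to 0.06:
print(shrink(0.30, 0.20))
```

The estimate doesn’t vanish, but it no longer dominates the conclusions, which is the point: the interaction stays in the model without being taken at face value.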

Also, let’s move from the overplayed topic of analysis to the still-fertile topic of design. If certain interactions or aspects of varying treatment effects are important, let’s design studies to specifically estimate these!

To put it another way: We’re already considering treatment interactions, all the time.

Why do I say that? Consider the following two pieces of advice we always give to researchers seeking to test out a new intervention:

1. Make the intervention as effective as possible. In statistics terms, multiplying the effect size by X is equivalent to multiplying the sample size by X^2. So it makes sense to do what you can to increase that effect size.

2. Apply the intervention to people who will be most receptive to the treatment, and in settings where the treatment will be most effective.

OK, fine. So how do you do 1 and 2? You can only do these if you have some sense of how the treatment effect can vary based on manipulable conditions (that’s item 1) and based on observed settings (that’s item 2). It’s a Serenity Prayer kind of thing.
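The effect-size/sample-size tradeoff in item 1 can be checked directly: power for a two-sample z-test depends on the effect size times the square root of the sample size, so doubling the effect buys the same power as quadrupling n. A quick sketch (the effect sizes and sample sizes here are arbitrary):

```python
import math

def power(effect, n, z_crit=1.96):
    """Approximate power of a two-sided, two-sample z-test for a mean
    difference of `effect` sd units, with n subjects per arm and sd 1:
    Phi(effect * sqrt(n/2) - z_crit)."""
    z = effect * math.sqrt(n / 2) - z_crit
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(power(0.2, 200))   # small effect
print(power(0.4, 200))   # doubled effect
print(power(0.2, 800))   # same power as the doubled effect, at 4x the n
```

The last two calls give identical power, since what matters is the product effect * sqrt(n).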

So, yeah, understanding interactions is crucial, not just for interpreting experimental results, but for designing effective experiments that can yield conclusive findings.

Big changes coming

In our recent discussion of growth mindset interventions, Diana Senechal wrote:

We not only have a mixture of mindsets but actually benefit from the mixture—that we need a sense of limitation as well as of possibility. It is fine to know that one is better at certain things than at others. This allows for focus. Yes, it’s important to know that one can improve in areas of weakness. And one’s talents also contain weaknesses, so it’s helpful, overall, to know how to improve and to believe that it can happen. But it does not have to be an all-encompassing ideology, nor does it have to replace all belief in fixity or limitation. One day, someone will write a “revelatory” book about how the great geniuses actually knew they were bad at certain things–and how this knowledge allowed them to focus. That will then turn into some “big idea” and go to extremes of its own.

I agree. Just speaking qualitatively, as a student, teacher, sibling, and parent, I’d say the following:

– When I first heard about growth mindset as an idea, 20 or 30 years ago, it was a bit of a revelation to me: one of these ideas that is obvious and that we knew all along (yes, you can progress more if you don’t think of your abilities as fixed) but where hearing the idea stated in this way could change how we think.

– It seems clear that growth mindset can help some kids, but not all or even most, as these have to be kids who (a) haven’t already internalized growth mindset, and (b) are open and receptive to the idea. This is an issue in learning and persuasion and change more generally: For anything, the only people who will change are those who have not already changed and are willing to change. Hence a key to any intervention is to target the right people.

– If growth mindset becomes a dominant ideology, then it could be that fixed-mindset interventions could be helpful to some students. Indeed, maybe this is already the case.

The interesting thing is how much these above principles would seem to apply to so many psychological and social interventions. But when we talk about causal inference, we typically focus on the average treatment effect, and we often fit simple regression models in which the treatment effect is constant.

This suggests, in a God-is-in-every-leaf-of-every-tree way, that we’ve been thinking about everything all wrong for all these decades, focusing on causal identification and estimating “the treatment effect” rather than on these issues of receptivity to treatment.