Hey! Check out this short new introductory social science statistics textbook by Elena Llaudet and Kosuke Imai

Elena Llaudet points us to this new textbook for complete beginners. Here’s the table of contents:

1. Introduction [mostly on basic operations in R]

2. Estimating Causal Effects with Randomized Experiments [goes through an experiment on class sizes and test scores]

3. Inferring Population Characteristics via Survey Research [goes through the example of who in U.K. supported Brexit]

4. Predicting Outcomes Using Linear Regression [example of predicting GDP from night-time light emissions]

5. Estimating Causal Effects with Observational Data [example of estimating effects of Russian TV on Ukrainians’ voting behavior]

6. Probability [distributions, law of large numbers, central limit theorem]

7. Quantifying Uncertainty [estimation, confidence intervals, hypothesis testing]

And the whole thing is less than 250 pages! I haven’t looked at the whole book, but what I’ve seen is very impressive. Also, it’s refreshing to see an intro book proceeding from an entirely new perspective rather than just presenting the same old sequence of topics. There are lots of intro statistics books out there, with prices ranging from $0 to $198.47. This one is different, in a good way.

Seems like a great first book on statistics, especially for the social sciences—but not really limited to the social sciences either, as the general concepts of measurement, comparison, and variation arise in all application areas.

After you read, or teach out of, Llaudet and Imai’s new book, I recommend our own Regression and Other Stories, which I love so much that Aki and I have almost finished an entirely new book full of stories, activities, and demonstrations that can be used when teaching that material. But Regression and Other Stories is a lot for students who don’t have previous statistical background, so it’s good to see this new book as a starting point. As the title says, it’s a friendly and practical introduction!

How did Bill James get this one wrong on regression to the mean? Here are 6 reasons:

I’m a big fan of Bill James, but I think he might be picking up the wrong end of the stick here.

The great baseball analyst writes about what he calls the Law of Competitive Balance. His starting point is that teams that are behind are motivated to work harder to have a chance of winning, which moves them to switch to high-variance strategies such as long passes in football (more likely to score a touchdown, also more likely to get intercepted), etc. Here’s Bill James:

Why was there an increased chance of a touchdown being thrown?

Because the team was behind.

Because the team was behind, they had an increased NEED to score.

Because they had an increased need to score points, they scored more points.

That is one of three key drivers of The Law of Competitive Balance: that success increases when there is an increased need for success. This applies not merely in sports, but in every area of life. But in the sports arena, it implies that the sports universe is asymmetrical. . . .

Because this is true, the team which has the larger need is more willing to take chances, thus more likely to score points. The team which is ahead gets conservative, predictable, limited. This moves the odds. The team which, based on their position, would have a 90% chance to win doesn’t actually have a 90% chance to win. They may have an 80% chance to win; they may have an 88% chance to win, they may have an 89.9% chance to win, but not 90.

I think he’s mixing a correct point here with an incorrect point.

James’s true statement is that, as he puts it, “there is an imbalance in the motivation of the two teams, an imbalance in their willingness to take risks.” The team that’s behind is motivated to switch to strategies that increase the variance of the score differential, even at the expense of lowering its expected score differential. Meanwhile, the team that’s ahead is motivated to switch to strategies that decrease the variance of the score differential, even at the expense of lowering its expected score differential. In basketball, it can be as simple as the team that’s behind pushing up the pace and the team that’s ahead slowing things down. The team that’s trailing is trying to have some chance of catching up—their goal is to win, not to lose by a smaller margin; conversely, the team that’s in the lead is trying to minimize the chance of the score differential going to zero, not to run up the score. As James says, these patterns are averages and won’t occur from play to play. Even if you’re behind by 10 in a basketball game with 3 minutes to play, you’ll still take the open layup rather than force the 3-pointer, and even if you’re ahead by 10, you’ll still take the open shot with 20 seconds left on the shot clock rather than purely trying to eat up time. But on average the logic of the game leads to different strategies for the leading and trailing teams, and that will have consequences on the scoreboard.

James’s mistake is to think that, when it comes to the probability of winning, this dynamic on balance favors the team that’s behind. When strategies are flexible, the team that’s behind does not necessarily increase its probability of winning relative to what that probability would be if team strategies were constant. Yes, the team that’s behind will use strategies to increase the probability of winning, but the team that’s ahead will alter its strategy too. Speeding up the pace of play should, on average, increase the probability of winning for the trailing team (for example, increasing the probability from, I dunno, 10% to 15%), but meanwhile the team that’s ahead is slowing down the pace of play, which should send that probability back down. On net, will this favor the leading team or the trailing team when it comes to win probability? It will depend on the game situation. In some settings (for example, a football game where the team that’s ahead has the ball on first down with a minute left), it will favor the team that’s ahead. In other settings it will go the other way.

James continues:

That is one of the three key drivers of the Law of Competitive Balance. The others, of course, are adjustments and effort. When you’re losing, it is easier to see what you are doing wrong. Of course a good coach can recognize flaws in their plan of attack even when they are winning, but when you’re losing, they beat you over the head.

I don’t know about that. As a coach myself, I could just as well argue the opposite point, as follows. When you’re winning, you can see what works while having the freedom to experiment and adapt to fix what doesn’t work. But when you’re losing, it can be hard to know where to start or have a sense of what to do to improve.

Later on in his post, James mentions that, when you’re winning, part of that will be due to situational factors that won’t necessarily repeat. The quick way to say this is that, when you’re winning, part of your success is likely to be from “luck,” a formulation that I’m OK with as long as we take this term generally enough to refer to factors that don’t necessarily repeat, such as pitcher/batter matchups, to take an example from James’s post.

But James doesn’t integrate this insight into his understanding of the law of competitive balance. Instead, he writes:

If a baseball team is 20 games over .500 one year, they tend to be 10 games over the next. If a team is 20 games under .500 one year, they tend to be 10 games under the next year. If a team improves by 20 games in one year (even from 61-101 to 81-81) they tend to fall back by 10 games the next season. If they DECLINE by 10 games in a year, they tend to improve by 5 games the next season.

I began to notice similar patterns all over the map. If a batter hits .250 one year and .300 the next, he tends to hit about .275 the third year. Although I have not demonstrated that similar things happen in other sports, I have no doubt that they do. I began to wonder if this was actually the same thing happening, but in a different guise. You get behind, you make adjustments. You lose 100 games, you make adjustments. You get busy. You work harder. You take more chances. You win 100 games, you relax. You stand pat.

James’s description of the data is fine; his mistake is to attribute these changes to teams “making adjustments” or “standing pat.” That could be, but it could also be that teams that win 100 games “push harder” and that teams that lose 100 games “give up.” The real point is statistical, which is that this sort of “regression to the mean” will happen without any such adjustment effects, just from “luck” or “random variation” or varying situational factors.
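To see how this works with no behavioral story at all, here’s a minimal simulation sketch (in Python; the spread of team abilities, the 162-game season, and the cutoffs are invented for illustration, not taken from any of James’s data). Each team’s underlying quality is held fixed across two seasons, so the only difference between the seasons is binomial luck:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams, n_games = 10_000, 162   # many hypothetical teams, for stable averages

# Fixed "true" abilities (per-game win probabilities), identical in both
# seasons: no adjustments, no extra effort, no standing pat anywhere.
ability = np.clip(rng.normal(0.500, 0.05, size=n_teams), 0.3, 0.7)

# Two seasons of pure binomial luck around the same abilities.
wins1 = rng.binomial(n_games, ability)
wins2 = rng.binomial(n_games, ability)
over1 = 2 * wins1 - n_games   # games over .500 (wins minus losses), year 1
over2 = 2 * wins2 - n_games   # same quantity in year 2

for label, keep in [("~20 games over .500", (over1 >= 18) & (over1 <= 22)),
                    ("~20 games under .500", (over1 <= -18) & (over1 >= -22))]:
    print(f"{label} in year 1: average in year 2 = {over2[keep].mean():+.1f}")
```

In this setup, teams that finish around 20 games over .500 come back, on average, to roughly 12 games over the next year, and the 20-games-under teams improve by about the same amount: more or less the pattern James describes, produced entirely by the unreliability of a 162-game record.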

Here’s a famous example from Tversky and Kahneman (1973):

The instructors in a flight school adopted a policy of consistent positive reinforcement recommended by psychologists. They verbally reinforced each successful execution of a flight maneuver. After some experience with this training approach, the instructors claimed that contrary to psychological doctrine, high praise for good execution of complex maneuvers typically results in a decrement of performance on the next try.

Actually, though:

Regression is inevitable in flight maneuvers because performance is not perfectly reliable and progress between successive maneuvers is slow. Hence, pilots who did exceptionally well on one trial are likely to deteriorate on the next, regardless of the instructors’ reaction to the initial success. The experienced flight instructors actually discovered the regression but attributed it to the detrimental effect of positive reinforcement.

“Performance is not perfectly reliable and progress between successive maneuvers is slow”: That describes pro sports!

As we write in Regression and Other Stories, the point here is that a quantitative understanding of prediction clarifies a fundamental qualitative confusion about variation and causality. From purely mathematical considerations, it is expected that the best pilots will decline, relative to the others, while the worst will improve in their rankings, in the same way that we expect daughters of tall mothers to be, on average, tall but not quite as tall as their mothers, and so on.
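A minimal way to write down those “purely mathematical considerations” (this is just the standard regression identity, stated as a sketch; nothing about praise, motivation, or adjustments enters): if performance in two periods is roughly bivariate normal with a common mean and variance and correlation less than 1, the best prediction for the second period is pulled toward the mean.

```latex
% (y_1, y_2) bivariate normal with common mean \mu, common sd \sigma,
% and correlation \rho, where 0 < \rho < 1:
\mathrm{E}[\, y_2 \mid y_1 \,] = \mu + \rho\,(y_1 - \mu)
% A performance d standard deviations from the mean is predicted to be only
% \rho d standard deviations from the mean the next time around.
```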

I was surprised to see Bill James make this mistake. All the years I’ve read him writing about the law of competitive balance and the plexiglass principle, I always assumed that he’d understood this as an inevitable statistical consequence of variation without needing to try to attribute it to poorly-performing teams trying harder etc.

How did he get this one so wrong? Here are 6 reasons.

This raises a new question, which is how could such a savvy analyst make such a basic mistake? I have six answers:

1. Multiplicity. Statistics is hard, and if you do enough statistics, you’ll eventually make some mistakes. I make mistakes too! It just happens that this “regression to the mean” fallacy is a mistake that James made.

2. It’s a basic mistake and an important mistake, but it’s not a trivial mistake. Regression to the mean is a notoriously difficult topic to teach (you can cruise over to chapter 6 of our book and see how we do; maybe not so great!).

3. Statistics textbooks, including my own, are full of boring details, so I can see that, whether or not Bill James has read any such books, he wouldn’t have gotten much out of them.

4. In attributing regression to the mean to behavioral adjustments, James is making an error of causal reasoning and a modeling error, but it’s not a prediction error. The law of competitive balance and the plexiglass principle give valid predictions, and they represent insights that were not widely available in baseball (and many other fields) before James came along. Conceptual errors aside, James was still moving the ball forward, as it were. When he goes beyond prediction in his post, for example making strategy recommendations, I’m doubtful, but I’m guessing that the main influence on readers of his “law of competitive balance” comes from the predictive part.

5. Hero worship. The man is a living legend. That’s great—he deserves all his fame—but the drawback is that maybe it’s a bit too easy for him to fall for his own hype and not question himself or fully hear criticism. We’ve seen the same thing happen with baseball and political analyst Nate Silver, who continues to do excellent work but sometimes can’t seem to digest feedback from outsiders.

6. Related to point 5 is that James made his breakthroughs by fighting the establishment. For many decades he’s been saying outrageous things and standing up for his outrageous claims even when they’ve been opposed by experts in the field. So he keeps doing it, which in some ways is great but can also lead him astray, by trusting his intuitions too much and not leaving himself open to feedback.

I guess we could say that, in sabermetrics, James was on a winning streak for a long time so he relaxes. He stands pat. He has less motivation to see what’s going wrong.

P.S. Again, I’m a big fan of Bill James. It’s interesting when smart people make mistakes. When dumb people make mistakes, that’s boring. When someone who’s thought so much about statistics makes such a basic statistical error, that’s interesting to me. And, as noted in item 4 above, I can see how James could have this misconception for decades without it having much negative effect on his work.

P.P.S. Just to clarify: Bill James made two statements. The first was predictive and correct; the second was causal and misinformed.

His first, correct statement is that there is “regression to the mean” or “competitive balance” or “plexiglas” or whatever you want to call it: players or teams that do well in time 1 tend to decline in time 2, and players or teams that do poorly in time 1 tend to improve in time 2. This statement, or principle, is correct, and it can be understood as a general mathematical or statistical pattern that arises when correlations are less than 100%. This pattern is not always true—it depends on the joint distribution of the before and after measurements (see here), but it is typically the case.

His second, misinformed statement is that this is caused by players or teams that are behind at time 1 being more innovative or trying harder and players or teams that are ahead at time 1 being complacent or standing pat. This statement is misinformed because the descriptive phenomenon of regression-to-the-mean or competitive-balance or plexiglas will happen even in the absence of any behavioral changes. And, as discussed in the above post, behavioral changes can go in either direction; there’s no good reason to think that, when both teams make strategic adjustments, these adjustments will on net benefit the team that’s behind.

This is all difficult because it is natural to observe the first, correct predictive statement and from this to mistakenly infer the second, misinformed causal statement. Indeed this inference is such a common error that it is a major topic in statistics and is typically covered in introductory texts. We devote a whole chapter to it in Regression and Other Stories, and if you’re interested in understanding this I recommend you read chapter 6; the book is freely available online.

For the reasons discussed above, I’m not shocked that Bill James made this error: for his purposes, the predictive observation has been more important than the erroneous causal inference, and he figured this stuff out on his own, without the benefit or hindrance of textbooks, and given his past success as a rebel, I can see how it can be hard for him to accept when outsiders point out a subtle mistake. But, as noted in the P.S. above, when smart people get things wrong, it’s interesting; hence this post.

Blogs > Twitter again

As we’ve discussed many times, I prefer blogs to twitter because in a blog you can have a focused conversation where you explain your ideas in detail, whereas twitter seems like more of a place for position-taking.

An example came up recently that demonstrates this point. Jennifer sent me a blurb for her causal inference conference and I blogged it. This was an announcement and not much more; it could’ve been on twitter without any real loss of information. A commenter then shot back:

Do you see how your policies might possibly negatively impact an outlier such as myself, when you arbitrarily reward contestants for uncovering effects you baked in? How do you know winners just haven’t figured out how you think about manipulating data to find effects? How far removed from my personal, actual, non-ergodic life are your statistical stories, and what policies that impede me unintentionally are you contributing to?

OK, this has more words than your typical twitter post, but if I saw it on twitter I’d be cool with it: it’s an expression of strong disagreement.

It’s the next step where things get interesting. When I saw the above comment, my quick reaction was, “What a crank!” And of course I have no duty to respond at all; responding to blog comments is something I can do for fun when I have the time for it: it can be helpful to explore the limits of what can be communicated. In a twitter setting, I think the appropriate response would be a snappy retort.

But this is a blog, not twitter, so I replied as follows:

That’s a funny way of putting things! I’d say that if you don’t buy the premise of this competition, then you don’t have to play. Kinda like if you aren’t interested in winter sports you don’t need to watch the olympics right now. I guess you might reply that our tax money is (indirectly) funding this competition, but then again our tax money funds the olympics too.

Getting to the topic at hand: No, I don’t know that the research resulting from this sort of competition will ultimately improve education policy. Or, even if it does, it presumably won’t improve everyone’s education, and it could be that students who are similar to you in some ways will be among those who end up with worse outcomes. All I can say is that this sort of question—variation in treatment effects, looking at effects on individuals, not just on averages—is a central topic of modern causal inference and has been so for a while. So, to the extent that you’re interested in evaluating policies in this way, I think this sort of competition is going in the right direction.

Regarding specifics: I think that after the competition is over, the team that constructed it will publicly release the details of what they did. So at that point in the not-so-distant future, you can take a look, and, if you see problems with it, you can publish your criticisms. That could be useful.

I’m not saying this response of mine was perfect. I’m just saying that the blog format was well suited to a thoughtful response, a deepening of the intellectual exchange and a rhetorical de-escalation, which is kind of the opposite of position-taking on twitter.

P.S. Also relevant is this post by Rob Hyndman, A brief history of time series forecasting competitions. I don’t know anything about the history of causal inference competitions, or the extent to which these were inspired by forecasting competitions. The same general question arises, of what’s being averaged over.

Les Distributions a Priori pour l’Inférence Causale (my talk in Paris Tues 11 Oct 14h)

Here it is:

Les Distributions a Priori pour l’Inférence Causale

In Bayesian inference one must specify a model for the data (that is, a likelihood) and a model for the parameters (a prior distribution). Consider two questions:
1. Why is it more difficult to specify the likelihood than the prior distribution?
2. In specifying the prior distribution, how can we move between the theoretical literature (invariance, convergence to the normal distribution, etc.) and the applied literature (expert elicitation, robustness, etc.)?
I will discuss these questions in the domain of causal inference: prior distributions for causal effects, for regression coefficients, and for the other parameters in causal models.

If you follow the link you’ll see that somewhere along the line they translated my title and abstract into English. It seems that the talk is supposed to be in English too, which I guess will make it a bit more coherent. The causal inference connection should be interesting. It’s not a talk about causal inference; it’s more that thinking about causal inference can give us some insight into setting up models. As is often the case, we can do more when we engage with subject-matter storylines.

“The distinction between exploratory and confirmatory research cannot be important per se, because it implies that the time at which things are said is important”

This is Jessica. Andrew recently blogged in response to an article by McDermott arguing that pre-registration has costs like being unfair to junior scholars. I agree with his view that pre-registration can be a pre-condition for good science but not a panacea, and was not convinced by many of the reasons presented in the McDermott article for being skeptical about pre-registration. For example, maybe it’s true that requiring pre-registration would favor those with more resources, but the argument given seemed quite speculative. I kind of doubt the hypothesis that many researchers are trying out a whole bunch of studies and then pre-registering and publishing the ones where things work out as expected. If anything, I suspect pre-pre-registration experimentation looks more like researchers starting with some idea of what they want to see and then tweaking their study design or definition of the problem until they get data they can frame as consistent with some preferred interpretation (a.k.a. design freedoms). Whether this is resource-intensive in a divisive way seems hard to comment on without more context. Anyway, my point in this post is not to further pile on the arguments in the McDermott critique, but to bring up certain more nuanced critiques of pre-registration that I have found useful for getting a wider perspective, and which all this reminded me of.

In particular, arguments that Chris Donkin gave in a talk in 2020 about work with Aba Szollosi on pre-registration (related papers here and here) caught my interest when I first saw the talk and have stuck with me. Among several points the talk makes, one is that pre-registration doesn’t deserve privileged status among proposed reforms because there’s no single strong argument for what problem it solves. The argument he makes is NOT that pre-registration isn’t often useful, both for transparency and for encouraging thinking. Instead, it’s that bundling up a bunch of reasons why preregistration is helpful (e.g., p-hacking, HARKing, blurred boundary between EDA and CDA) misdiagnoses the issues in some cases, and risks losing the nuance in the various ways that pre-registration can help. 

Donkin starts by pointing out how common arguments for pre-registration don’t establish privileged status. For example, if we buy the “nudge” argument that pre-registration encourages more thinking which ultimately leads to better research, then we have to assume that researchers by and large have all the important knowledge or wisdom they need to do good research inside of them; it’s just that they are somehow too rushed to make use of it. Another is that the argument that we need controlled error rates in confirmatory data analysis, and thus a clear distinction between exploratory and confirmatory research, implies that the time at which things are said is important. But if we take that seriously, we’re implying there’s somehow a causal effect of saying what we will find ahead of time that makes it more true later. In other domains, however, like criminal law, it would seem silly to argue that because an explanation was proposed after the evidence came in, it can’t be taken seriously.

The problem, Donkin argues, is that the role of theory is often overlooked in strong arguments for pre-registration. In particular, the idea that we need a sharp contrast between exploratory and confirmatory data analysis doesn’t really make sense when it comes to testing theory.

For instance, Donkin argues that we regularly pretend that we have a random sample in CDA, because that’s what gives it its validity, and the barebones statistical argument for pre-registration is that with EDA we no longer have a random sample, invalidating our inferences. However, in light of the importance of this assumption that we have a random sample in CDA, post-hoc analysis is critical to confirming that we do. We should be poking the data in whatever ways we can think up to see if we can find any evidence that the assumptions required of CDA don’t hold. If not, we shouldn’t trust any tests we run anyway. (Of course, one could preregister a bunch of preliminary randomization checks. But the point seems to be that there are activities that are essentially EDA-ish that can be done only when the data comes in, challenging the default). 

When we see pre-registration as “solving” the problem of EDA/CDA overlap, we invert an important distinction related to why we expect something that happened before to happen again. The reason it’s okay for us to rely on inductive reasoning like this is that we embed the inference in theory: the explanation motivates the reason why we expect the thing to happen again. Strong arguments for pre-registration as a fix for “bad” overlap imply that this inductive reasoning is the fundamental first principle, rather than a tool embedded in our pursuit of better theory. In other words, taking pre-registration too seriously as a solution implies we should put our faith in the general principle that the past repeats itself. But we don’t use statistics because they create valid inferences; we use them because they are a tool for creating good theories.

Overall, what Donkin seems to be emphasizing here is that there’s a rhetorical risk to too easily accepting that pre-registration is the solution to a clear problem (namely, that EDA and CDA aren’t well separated). Despite the obvious p-hacking examples we may think of when we think about the value of pre-registration, buying too heavily into this characterization isn’t necessarily doing pre-registration a favor, because it hides a lot of nuance in the ways that pre-registration can help. For example, if you ask people why pre-registration is useful, different people may stress different reasons. If you give pre-registration an elevated status for the supposed reason that it “solves” the problem of EDA and CDA not being well distinguished, then, similar to how any nuance in the intended usage of NHST has been lost, you may lose the nuance of pre-registration as an approach that can improve science, and increase preoccupation with a certain way of (mis)diagnosing the problems. Devezer et al. (and perhaps others I’m missing) have also pointed out the slipperiness of placing too much faith in the EDA/CDA distinction. Ultimately, we need to be a lot more careful in stating what problems we’re solving with reforms like pre-registration.

Again, none of this is to take away from the value of pre-registration in many practical settings, but to point out some of the interesting philosophical questions thinking about it critically can bring up.

History, historians, and causality

Through an old-fashioned pattern of web surfing of blogrolls (from here to here to here), I came across this post by Bret Devereaux on non-historians’ perceptions of academic history. Devereaux is responding to some particular remarks from economics journalist Noah Smith, but he also points to some more general issues, so these points seem worth discussing.

Also, I’d not previously encountered Smith’s writing on the study of history, but he recently interviewed me on the subjects of statistics and social science and science reform and causal inference so that made me curious to see what was up.

Here’s how Devereaux puts it:

Rather than focusing on converting the historical research of another field into data, historians deal directly with primary sources . . . rather than engaging in very expansive (mile wide, inch deep) studies aimed at teasing out general laws of society, historians focus very narrowly in both chronological and topical scope. It is not rare to see entire careers dedicated to the study of a single social institution in a single country for a relatively short time because that is frequently the level of granularity demanded when you are working with the actual source evidence ‘in the raw.’

Nevertheless as a discipline historians have always held that understanding the past is useful for understanding the present. . . . The epistemic foundation of these kinds of arguments is actually fairly simple: it rests on the notion that because humans remain relatively constant situations in the past that are similar to situations today may thus produce similar outcomes. . . . At the same time it comes with a caveat: historians avoid claiming strict predictability because our small-scale, granular studies direct so much of our attention to how contingent historical events are. Humans remain constant, but conditions, technology, culture, and a thousand other things do not. . . .

He continues:

I think it would be fair to say that historians – and this is a serious contrast with many social scientists – generally consider strong predictions of that sort impossible when applied to human affairs. Which is why, to the frustration of some, we tend to refuse to engage counter-factuals or grand narrative predictions.

And he then quotes a journalist, Matthew Yglesias, who wrote, “it’s remarkable — and honestly confusing to visitors from other fields — the extent to which historians resist explicit reasoning about causation and counterfactual analysis even while constantly saying things that clearly implicate these ideas.” Devereaux responds:

We tend to refuse to engage in counterfactual analysis because we look at the evidence and conclude that it cannot support the level of confidence we’d need to have. . . . historians are taught when making present-tense arguments to adopt a very limited kind of argument: Phenomenon A1 occurred before and it resulted in Result B, therefore as Phenomenon A2 occurs now, result B may happen. . . . The result is not a prediction but rather an acknowledgement of possibility; the historian does not offer a precise estimate of probability (in the Bayesian way) because they don’t think accurately calculating even that is possible – the ‘unknown unknowns’ (that is to say, contingent factors) overwhelm any system of assessing probability statistically.

This all makes sense to me. I just want to do one thing, which is to separate two ideas that I think are being conflated here:

1. Statistical analysis: generalizing from observed data to a larger population, a step that can arise in various settings including sampling, causal inference, prediction, and modeling of measurements.

2. Causal inference: making counterfactual statements about what would have happened, or could have happened, had some past decision been made differently, or making predictions about potential outcomes under different choices in some future decision.

Statistical analysis and causal inference are related but are not the same thing.

For example, if historians gather data on public records from some earlier period and then make inferences about the distribution of people working in different professions at that time, that’s a statistical analysis, but it does not involve causal inference.

From the other direction, historians can think about causal inference and use causal reasoning without formal statistical analysis or probabilistic modeling of data. Back before he became a joke and a cautionary tale of the paradox of influence, historian Niall Ferguson edited a fascinating book, Virtual History: Alternatives and Counterfactuals, a book of essays by historians on possible alternative courses of history, about which I wrote:

There have been and continue to be other books of this sort . . . but what makes the Ferguson book different is that he (and most of the other authors in his book) are fairly rigorous in only considering possible actions that the relevant historical personalities were actually considering. In the words of Ferguson’s introduction: “We shall consider as plausible or probable only those alternatives which we can show on the basis of contemporary evidence that contemporaries actually considered.”

I like this idea because it is a potentially rigorous extension of the now-standard “Rubin model” of causal inference.

As Ferguson puts it,

Firstly, it is a logical necessity when asking questions about causality to pose ‘but for’ questions, and to try to imagine what would have happened if our supposed cause had been absent.

And the extension to historical reasoning is not trivial, because it requires examination of actual historical records in order to assess which alternatives are historically reasonable. . . . to the best of their abilities, Ferguson et al. are not just telling stories; they are going through the documents and considering the possible other courses of action that had been considered during the historical events being considered. In addition to being cool, this is a rediscovery and extension of statistical ideas of causal inference to a new field of inquiry.

See also here. The point is that it was possible for Ferguson et al. to do formal causal reasoning, or at least consider the possibility of doing it, without performing statistical analysis (thus avoiding the concern that Devereaux raises about weak evidence in comparative historical studies).

Now let’s get back to Devereaux, who writes:

This historian’s approach [to avoid probabilistic reasoning about causality] holds significant advantages. By treating individual examples in something closer to the full complexity (in as much as the format will allow) rather than flattening them into data, they can offer context both to the past event and the current one. What elements of the past event – including elements that are difficult or even impossible to quantify – are like the current one? Which are unlike? How did it make people then feel and so how might it make me feel now? These are valid and useful questions which the historian’s approach can speak to, if not answer, and serve as good examples of how the quantitative or ’empirical’ approaches that Smith insists on are not, in fact, the sum of knowledge or required to make a useful and intellectually rigorous contribution to public debate.

That’s a good point. I still think that statistical analysis can be valuable, even with very speculative sampling and data models, but I agree that purely qualitative analysis is also an important part of how we learn from data. Again, this is orthogonal to the question of when we choose to engage in causal reasoning. There’s no reason for bad data to stop us from thinking causally; rather, the limitations in our data merely restrict the strengths of any causal conclusions we might draw.

The small-N problem

One other thing. Devereaux refers to the challenges of statistical inference: “we look at the evidence and conclude that it cannot support the level of confidence we’d need to have. . . .” That’s not just a problem with the field of history! It also arises in political science and economics, where we don’t have a lot of national elections or civil wars or depressions, so generalizations necessarily rely on strong assumptions. Even if you can produce a large dataset with thousands of elections or hundreds of wars or dozens of business cycles, any modeling will implicitly rely on some assumption of stability of a process over time, an assumption that won’t necessarily make sense given changes in political and economic systems.

So it’s not really history versus social sciences. Rather, I think of history as one of the social sciences (as in my book with Jeronimo from a few years back), and they all have this problem.

The controversy

After writing all the above, I clicked through the link and read the post by Smith that Devereaux was arguing against.

And here’s the funny thing. I found Devereaux’s post to be very reasonable. Then I read Smith’s post, and I found that to be very reasonable too.

The two guys are arguing against each other furiously, but I agree with both of them!

What gives?

As discussed above, I think Devereaux in his post provides an excellent discussion of the limits of historical inquiry. On the other side, I take the main message of Smith’s post to be that, to the extent that historians want to use their expertise to make claims about the possible effects of recent or new policies, they should think seriously about statistical inference issues. Smith doesn’t just criticize historians here; he leads off by criticizing academic economists:

After having endured several years of education in that field, I [Smith] was exasperated with the way unrealistic theories became conventional wisdom and even won Nobel prizes while refusing to submit themselves to rigorous empirical testing. . . . Though I never studied history, when I saw the way that some professional historians applied their academic knowledge to public commentary, I started to recognize some of the same problems I had encountered in macroeconomics. . . . This is not a blanket criticism of the history profession . . . All I am saying is that we ought to think about historians’ theories with the same empirically grounded skepticism with which we ought to regard the mathematized models of macroeconomics.

By saying that I found both Devereaux and Smith to be reasonable, I’m not claiming they have no disagreements. I think their main differences arise because they’re focusing on two different things. Smith’s post is ultimately about public communication and the things that academics say in the public discourse (things like newspaper op-eds and twitter posts) with relevance to current political disputes. And, for that, we need to consider the steps, implicit or explicit, that commentators take to go from their expertise to the policy claims they make. Devereaux is mostly writing about academic historians in their professional roles. With rare exceptions, academic history is about getting the details right, and even popular books of history typically focus on what happened, and our uncertainty about what happened, not on larger theories.

I guess I do disagree with this statement from Smith:

The theories [from academic history] are given even more credence than macroeconomics even though they’re even less empirically testable. I spent years getting mad at macroeconomics for spinning theories that were politically influential and basically un-testable, then I discovered that theories about history are even more politically influential and even less testable.

Regarding the “less testable” part, I guess it depends on the theories—but, sure, many theories about what has happened in the past can be essentially impossible to test, if conditions have changed enough. That’s unavoidable. As Devereaux replies, this is not a problem with the study of history; it’s just the way things are.

But I can’t see how Smith could claim with a straight face that theories from academic history are “given more credence” and are “more politically influential” than macroeconomics. The president has a council of economic advisers, there are economists at all levels of the government, and if you want to talk about the news media, there are economists such as Krugman, Summers, Stiglitz, etc. . . . sure, they don’t always get what they want when it comes to policy, but they’re quoted endlessly and given lots of credence. This is also the case in narrower areas, for example James Heckman on education policy or Angus Deaton on deaths of despair: these economists get tons of credence in the news media. There are no academic historians with that sort of influence. This has come up before: I’d say that economics now is comparable to Freudian psychology in the 1950s in its influence on our culture:

My best analogy to economics exceptionalism is Freudianism in the 1950s: Back then, Freudian psychiatrists were on the top of the world. Not only were they well paid, well respected, and secure in their theoretical foundations, they were also at the center of many important conversations. Even those people who disagreed with them felt the need to explain why the Freudians were wrong. Freudian ideas were essential, leaders in that field were national authorities, and students of Freudian theory and methods could feel that they were initiates in a grand tradition, a priesthood if you will. Freudians felt that, unlike just about everybody else, they treated human beings scientifically and dispassionately. What’s more, Freudians prided themselves on their boldness, their willingness to go beyond taboos to get to the essential truths of human nature. Sound familiar?

When it comes to influence in policy or culture or media, academic history doesn’t even come close to Freudianism in the 1950s or economics in recent decades.

This is not to say we should let historians off the hook when they make causal claims or policy recommendations. We shouldn’t let anyone off the hook. In that spirit, I appreciate Smith’s reminder of the limits of historical theories, along with Devereaux’s clarification of what historians really do when they’re doing academic history (as opposed to when they’re slinging around on twitter).

Why write about this at all?

As a statistician and political scientist, I’m interested in issues of generalization from academic research to policy recommendations. Even in the absence of any connection with academic research, people will spin general theories—and one problem with academic research is that it can give researchers, journalists, and policymakers undue confidence in bad theories. Consider, for example, the examples of junk science promoted over the years by the Freakonomics franchise. So I think these sorts of discussions are important.

Update on estimates of effects of college football games on election outcomes

Anthony Fowler writes:

As you may recall, Pablo Montagnes and I wrote a paper on college football and elections in 2015 where we looked at additional evidence and concluded that the original Healy, Malhotra, and Mo (2010) result was likely a false positive. You covered this here and here.

Interestingly, the story isn’t completely over. Graham, Huber, Malhotra, and Mo have a forthcoming JOP paper claiming that the evidence is mostly supportive of the original hypothesis. They added some new observations, pooled all the data together, and re-ran some specifications that are very similar to those of the original Healy et al. paper. The results got weaker, but they’re still mostly in the expected direction.

Pablo and I wrote a reply to this paper, available here, which is also forthcoming in the JOP. We ran some simulations showing that their results are in line with what we would expect if the original result was a chance false positive, and their results are much weaker than what we would expect if the original result was a genuine effect of the magnitude reported in the original paper.

They wrote a reply to our reply, which we only learned about recently when it appeared on the JOP site.

We have written a brief reply to their reply. We assume that the JOP won’t be interested in publishing yet another reply, but if you think this is interesting, we would greatly appreciate you covering this topic and sharing our reply.

There is a lot more to discuss. For example, Graham et al. say that they are conducting an independent replication using the principles of open science. But the data and design are very similar to the original paper, so this is neither independent nor a replication. They argue that they pre-registered their analyses, but they had already seen very similar specifications run on very similar data, so it’s not so clear that we should think of these as pre-registered analyses. They appear to have deviated from their pre-analysis plan by failing to report results using only the out-of-sample data (they just show results using the in-sample data and the pooled data, but not the out-of-sample data). They also exercise some degrees of freedom (and deviate from their pre-analysis plan) in deciding what should count as out of sample.

I have three quick comments:

First, much of the above-linked discussion concerns what counts as a preregistered replication. It’s important for people to consider these issues carefully but they don’t interest me so much, at least not in this setting where, ultimately, the amount of data is not large enough to learn much of anything without some strong theory.

Second, although I’m generally in sympathy with the arguments made by Fowler and Montagnes, I don’t like their framing of the problem in terms of “false positives.” I don’t think the effect of a football game on the outcome of an election is zero. What I do think (until persuaded by strong evidence to the contrary) is that these effects are likely to be small, are highly variable when they’re not small, and won’t show up as large effects in average analyses. In practice, that’s not a lot different than calling these effects “false positives,” but I don’t like going around saying that effects are zero. It’s enough to say that they are not large and predictable, which would be necessary for them to be detectable from the usual statistical analysis.

Third, when reading Fowler and Montagnes’s final points regarding political accountability, I’m reminded of our work on the piranha principle: Once you accept the purportedly large effects of football games, shark attacks, etc., where do you stop? To put it another way, it’s not impossible that college football games have large and consistent effects on election outcomes, but there are serious theoretical problems with such a model of the world, because then you have to either have a theory of what’s so special about college football or else you have a logjam of large effects from all sorts of inputs.

Some concerns about the recent Chetty et al. study on social networks and economic inequality, and what to do next?

I happened to receive two different emails regarding a recently published research paper.

Dale Lehman writes:

Chetty et al. (and it is a long et al. list) have several publications about social and economic capital (see here for one such paper, and here for the website from which the data can also be accessed). In the paper above, the data is described as:

We focus on Facebook users with the following attributes: aged between 25 and 44 years who reside in the United States; active on the Facebook platform at least once in the previous 30 days; have at least 100 US-based Facebook friends; and have a non-missing residential ZIP code. We focus on the 25–44-year age range because its Facebook usage rate is greater than 80% (ref. 37). On the basis of comparisons to nationally representative surveys and other supplementary analyses, our Facebook analysis sample is reasonably representative of the national population.

They proceed to measure social and economic connectedness across counties, zip codes, and for graduates of colleges and high schools. The data is massive as is the effort to make sense out of it. In many respects it is an ambitious undertaking and one worthy of many kudos.

But I [Lehman] do have a question. Given their inclusion criteria, I wonder about selection bias when comparing counties, zip codes, colleges, or high schools. I would expect that the fraction of Facebook users – even in the targeted age group – that are included will vary across these segments. For example, one college may have many more of its graduates who have that number of Facebook friends and have used Facebook in the prior 30 days compared with a second college. Suppose the economic connectedness from the first college is greater than from the second college. But since the first college has a larger proportion of relatively inactive Facebook users, is it fair to describe college 1 as having greater connectedness?

It seems to me that the selection criteria make the comparisons potentially misleading. It might be accurate to say that the regular users of Facebook from college 1 are more connected than those from college 2, but this may not mean that the graduates from college 1 are more connected than the graduates from college 2. I haven’t been able to find anything in their documentation to address the possible selection bias and I haven’t found anything that mentions how the proportion of Facebook accounts that meet their criteria varies across these segments. Shouldn’t that be addressed?

That’s an interesting point. Perhaps one way to address it would be to preprocess the data by estimating a propensity to use Facebook (or, more precisely, to meet the study’s inclusion criteria) and then using this propensity as a poststratification variable in the analysis. I’m not sure. Lehman makes a convincing case that this is a concern when comparing different groups; that said, it’s the kind of selection problem we have all the time, and typically ignore, with survey data.
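To make that suggestion a bit more concrete, here is a rough sketch of the kind of preprocessing I have in mind, written as a crude inverse-propensity weighting (a full poststratification would also need population counts within propensity strata). Everything in it is an illustrative assumption on my part: the made-up reference survey, the variable names, and the fake connectedness scores; it is not Chetty et al.’s pipeline or data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical reference survey: demographics plus an indicator for whether the
# respondent meets the Facebook inclusion criteria (100+ US friends, active in
# the last 30 days).  All names and numbers are made up for illustration.
n = 2000
survey = pd.DataFrame({
    "age": rng.integers(25, 45, n),
    "female": rng.integers(0, 2, n),
    "college_grad": rng.integers(0, 2, n),
})
survey["included"] = (rng.random(n) < 0.5 + 0.1 * survey["college_grad"]
                      + 0.005 * (survey["age"] - 35)).astype(int)

# Step 1: estimate the propensity to be included, given demographics.
X_cols = ["age", "female", "college_grad"]
propensity_model = LogisticRegression().fit(survey[X_cols], survey["included"])

# Step 2: in the analysis sample (fake connectedness scores for two colleges),
# weight each included person by 1 / estimated propensity, so that a college
# whose graduates are mostly excluded isn't judged on an unusually selected
# slice of its alumni.
m = 1000
analysis = pd.DataFrame({
    "college": ["A"] * (m // 2) + ["B"] * (m // 2),
    "age": rng.integers(25, 45, m),
    "female": rng.integers(0, 2, m),
    "college_grad": 1,
    "connectedness": rng.random(m),
})
analysis["weight"] = 1 / propensity_model.predict_proba(analysis[X_cols])[:, 1]

adjusted = analysis.groupby("college").apply(
    lambda d: np.average(d["connectedness"], weights=d["weight"]))
print(adjusted)   # propensity-adjusted average connectedness by college
```

Whether an adjustment like this is credible still rests on the assumption that excluded people look like included people with the same demographics, which is itself debatable; the sketch is only meant to show where the propensity estimate would slot into the comparison.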

Richard Alba writes in with a completely different concern:

You may be aware of the recent research, published in Nature by the economist Raj Chetty and colleagues, purporting to show that social capital in the form of early-life ties to high-status friends provides a powerful pathway to upward mobility for low-status individuals. It has received a lot of attention, from The New York Times, Brookings, and no doubt other places I am not aware of.

In my view, they failed to show anything new. We have known since the 1950s that social capital has a role in mobility, but the evidence they develop about its great power is not convincing, in part because they fail to take into account how their measure of social capital, the predictor, is contaminated by the correlates and consequences of mobility, the outcome.

This research has been greeted in some media as a recipe for the secret sauce of mobility, and one of their articles in Nature (there are two published simultaneously) is concerned with how to increase social capital. In other words, the research is likely to give rise to policy proposals. I think it is important then to inform Americans about its unacknowledged limitations.

I sent my critique to Nature, and it was rejected because, in their view, it did not sufficiently challenge the articles’ conclusions. I find that ridiculous.

I have no idea how Nature decides what critiques to publish, and I have not read the Chetty et al. articles so I can’t comment on them either, but I can share Alba’s critique. Here it is:

While the pioneering big-data research of Raj Chetty and his colleagues is transforming the long-standing stream of research into social mobility, their findings should not be exempt from critique.

Consider in this light the recent pair of articles in Nature, in which they claim to have demonstrated a powerful causal connection between early-life social capital and upward income mobility for individuals growing up in low-income families. According to one paper’s abstract, “the share of high-SES friends among individuals with low-SES—which we term economic connectedness—is among the strongest predictors of upward income mobility identified to date.”

But there are good reasons to doubt that this causal connection is as powerful as the authors claim. At a minimum, the social capital-mobility statistical relationship is significantly overstated.

This is not to deny a role for social capital in determining adult socioeconomic position. That has been well established for decades. As early as the 1950s, the Wisconsin mobility studies focused in part on what the researchers called “interpersonal influence,” measured partly in terms of high-school friends, an operationalization close to the idea in the Chetty et al. article. More generally, social capital is indisputably connected to labor-market position for many individuals because of the role social networks play in disseminating job information.

But these insights are not the same as saying that economic connectedness, i.e., cross-class ties, is the secret sauce in lifting individuals out of low-income situations. To understand why the articles’ evidence fails to demonstrate this, it is important to pay close attention to how the data and analysis are constructed. Many casual readers, who glance at the statements like the one above or read the journalistic accounts of the research (such as the August 1 article in The New York Times), will take away the impression that the researchers have established an individual-level relationship—that they have proven that individuals from low-SES families who have early-life cross-class relationships are much more likely to experience upward mobility. But, in fact, they have not.

Because of limitations in their data, their analysis is based on the aggregated characteristics of areas—counties and zip codes in this case—not individuals. This is made necessary because they cannot directly link the individuals in their main two sources of data—contemporary Facebook friendships and previous estimates by the team of upward income mobility from census and income-tax data. Hence, the fundamental relationship they demonstrate is better stated as: the level of social mobility is much higher in places with many cross-class friendships. The correlation, the basis of their analysis, is quite strong, both at the county level (.65) and at the zip-code level (.69).

Inferring that this evidence demonstrates a powerful causal mechanism linking social capital to the upward mobility of individuals runs headlong into a major problem: the black box of causal mechanisms at the individual level that can lie behind such an ecological correlation, where moreover both variables are measured for roughly the same time point. The temptation may be to think that the correlation reflects mainly, or only, the individual-level relationship between social capital and mobility as stated above. However, the magnitude of an area-based correlation may be deceptive about the strength of the correlation at the individual level. Ever since a classic 1950 article by W. S. Robinson, it has been known that ecological correlations can exaggerate the strength of the individual-level relationship. Sometimes the difference between the two is very large, and in the case of the Chetty et al. analysis it appears impossible given the data they possess to estimate the bias involved with any precision, because Robinson’s mathematics indicates that the individual-level correlations within area units are necessary to the calculation. Chetty et al. cannot calculate them.

A second aspect of the inferential problem lies in the entanglement in the social-capital measure of variables that are consequences or correlates of social mobility itself, confounding cause and effect. This risk is heightened because the Facebook friendships are measured in the present, not prior to the mobility. Chetty et al. are aware of this as a potential issue. In considering threats to the validity of their conclusion, they refer to the possibility of “reverse causality.” What they have in mind derives from an important insight about mobility—mobile individuals are leaving one social context for another. Therefore, they are also leaving behind some individuals, such as some siblings, cousins, and childhood buddies. These less mobile peers, who remain in low-SES situations but have in their social networks others who are now in high-SES ones, become the basis for the paper’s Facebook estimate of economic connectedness (which is defined from the perspective of low-SES adults between the ages of 25 and 44). This sort of phenomenon will be frequent in high-mobility places, but it is a consequence of mobility, not a cause. Yet it almost certainly contributes to the key correlation—between economic connectedness and social mobility—in the way the paper measures it.

Chetty et al. try to answer this concern with correlations estimated from high-school friendships, arguing that the timing purges this measure of mobility’s impact on friendships. The Facebook-based version of this correlation is noticeably weaker than the correlations that the paper emphasizes. In any event, demonstrating a correlation between teen-age economic connectedness and high mobility does not remove the confounding influence of social mobility from the latter correlations, on which the paper’s argument depends. And in the case of high-school friendships, too, the black-box nature of the causality behind the correlation leaves open the possibility of mechanisms aside from social capital.

This can be seen if we consider the upward mobility of the children of immigrants, surely a prominent part today of the mobility picture in many high-mobility places. Recently, the economists Ran Abramitzky and Leah Boustan have reminded us in their book Streets of Gold that, today as in the past, the children of immigrants, the second generation, leap on average far above their parents in any income ranking. Many of these children are raised in ambitious families, where, as Abramitzky and Boustan put it, immigrants typically are “under-placed” in income terms relative to their abilities. Many immigrant parents encourage their children to take advantage of opportunities for educational advancement, such as specialized high schools or advanced-placement high-school classes, likely to bring them into contact with peers from more advantaged families. This can create social capital that boosts the social mobility of the second generation, but a large part of any effect on mobility is surely attributable to family-instilled ambition and to educational attainment substantially higher than one would predict from parental status. The increased social capital is to a significant extent a correlate of ongoing mobility.

In sum, there is without doubt a causal linkage between social capital and mobility. But the Chetty et al. analysis overstates its strength, possibly by a large margin. To twist the old saw about correlation and causation, correlation in this case isn’t only causation.

I [Alba] believe that a critique is especially important in this case because the findings in the Chetty et al. paper create an obvious temptation for the formulation of social policy. Indeed, in their second paper in Nature, the authors make suggestions in this direction. But before we commit ourselves to new anti-poverty policies based on these findings, we need a more certain gauge of the potential effectiveness of social capital than the current analysis can give us.

I get what Alba is saying about the critique not strongly challenging the article’s conclusions. He’s not saying that Chetty et al. are wrong; it’s more that he’s saying there are a lot of unanswered questions here—a position I’m sure Chetty et al. would themselves agree with!

A possible way forward?

To step back a moment—and recall that I have not tried to digest the Nature articles or the associated news coverage—I’d say that Alba is criticizing a common paradigm of social science research in which a big claim is made from a study and the study has some clear limitations, so the researchers attack the problem in some different ways in an attempt to triangulate toward a better understanding.

There are two immediate reactions I’d like to avoid. The first is to say that the data aren’t perfect, the study isn’t perfect, so we just have to give up and say we’ve learned nothing. From the other direction comes the unpalatable response that all studies are flawed so we shouldn’t criticize this one in particular.

Fortunately, nobody is suggesting either of these reactions. From one direction, critics such as Lehman and Alba are pointing out concerns, but they’re not saying the conclusions of the Chetty et al. study are all wrong or that the study is useless; from the other, news reports do present qualifiers, and they’re not implying that these results are a sure thing.

What we’d like here is a middle way—not just a rhetorical middle way (“This research, like all social science, has weaknesses and threats to validity, hence the topic should continue to be studied by others”) but a procedural middle way, a way to address the concerns, in particular to get some estimates of the biases in the conclusions resulting from various problems with the data.

Our default response is to say the data should be analyzed better: do a propensity analysis to address Lehman’s concern about who’s on Facebook, and do some sort of multilevel model integrating individual and zip-code-level data to address Alba’s concern about aggregation. And this would all be fine, but it takes a lot of work—and Chetty et al. already did a lot of work, triangulating toward their conclusion from different directions. There’s always more analysis that could be done.

Maybe the problem with the triangulation approach is not the triangulation itself but rather the way it can be set up with a central analysis making a conclusion, and then lots of little studies (“robustness checks,” etc.) designed to support the main conclusion. What if the other studies were set up to estimate biases, with the goal not of building confidence in the big number but rather of getting a better, more realistic estimate?

With this in mind, I’m thinking that a logical next step would be to construct a simulation study to get a sense of the biases arising from the issues raised by Lehman and Alba. We can’t easily gather the data required to know what these biases are, but it does seem like it should be possible to simulate a world in which different sorts of people are more or less likely to be on Facebook, and in which there are local patterns of connectedness that are not simply what you’d get by averaging within zip codes.

I’m not saying this would be easy—the simulation would have to make all sorts of assumptions about how these factors vary, and the variation would need to depend on relevant socioeconomic variables—but right now it seems to me to be a natural next step in the research.
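
As a rough sketch of what such a simulation could look like—every parameter value and functional form below is invented for illustration, not calibrated to any data—one could give individuals a probability of being on Facebook that depends on income, let cross-class connectedness vary within zip codes, and compare the zip-code-level, Facebook-only correlation to the individual-level relationship that generated the data:

```r
# Toy simulation of two possible biases: (a) selection into who is on
# Facebook, (b) within-zip-code heterogeneity in cross-class friendships.
# All parameter values are made up for illustration.
set.seed(2022)
n_zip <- 500
n_per_zip <- 100
zip <- rep(1:n_zip, each = n_per_zip)

income <- rnorm(n_zip * n_per_zip, mean = rnorm(n_zip)[zip], sd = 1)
low_ses <- income < median(income)

# True individual-level cross-class connectedness, varying within zip code
connect <- 0.5 * ave(income, zip) + rnorm(length(income), sd = 1)

# True mobility depends only weakly on individual connectedness
mobility <- 0.2 * connect + rnorm(length(income), sd = 1)

# Selection: higher-income people are more likely to be on Facebook
on_fb <- rbinom(length(income), 1, plogis(-0.5 + 0.8 * income)) == 1

# "True" individual-level relationship among low-SES adults
cor(connect[low_ses], mobility[low_ses])

# What the aggregated, Facebook-only analysis sees
zip_connect_fb <- tapply(connect[on_fb & low_ses], zip[on_fb & low_ses], mean)
zip_mobility <- tapply(mobility[low_ses], zip[low_ses], mean)
cor(zip_connect_fb, zip_mobility[names(zip_connect_fb)])
```

The point of such an exercise would not be to match the published numbers but to see how sensitive the aggregated correlation is to assumptions about who is on Facebook and how connectedness varies within zip codes.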

One more thing

Above I stressed the importance and challenge of finding a middle ground between (1) saying the study’s flaws make it completely useless and (2) saying the study represents standard practice so we should believe it.

Sometimes, though, response #1 is appropriate. For example, the study of beauty and sex ratio, or the study of ovulation and voting, or the study claiming that losing an election for governor lops 5 to 10 years off your life—I think those really are useless (except as cautionary tales, lessons of research practices to avoid). How can I say this? Because those studies are just soooo noisy compared to any realistic effect size. There’s just no there there. Researchers can fool themselves because they think that if they have hundreds or thousands of data points, they’re cool, and that if they have statistical significance, they’ve discovered something. We’ve talked about this attitude before, and I’ll talk about it again; I just wanted to emphasize here that it doesn’t always make sense to take the middle way. Or, to put it another way, sometimes the appropriate middle way is very close to one of the extreme positions.

Some project opportunities for Ph.D. students!

Hey! My collaborators and I are working on a bunch of interesting, important projects where we could use some help. If you’d like to work with us, send me an email with your CV and any relevant information and we can see if there’s a way to fit you into the project. Any of these could be part of a Ph.D. thesis. And with remote collaboration, this is open to anyone—you don’t have to be at Columbia University. It would help if you’re a Ph.D. student in statistics or related field with some background in Bayesian inference.

There are other projects going on, but here are a few where we could use some collaboration right away. Please specify in your email which project you’d like to work on.

1. Survey on social conditions and health:

We’re looking for help adjusting for the methodological hit that a study suffered when COVID shut it down in the middle of implementing a sampling design to refresh the cohort, as well as completing interviews for the ongoing cohort members. This will require some thought about sampling adjustments. The team also experimented with conducting some interviews via phone or Zoom during the pandemic, as these were shorter than their regular 2-hour in-person interview, so it would be good to examine item missingness and an imputation strategy for key variables important for the planned analyses.

2. Measurement-error models:

This is something that I’m interested in for causal inference in general, also a recent example came up in climate science, where an existing Bayesian approach has problems, and I think we could do something good here by thinking more carefully about priors. In addition to the technical challenges, climate change is a very important problem to be working on.

3. Bayesian curve fitting for drug development:

This is a project with a pharmaceutical company to use hierarchical Bayesian methods to fit concentration-response curves in drug discovery. The cool thing here is that they have a pipeline with thousands of experiments, so we want an automated approach. This relates to our work on scalable computing, diagnostics, and model understanding, as well as specific issues of nonlinear hierarchical models.

4. Causal inference with latent data:

There are a few examples here of survey data where we’d like to adjust for pre-treatment variables which are either unmeasured or are measured with error. This is of interest for Bayesian modeling and causal inference, in particular the idea that we can improve upon the existing literature by using stronger priors, and also for the particular public health applications.

5. Inference for models identifying spatial locations:

This is for a political science project where we will be conducting a survey and asking people questions about nationalities and ethnic groups and using this to estimate latent positions of groups. Beyond the political science interest (for example, comparing mental maps of Democrats and Republicans), this relates to some research in computational neuroscience. It would be helpful to have a statistics student on the project because there are some challenging modeling and computational issues and it would be good for the political science student to be able to focus on the political science aspects of the project.

Weak separation in mixture models and implications for principal stratification

Avi Feller, Evan Greif, Nhat Ho, Luke Miratrix, and Natesh Pillai write:

Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. . . . We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps.

The paper’s all about maximum likelihood estimates and I don’t care about that at all, but the general principles are relevant to understanding causal inference with intermediate outcomes and fitting such models in Stan or whatever.
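
To get a feel for the pathology they're describing, here is a small R sketch (my own toy example, not the authors' setup): fit a two-component normal mixture by maximum likelihood when the components are only weakly separated, and watch the estimates jump around across simulated datasets.

```r
# Weakly separated two-component Gaussian mixture: the MLE of the mixture
# parameters is unstable. A toy sketch, not the paper's exact setting.
set.seed(1)
neg_loglik <- function(par, y) {
  p <- plogis(par[1]); mu1 <- par[2]; mu2 <- par[3]; sigma <- exp(par[4])
  -sum(log(p * dnorm(y, mu1, sigma) + (1 - p) * dnorm(y, mu2, sigma)))
}
fit_once <- function(n = 500, delta = 0.5) {
  y <- c(rnorm(n / 2, 0, 1), rnorm(n / 2, delta, 1))  # true means 0 and 0.5
  fit <- optim(c(0, -1, 1, 0), neg_loglik, y = y)
  c(prop = plogis(fit$par[1]), mu1 = fit$par[2], mu2 = fit$par[3])
}
# Estimated mixing proportion and component means vary a lot across replications
round(t(replicate(5, fit_once())), 2)
```

When the mixture components stand in for principal strata, this kind of instability propagates into the strata-specific effect estimates, which is the concern the paper formalizes.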

Does having kids really protect you from serious COVID‑19 symptoms?

Aleks pointed me to this article, which reports:

Epidemiologic data consistently show strong protection for young children against severe COVID-19 illness. . . . We identified 3,126,427 adults (24% [N = 743,814] with children ≤18, and 8.8% [N = 274,316] with youngest child 0–5 years) to assess whether parents of young children—who have high exposure to non-SARS-CoV-2 coronaviruses—may also benefit from potential cross-immunity. In a large, real-world population, exposure to young children was strongly associated with less severe COVID-19 illness, after balancing known COVID-19 risk factors. . . .

My first thought was that parents are more careful than non-parents so they’re avoiding covid exposure entirely. But it’s not that: non-parents in the matched comparison had a lower rate of infections but a higher rate of severe cases; see Comparison 3 in Table 2 of the linked article.

One complicating factor is that they didn’t seem to have adjusted for whether the adults were vaccinated–that’s a big deal, right? But maybe not such an issue given that the study ended on 31 Jan 2021, and by then it seems that only 9% of Americans were vaccinated. It’s hard for me to know if this would be enough to explain the difference found in the article–for that it would be helpful to have the raw data, including the dates of these symptoms.

Are the data available? It says, “This article contains supporting information online at http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2204141119/-/DCSupplemental” but when I click on that link it just takes me to the main page of the article (https://www.pnas.org/doi/abs/10.1073/pnas.2204141119) so I don’t know whassup with that.

Here’s another thing. Given that the parents in the study were infected at a higher rate than the nonparents, it would seem that the results can’t simply be explained by parents being more careful. But could it be a measurement issue? Maybe parents were more likely to get themselves tested.
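
Here's a little numerical illustration of how differential testing alone could produce this pattern. The numbers below are completely made up; the point is just that if parents with mild infections are more likely to get tested, parents will show more detected infections but a lower rate of severe illness per detected case, even if the true risks are identical:

```r
# Invented numbers: parents and non-parents have identical true infection
# and severity risks, but parents with mild illness get tested more often.
true_infection_rate <- 0.10
p_severe_given_infected <- 0.02
p_tested_if_mild <- c(parent = 0.60, nonparent = 0.30)  # severe cases always detected

detected <- true_infection_rate *
  (p_severe_given_infected + (1 - p_severe_given_infected) * p_tested_if_mild)
severe_among_detected <- true_infection_rate * p_severe_given_infected / detected

detected               # parents appear to have more infections . . .
severe_among_detected  # . . . but less severe illness per detected case
```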

The article has a one-paragraph section on Limitations, but it does not consider any of the above issues.

I sent the above to Aleks, who added:

My thought is that the population of parents probably lives differently than non-parents: less urban, perhaps biologically healthier. They did match, but just doing matching doesn’t guarantee that the relevant confounders have truly been handled.

This paper is a big deal 1) because it’s used to support herd immunity 2) because it is used to argue against vaccination 3) because it doesn’t incorporate long Covid risks.

For #3, it might be possible to model out the impact, based on what we know about the likelihood of long-term issues, e.g. https://www.clinicalmicrobiologyandinfection.com/article/S1198-743X(22)00321-4/fulltext

Your point about the testing bias could be picked up by the number of symptomatic vs. asymptomatic cases, which would reveal a potential bias.

My only response here is that if the study ends in Jan 2021, I can’t see how it can be taken as an argument against vaccination. Even taking the numbers in Table 2 at face value, we’re talking about a risk reduction for severe covid from having kids of a factor of 1.5. Vaccines are much more effective than that, no? So even if having Grandpa sleep on the couch and be exposed to the grandchildren’s colds is a solution that works for your family, it’s not nearly as effective as getting the shot–and it’s a lot less convenient.

Aleks responds:

Looking at the Israeli age-stratified hospitalization dashboard, the hospitalization rates for unvaccinated 30-39-year-olds are almost 5x greater than for vaccinated & boosted ones. However, the hospitalization rate for the unvaccinated 80+ group is only about 30% higher.

After-work socializing . . . alcohol . . . car crashes?

Tony Williams writes:

I thought of this while reading a complaint on LinkedIn about after-work socialization always involving alcohol (and tobacco, for those so inclined).

There seems to be a lot of neat things that can address a nasty issue, DUI. I think it could make sense as an undergrad/MS applied paper idea or thesis.

Spatial differences (probably state level) on relaxation of social distancing policies. Companies, my guess mostly bigger companies located in or near major metropolitan areas, going to hybrid work schedules. Those companies also having different work-from-office days (though, honestly, Wed and Fri is the common one I’ve seen, and that’s obviously intended in part to not allow employees to live further away and “only” commute for two consecutive days).

Feel free to run with it if you’d like. Seems like a lot of natural variation that could be exploited for causal inference. I’d be interested in any results.

My reply: I have no idea, but my guess is that someone has looked into any such idea, and maybe there’s even a literature on it. The literature would probably be full of the usual problems of selection on statistical significance, but anyone interested in this idea could check out any existing literature and then start from scratch from the raw data. Could be something interesting there if the signal is large enough compared to the variation.

Taking theory more seriously in psychological science

This is Jessica. Last week I virtually attended a workshop on What makes a good theory?, organized by Iris van Rooij, Berna Devezer, Josh Skewes, Sashank Varma, and Todd Wareham. Broadly, the premise is that what it means to do good theorizing in fields like psych has been neglected, including in reform conversations, compared to various ways of improving inference. The workshop brought together researchers interested in the role of theory from different fields (cogsci, philosophy, linguistics, CS, biology, etc.). Some background reading on the premise and how use of theory can be improved in psych and cogsci can be found here, here, and here. You can watch all the workshop keynotes here, and there will be a special issue of Computational Brain and Behavior including topics from some of these discussions in 2023.

The workshop format provided a lot of time for participants to discuss various aspects of theory construction and evaluation, plus daily plenary sessions where results of the ongoing discussions were shared with everyone. I was struck by how little friction there seemed to be in these discussions, despite the fact that many participants were coming with strong views from their own theoretical perspectives. It made me realize how I’ve learned to associate serious discussion of theory and cognitive modeling with combative arguments, and, as someone who works in a very applied field, how rarely I get to have meta-level conversations about what it means to take theory seriously. 

I’ll summarize a few bits here, which should be interpreted as a very partial view of what was discussed.

 One theme was that it’s hard to define what theory is exactly. Across and even within fields what we consider theory can vary considerably. The slipperiness of trying to define theory in some universal way doesn’t preclude reasoning formally about it, but complicates it in that one must define, in a generalizable way, both the language in which a theory is described and the nature of the thing being described, e.g., functions and algorithms expressed in languages that map to string descriptions. See here for an example. A book by Marr came up repeatedly related to motivating different levels of analysis in theories of cognition, including the computational or functional claim about some aspect of cognition, the algorithmic or representational description, and implementation. 

Another takeaway for me was that theory is closely related to explanation, and is pretty certainly underdetermined by evidence. Again, see here for an example. I had to miss it but Martin Modrák led a session on identifiability as a theoretical virtue, a topic Manjari Narayan proposed in her keynote.

A slightly more specific view, proposed by Frank Jäkel, was that theory is talking about classes of models and how they relate. Similarly Artem Kaznatcheev proposed in his keynote that theory interconnects models and distributes modeling over a community. A view from math psych is that theorizing involves critically evaluating how to use formal models and experimental work to answer specific questions about natural phenomena. Another description of theory was as “candidate factive description,” as compared to models, which are tools with stricter representational and idealized elements. But the view that there is no bright line between theory and models was also proposed. Someone else mentioned that a prerequisite of any model is that it’s somehow embedded in theory.

All of this makes me think that the sort of colloquial theories that get described in motivating papers or interpreting evidence always exist somewhere along the boundaries of models, aiming to capture what isn’t fully understood, often in terms of causal explanations that are functional but not really probabilistic. I also found myself thinking a lot about the tension between probabilistic evidence and possibilistic explanations, where the latter often describe why something might be like it is, but have nothing to say about how likely it is. For example, in early stages of exploratory data analysis, one might start to generate hypotheses for why certain patterns exist, but these conjectures typically won’t encode information about prevalence, just possible mechanism. This can make colloquially-stated theory misleading, perhaps, where it describes some seemingly “intuitive” account that might be consistent with or inspired by some evidence but not representative of more common processes.  

Related to this, one dimension that came up several times was the relationship between theory and evidence. Sarahanne Field proposed a session exploring the extent to which bad data or other inputs to science can still be used to derive theory, and Marieke Woensdregt proposed a session on how theory is affected by sparse evidence, which is sometimes all that is available. Berna Devezer suggested Hasok Chang’s Inventing Temperature for a good example of bad evidence producing useful theory.

There’s a question about what it means to use a theory that’s not true, versus a model that’s not true. I found myself thinking about how theory-forward work can put one in a weird situation where you have to take the theory seriously even if you don’t think it’s right. My own experiences with applying Bayesian models of cognition fall in this category. Taking a theory seriously helps guarantee your attention is going to be there to see how it fails, i.e., trying to resolve ambiguities through data helps you understand what exactly is wrong in the assumptions you’re making. It can also be a generative tool that gives you a strange lens, as discussed in Sashank Varma’s talk, that can lead you to knowledge you might not have otherwise discovered. But taking a theory very seriously can also be bad for learning, if it biases evidence collection too much, for example. Theories structure the way we see the world. Embracing an idea while simultaneously being aware of the various ways in which it seems dismissable is a very familiar experience to me in my more theoretically-motivated modeling work, and can be hard to explain to those who have never tried to deal in these spaces.

Somewhat related, I often think about theory as having many different functions. Manjari Narayan commented on the various functions of statistical theories, including sometimes being used to make true statements, albeit of limited scope, versus sometimes being applied to make normative statements about how to do things without a truth commitment. 

One challenge in advocating for better theory in empirical fields, related to questions proposed by Devezer with help from Field, is that often what it looks like to do theory is left implicit. It’s not necessarily the kind of thing that’s taught the way modeling is. This raises the question of what it really means to approach a problem theoretically. The idea that what good theory is can be dependent on the domain or specific problem space came up a fair amount as well, with Dan Levenstein proposing a few sessions devoted to this. 

Tools for CS theory, especially complexity theory, came up multiple times, including in a discussion proposed by Jon Rawski and led with Artem Kaznatcheev. I like the idea of analyzing the tractability of searching for theory under different assumptions like truth commitments, since it helps illustrate why we have to be wary of placing too much trust in theories that we’ve never attempted to formalize. 

Olivia Guest gave a keynote that got me thinking about theory as an interface, which should be user-friendly in various ways. Patricia Rich’s keynote also got at this a bit, including the value of a theory that can be expressed visually. This reminded me of things I’ve read on the idea of cognitive and social values, in addition to epistemic values, in science, e.g., in work by Heather Douglas. 

Guest’s keynote also suggested that theory should be inclusive in the sense of not only being accessible or prescribed to by select parts of a community, and Rich’s talked about the importance of community in making theory and how well a theory diffuses power across a field. 

What it means for something to be a black box came up a few times, including that the meaning of a black box in psych has changed, from referring to something like behaviorism to becoming a bracketing device. There was a session about ML and/as/in science proposed by Mel Andrews, which I had to miss due to timing, but it seemed related to things I’ve been thinking about lately. For instance, there’s a question of how the kinds of claims generated in ML research should be taken: as scientific statements subject to the usual criteria, or as engineering descriptions? And whether it’s fair to think about using classes of modern ML, e.g., deep learning, as a type of atheoretical refuge, where the techniques don’t need to be fully understood or explained. Relatedly, I wonder about the value of critiquing the absence of theory in ML, and what it means to strive toward more rigorous theory in fields focused on building tools where performance in target applications can be measured fairly directly, as opposed to fields geared toward description and explanation. Does having a powerful statistical learning method that lets you take any new evaluative criteria and train directly to optimize for them make more explicit theoretical frameworks, for instance those describing data generation, more of a nice-to-have than a necessity?

There’s also the question of how theory evolves. I was reading Gelman and Shalizi at the same time as the workshop, and thinking about whether there’s an analogous trap to trying to quantify support for competing models in stats when it comes to how we think about theory progressing, where model expansion and checking might be better aims than direct comparison between competing theories. Somewhat relatedly, Artem Kaznatcheev’s keynote talked about how the mark of a field where theory can flourish is deep engagement with prior work, including extending it, synthesizing and unifying, and constructively challenging it.

After all this discussion, I feel less comfortable labeling certain types of work as atheoretical or blindly empirical, which were terms I used to use casually. Theory is everywhere, sometimes only implicitly, which is why we should be talking about it more.

Postdoc opportunity in San Francisco: methods research in public health

Angela Aidala points us to this ad:

Evidence for Action (E4A) is hiring a postdoc to work in our Methods Lab to help develop and share approaches to overcoming methodological challenges in research to advance health and racial equity. The postdoc should have training and interest in quantitative research methods and their application toward dismantling root causes of health inequities.

The individual will work on emerging topics that may focus on areas such as critical perspectives on quantitative methods in disparities research, data pooling to address small sample sizes, and development of measures relevant to advancing racial equity.

For full consideration, applications are due June 27, 2022. For more information visit https://e4action.org/PostDoc

They say:

The methods issues that arise in research on social conditions and health can be particularly difficult: pure randomization is rarely feasible, measurement is challenging, and causal pathways underlying effects are complex and cyclical.

Yup. They don’t explicitly mention Bayes/Stan, but it couldn’t hurt.

How can a top scientist be so confidently wrong? (R. A. Fisher and smoking example)

Robert Proctor points us to this lecture, “Why did Big Tobacco love (and fund) eugenicists like R.A. Fisher?” It’s pretty good! He talks about material from his 2011 book, Golden Holocaust: Origins of the Cigarette Catastrophe and the Case for Abolition, which we discussed here way back when: “If our product is harmful . . . we’ll stop making it.”

Below are some fun/horrifying slides from Proctor’s talk.

Here’s the always-charming R. A. Fisher disparaging experts by throwing around the “terrorist” label. I didn’t even know they called people terrorists that back in 1957! It looks like Fisher would fit in really well in modern academic psychology or the anti-vaccine movement:

And here’s some fun junk science (published in the Journal of the National Cancer Institute, no less!) pushing non-cigarette links to cancer:

Here’s the paper. It’s really bad! They attribute adult lung cancer to “the absence of vitamin A at birth.” It’s good to know that junk science is not a new thing. Lots of fun stuff in this paper, including the claim that, because the distribution of ages of lung cancer patients in the study is approximately normal, “Mathematically, this means that only one factor is responsible for its origin; otherwise there would be more peaks.” Really this one deserves its own post.

Here’s a document from 1972 indicating that the cigarette company executives knew that the scientific jig was up, and it was only a matter of time before general opinion caught up with public health authorities regarding the dangers of smoking:

And here are a couple of creepy-ass dudes, one of whom was a literal Nazi and the other who has a Bacon number of 3:


Finally, a story in the National Enquirer planted by a cigarette company public relations firm:

How can a top scientist be so confidently wrong?

How did the great R. A. Fisher end up being so wrong about cigarettes and cancer? Proctor suspects that, unlike the cigarette executives (who were sitting on tons of evidence that smoking caused cancer), Fisher was not purposely spreading misinformation; rather, he was just naive. But then this just pushes the question back one step: How could this brilliant statistician be so naive?

I don’t have a full answer, but one factor may be politics. Fisher was very politically conservative, and, for whatever reason, cigarettes were a conservative cause. (On the other side were lots of far-left academics who believed in whatever far-left causes were going around.)

Beyond this, people make mistakes. Brilliance represents an upper bound on the quality of your reasoning, but there is no lower bound. The most brilliant scientist in the world can take really dumb stances. Indeed, the success that often goes with brilliance can encourage a blind stubbornness. Not always—some top scientists are admirably skeptical of their own ideas—but sometimes. And if you want to be stubborn, again, there’s no lower bound on how wrong you can be. The best driver in the world can still decide to turn the steering wheel and crash into a tree.

In the matter of smoking and cancer, Fisher comes off looking a bit worse than—or maybe not so bad as—the run-of-the-mill academic hacks who cynically accepted $$$ from the cigarette companies, I guess on the theory that the money was going somewhere, so they might as well get some of it. Fisher wasn’t just taking the cash, he was also promulgating bad science.

As Proctor, Naomi Oreskes, and others have discussed, the “merchants of doubt” of the cigarette industry have used similar tactics in other scientific and policy areas, simultaneously promoting junk science while doing their best to degrade the credibility of scientific thinking more generally. I think work such as Proctor’s is valuable, even if you think he goes over the top sometimes, not just for the fun stories (but I do appreciate that) but also because it is through historical examples that we understand the world.

P.S. Interesting comment here from Paul Alper.

P.P.S. Gregory Mayer reports:

There’s a paper on this topic:

Stolley, P. D. 1991. When Genius Errs: R. A. Fisher and the Lung Cancer Controversy. American Journal of Epidemiology 133(5):416–425.

and a more recent online account:

Christopher, B. 2016. Why the Father of Modern Statistics Didn’t Believe Smoking Caused Cancer.

Both these articles are good. I particularly recommend Stolley. Here are the reasons he gives for Fisher being so off on smoking and cancer:

– “Fisher sounds like a man with an axe to grind. . . . unwilling to seriously examine the data and to review all the evidence before him to try to reach a judicious conclusion.” For example, Stolley discusses Fisher’s unconventional theory of smoking and cancer and writes, “Fisher never produced any data or organized any study to follow up on this implausible hypothesis.”

– “Fisher was a smoker himself. Part of his resistance to seeing the association may have been rooted in his own fondness for smoking and his dislike of criticism of any part of his life.”

– “Fisher was upset by the public health response to the dangers of smoking not only because he felt that the supporting data were weak, but also due to his holding certain ideologic objections to mass public health campaigns.”

– “[Fisher] was good with data while working on one small set but was not easily able to integrate multiple or large data sets.” Stolley illustrates this point with a long quote from Yates and Mather, along with a detailed discussion of Fisher’s analyses of various small data sets where he picks at particular issues without looking at other relevant data or considering problems with the numbers he was looking at (for example, tables sent to him from a twin study for which no information is supplied about how the data were collected or recorded).

– “Perhaps another reason [Fisher] persisted in the cigarette controversy may have had to do with the circumstances of his life when he became engaged with the issue. He had just retired and was at loose ends. He received a lot of public attention for his views on lung cancer and fashioned a talk on the subject which he gave all over the world and which he arranged to have reprinted. It was well received, even by medical audiences.”

– “Fisher’s belief in the importance of genetics and his feeling that it is often neglected in medical research also influenced his views . . . His advancing the hypothesis as to the genetic predisposition to smoke in an effort to supplant the cigarette hypothesis reflects his intense interest in genetics.”

– “[Fisher] had wandered too far out of his field. He knew very little of the case-control method but was entirely suspicious of it because of the absence of his beloved randomization. Nor was he well informed about the ecologic data concerning smoking and cancer and, even when he comments on it, he gets it quite wrong. . . . He apparently made no attempt to put all the evidence together, to read all the older papers, to wonder if the disease was uncommon among populations that didn’t smoke. . . .”

Stolley summarizes:

There is a type of mind that is inclined to commit the fallacy of the possible proof; i.e., because an explanation is possible, it becomes somehow probable in their minds, particularly if they thought of it.

It is not surprising that Fisher and others questioned the association between smoking and lung cancer; controversy is fundamental in keeping scientific investigation from stagnating. What is startling is that someone of Fisher’s intellectual caliber would allow isolated evidence which coincided with his previously held views to blind him to all else.

The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning

This is Jessica. In a paper to appear at AIES 2022, Sayash Kapoor, Priyanka Nanayakkara, Arvind Narayanan, and Andrew and I write:

Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for greater integration of statistical approaches to causal inference and predictive modeling.

A deeper understanding of what reproducibility critiques in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective, and help researchers avoid “the worst of both worlds,” where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML.

Our results highlight where problems discussed across the two domains stem from similar types of oversights, including overreliance on theory, underspecification of learning goals, non-credible beliefs about real-world data generating processes, overconfidence based in conventional faith in certain procedures (e.g., randomization, test-train splits), and tendencies to reason dichotomously about empirical results. In both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often difficult to refute due to underspecification of the learning pipeline. We note how many of the errors recently discussed in ML expose the cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. At the same time, the goals of ML are inherently oriented toward addressing learning failures, suggesting that lessons about irreproducibility could be resolved through further methodological innovation in a way that seems unlikely in social psychology. This assumes, however, that ML researchers take concerns seriously and avoid overconfidence in attempts to reform. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role that human inductive biases play in learning and reform.

As someone who has followed the replication crisis in social science for years and now sits in a computer science department where it’s virtually impossible to avoid engaging with the huge crushing bulldozer that is modern ML, I often find myself trying to make sense of ML methods and their limitations by comparison to estimation and explanatory modeling. At some point I started trying to organize these thoughts, then enlisted Sayash and Arvind, who had done some work on ML reproducibility, Priyanka, who follows work on ML ethics and related topics, and Andrew as authority on empirical research failures. It was a good coming together of perspectives, and an excuse to read a lot of interesting critiques and foundational stuff on inference and prediction (we cite over 200 papers!). As a ten-page conference-style paper this was obviously ambitious, but the hope is that it will be helpful to others who have found themselves trying to understand how, if at all, these two sets of critiques relate. On some level I wrote it with computer science grad students in mind–I teach a course to first-year PhDs where I talk a little about reproducibility problems in CS research and what’s unique compared to reproducibility issues in other fields, and they seem to find it helpful.

The term learning in the title is overloaded. By “errors in learning” here we are talking about not just problems with whatever the fitted models have inferred–we mean the combination of the model implications and the human interpretation of what we can learn from it, i.e., the scientific claims being made by researchers. We break down the comparison based on whether the problems are framed as stemming from data problems, model representation bias, model inference and evaluation problems, or bad communication.

[Table from the paper comparing concerns in ML versus psychology.]

The types of data issues that get discussed are pretty different – small samples with high measurement error versus datasets that are too big to understand or document. The underrepresentation of subsets of the population to which the results are meant to generalize comes up in both fields, but with a lot more emphasis on implications for fairness in decision pipelines in ML based on its applied status. ML critics also talk about unique data issues like “harms of representation,” where model predictions reinforce some historical bias, like when you train a model to make admissions decisions based on past decisions that were biased against some group. The idea that there is no value-neutral approach to creating technology so we need to consider normative ethical stances is much less prevalent in mainstream psych reform, where most of the problems imply ways that modeling diverges from its ideal value-neutral status. There are some clearer analogies though if you look at concerns about overlooking sampling error and power issues in assessing the performance of an ML model.

Choosing representations and doing inference are also obviously different on the surface in ML versus psych, but here the parallels in critiques that reformers are making are kind of interesting. In ML there’s colloquially no need to think about the psychological plausibility of the solutions that a learner might produce; it’s more about finding the representation where the inductive bias, i.e., properties of the solutions that it finds, is desirable for the learning conditions. But if you consider all the work in recent years aimed at improving the robustness of models to adversarial manipulations to input data, which basically grew out of acknowledgment that perturbations of input data can throw a classifier off completely, it’s often implicit that successful learning means the model learns a function that seems plausible to a human. E.g., some of the original results motivating the need for adversarial robustness were surprising because they show that manipulations that a human doesn’t perceive as important (like slight noising of images or masking of parts that don’t seem crucial) can cause prediction failures. Simplicity bias in stochastic gradient descent can be cast as a bad thing when it causes a model to overrely on a small set of features (in the worst case, features that correlate with the correct labels as a result of biases in the input distribution, like background color or camera angle being strongly correlated with what object is in the picture). Some recent work explicitly argues that this kind of “shortcut learning” is bad because it defies expectations of a human who is likely to consider multiple attributes to do the same task (e.g., the size, color, and shape of the object). Another recent explanation is underspecification, which is related but more about how you can have many functions that achieve roughly the same performance given a standard test-validate-train approach but where the accuracy degrades at very different rates when you probe them along some dimension that a human thinks is important, like fairness. So we can’t really escape caring about how features of the solutions that are learned by a model compare to what we as humans consider valid ways to learn how to do the task.

We also compare model-based inference and evaluation across social psych and ML. In both fields, implicit optimization–for statistical significance in psych and better-than-SOTA performance in ML–is suggested to be a big issue. However, in contrast to using analytical solutions like MLE in psych, optimization is typically non-convex in ML, such that the hyperparameters, initial conditions, and computational budget you use in training the model can matter a lot. One problem critics point to is that in reporting, researchers don’t always recognize this. How you define the baselines you test against is another source of variance, and potentially bias if chosen in a way that improves your chances of beating SOTA.

In terms of high-level takeaways, we point out ways that claims are irrefutable by convention across the two fields. In ML research one could say there’s confusion about what’s a scientific claim and what’s an engineering artifact. When a paper claims to have achieved X% accuracy on YZ benchmark with some particular learning pipeline, this might be useful for other researchers to know when attempting progress on the same problem, but the results are more possibilistic than probabilistic, especially when based on only one possible configuration of hyperparameters etc and with an implicit goal of showing one’s method worked. The problem is that the claims are often stated more broadly, suggesting that certain innovations (a new training trick, a model type) led to better performance on a loosely defined learning task like ‘reading comprehension,’ ‘object recognition’, etc. In a field like social psych on the other hand you have a sort of inversion of NHST as intended, where a significant p-value leads to acceptance of loosely defined alternative hypotheses and subject samples are often chosen by convenience and underdescribed but claims imply learning something about people in general.

There’s also some interesting stuff related to how the two fields fail in different ways based on unrealistic expectations about reality. Meehl’s crud factor implies that using noisy measurements, small samples and misspecified models to argue about classes of interventions that have large predictable effects on some well-studied class of outcomes (e.g., political behavior) is out of touch with common sense about how we would expect multiple large effects to interact. In ML, the idea that we can leverage many weak predictors to make good predictions is accepted, but assumptions that distributions are stationary and that good predictive accuracy can stand alone as a measure of successful learning imply a similarly naive view of the world.

So… what can ML learn from the replication crisis in psych about fixing its problems? This is where our paper (intentionally) disappoints! Some researchers are proposing solutions to ML problems, ranging from fairly obvious steps like releasing all code and data to things like templates for reporting on limitations of datasets and behavior of models to suggestions of registered reports or pre-registration. Especially in an engineering community there’s a strong desire to propose fixes when a problem becomes apparent, and we had several reviewers who seemed to think the work was only really valuable if we made specific recommendations about what psych reform methods can be ported to ML. But instead the lesson we point out from the replication crisis is that if we ignore the various sources of uncertainty we face about how to reform a field—in how we identify problematic claims, how we define the core reasons for the problems, and how we know that a particular reform will be more successful than others—it’s questionable whether we’re making real progress in reform. Wrapping up a pretty nuanced comparison with a few broad suggestions based on our instincts just didn’t feel right.

Ultimately this is the kind of paper that I’ll never feel is done to satisfaction, since there’s always some new way to look at it, or type of problem we didn’t include. There are also various parts where I think a more technical treatment would have been nice to relate the differences. But as I think Andrew has said on the blog, sometimes you have to accept you’ve done as much as you’re going to and move on from a project.

Why not look at Y?

In some versions of a “design-based” perspective on causal inference, the idea is to focus on how units are assigned to different treatments (i.e. exposures, actions), rather than focusing on a model for the outcomes. We may even want to prohibit loading, looking at, etc., anything about the outcome (Y) until we have settled on an estimator, which is often something simple like a difference-in-means or a weighted difference-in-means.

Taking a design-based perspective on a natural experiment, then, one would think about how Nature (or some other haphazard process) has caused units to be assigned to (or at least nudged, pushed, or encouraged into) treatments. Taking this seriously, identification, estimation, and inference shouldn’t be based on detailed features of the outcome or the researcher’s preference for, e.g., some parametric model for the outcome. (It is worth noting that common approaches to natural experiments, such as regression discontinuity designs, do in fact make central use of quantitative assumptions about the smoothness of the outcome. For a different approach, see this working paper.)

Taking a design-based perspective on an observational study (without a particular, observed source of random selection into treatments), one then considers whether it is plausible that, conditional on some observed covariates X, units are (at least as-if) randomized into treatments. Say, thinking of the Infant Health and Development Program (IHDP) example used in Regression and Other Stories, if we consider infants with identical zip code, sex, age, mother’s education, and birth weight, perhaps these infants are effectively randomized to treatment. We would assess the plausibility of this assumption — and our ability to employ estimators based on it (by, e.g., checking whether we have a large enough sample size and sufficient overlap to match on all these variables exactly) — without considering the outcome.

This general idea is expressed forcefully in Rubin (2008) “For objective causal inference, design trumps analysis”:

“observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data”

Randomized experiments “are automatically designed without access to any outcome data of any kind; again, a feature not entirely distinct from the previous reasons. In this sense, randomized experiments are ‘prospective.’ When implemented according to a proper protocol, there is no way to obtain an answer that systematically favors treatment over control, or vice versa.”

But why exactly? I think there are multiple somewhat distinct ideas here.

(1) If we are trying to think by analogy to a randomized experiment, we should be able to assess the plausibility of our as-if random assumptions (i.e. selection on observables, conditional unconfoundedness, conditional exogeneity). Supposedly our approach is justified by these assumptions, so we shouldn’t sneak in, e.g., parametric assumptions about the outcome.

(2) We want to bind ourselves to an objective approach that doesn’t choose modeling assumptions to get a preferred result. Even if we aren’t trying to do so (as one might in a somewhat adversarial setting, like statisticians doing expert witness work), we know that once we enter the Garden of Forking Paths, we can’t know (or simply model) how we will adjust our analyses based on what we see from some initial results. (And, even if we only end up doing one analysis, frequentist inference needs to account for all the analyses we might have done had we gotten different results.) Perhaps there is really nothing special about causal inference or a design-based perspective here. Rather, we hope that as long as we don’t condition our choice of estimator on Y, we avoid a bunch of generic problems in data analysis and ensure that our statistical inference is straightforward (e.g., we do a z-test and believe in it).

So if (2) is not special to causal inference, then we just have to particularly watch out for (1).

But we often find we can’t match exactly on X. In one simple case, X might include some continuous variables. Also, we might find conditional unconfoundedness more plausible if we have a high-dimensional X, but this typically makes it unrealistic that we’ll find exact matches, even with a giant data set. So typical approaches relax things a bit. We don’t match exactly on all variables individually. We might match only on propensity scores, maybe splitting strata for many-to-many matching until we reach a stratification where there is no detectable imbalance. Or match after some coarsening, which often starts to look like a way to smuggle in outcome-modeling (even if some methodologists don’t want to call it that).

Thus, sometimes — perhaps in the cases where conditional unconfoundness is most plausible because we can theoretically condition on a high-dimensional X — we could really use some information about what covariates actually matter for the outcome. (This is because we need to deal with having finite, even if big, data.)

One solution is to use some sample splitting (perhaps with quite-specific pre-analysis plans). We could decide (ex ante) to use 1% of the outcome data to do feature selection, using this to prioritize which covariates to match on exactly (or close to it). For example, MALTS uses a split sample to learn a distance metric for subsequent matching. This seems like it can avoid the problems raised by (2). But nonetheless it involves bringing in quantitative information about the outcome.
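
Here is a rough sketch of the kind of ex-ante sample splitting described above, in the spirit of (but not identical to) MALTS. Everything is simulated, and the variable names and the 1% split are just placeholders:

```r
# Sketch: use a small, pre-specified split of the outcome data to prioritize
# covariates, then stratify/match on those covariates in the remaining data.
set.seed(7)
n <- 10000
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
treat <- rbinom(n, 1, plogis(0.5 * X$x1))          # x1 drives selection
y <- 1 + 0.5 * X$x1 + 2 * X$x2 + treat + rnorm(n)  # true treatment effect = 1

split <- sample(n, size = 0.01 * n)   # 1% used only for covariate prioritization
rest <- setdiff(seq_len(n), split)

# Rank covariates by their association with the outcome in the small split
scores <- sapply(X[split, ], function(x) abs(cor(x, y[split])))
top <- names(sort(scores, decreasing = TRUE))[1:2]

# Coarsen the prioritized covariates and estimate within-stratum differences
strata <- interaction(lapply(X[rest, top, drop = FALSE], cut, breaks = 5))
d <- data.frame(y = y[rest], treat = treat[rest], strata = strata)
coef(lm(y ~ treat + strata, data = d))["treat"]    # crude stratified estimate
```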

Thus, while I like MALTS-style solutions (and we used MALTS in one of three studies of prosocial incentives in fitness tracking), it does seem like an important departure from a fully design-based “don’t make assumptions about the outcomes” perspective. But perhaps such a perspective is often misplaced in observational studies anyway — if we don’t have knowledge of what specific information was used by decision-makers in selection into treatments. And practically, with finite data, we have to make some kind of bias–variance tradeoff — and looking at Y can help us a bit with that.

[This post is by Dean Eckles.]

“Stylized Facts in the Social Sciences”

Sociologist Daniel Hirschman writes:

Stylized facts are empirical regularities in search of theoretical, causal explanations. Stylized facts are both positive claims (about what is in the world) and normative claims (about what merits scholarly attention). Much of canonical social science research can be usefully characterized as the production or contestation of stylized facts. Beyond their value as grist for the theoretical mill of social scientists, stylized facts also travel directly into the political arena. Drawing on three recent examples, I show how stylized facts can interact with existing folk causal theories to reconstitute political debates and how tensions in the operationalization of folk concepts drive contention around stylized fact claims.

Interesting. I heard the term “stylized facts” many years ago in conversations with political scientists—but from Hirschman’s article, I learned that the expression is most commonly used in economics, and it was originally used in a 1961 article by macroeconomist Nicholas Kaldor, who wrote:

Since facts, as recorded by statisticians, are always subject to numerous snags and qualifications, and for that reason are incapable of being accurately summarized, the theorist, in my view, should be free to start off with a ‘stylized’ view of the facts—i.e. concentrate on broad tendencies, ignoring individual detail, and proceed on the ‘as if’ method, i.e. construct a hypothesis that could account for these ‘stylized’ facts, without necessarily committing himself on the historical accuracy, or sufficiency, of the facts or tendencies thus summarized.

Hirschman writes:

“Stylized fact” is a term in widespread use in economics and is increasingly used in other social sciences as well. Thus, in some important sense, this article is an attempt to theorize a “folk” concept, with the relevant folk being social scientists themselves. . . . I argue that stylized facts should be understood as simple empirical regularities in need of explanation.

To me, this seems close, but not quite right. I agree with everything about this paragraph except for the last four words. A stylized fact can get explained but I think it remains a stylized fact, even though it is no longer in need of explanation. I’d say that, in social science jargon, a stylized fact in need of explanation is called a “puzzle.” Once the puzzle is figured out, it’s still a stylized fact.

But that’s just my impression. As Hirschman says, a term is defined by its use, and maybe the mainstream use of “stylized fact” is actually restricted to what I would call a puzzle or an unexplained stylized fact.

Why ask, “Why ask why?”?

In any case, beyond being a careful treatment of an interesting topic, Hirschman’s discussion interests me because it connects to a concern that Guido Imbens and I raised a few years ago regarding the following problem that we characterize as being typical of a lot of scientific reasoning:

Some anomaly is observed and it needs to be explained. The resolution of the anomaly may be an entirely new paradigm (Kuhn, 1970) or a reformulation of the existing state of knowledge (Lakatos, 1978). . . . We argue that a question such as "Why are there so many cancers in this place?" can be viewed not directly as a question of causal inference, but rather indirectly as an identification of a problem with an existing statistical model, motivating the development of more sophisticated statistical models that can directly address causation in terms of counterfactuals and potential outcomes.

In short, we say that science often proceeds by identifying stylized facts, which, when they cause us to ask “Why?”, represent anomalies that motivate further study. But in our article, Guido and I didn’t mention the term “stylized fact.” We situated our ideas within statistics, econometrics, and the philosophy of science. Hirschman takes this all a step further by connecting it to the practice of social science.

“Causal” is like “error term”: it’s what we say when we’re not trying to model the process.

After my talks at the University of North Carolina, Cindy Pang asked me a question regarding causal inference and spatial statistics: both topics are important in statistics but you don’t often see them together.

I brought up the classical example of agricultural studies, in which different levels of fertilizer are applied to different plots, the plots have a spatial structure (for example, laid out in rows and columns), and the fertilizer can spread through the soil to affect neighboring plots. This is known in causal inference as the spillover problem, and the way I'd recommend attacking it is to set up a parametric model in which the spillover decays with distance, so that each plot receives an effective dose (its own assignment plus spillover from its neighbors), and then to directly fit a model of the effect of that dose on the outcome.
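
To make that concrete, here's a minimal sketch in R of the kind of model I have in mind. Everything here (the grid layout, the exponential-decay form of the spillover, the parameter values) is made up for illustration, not taken from any actual agricultural study:

set.seed(123)

# Plots laid out on a 10 x 10 grid, with fertilizer levels randomly assigned
grid <- expand.grid(row = 1:10, col = 1:10)
n <- nrow(grid)
z <- sample(c(0, 1, 2), n, replace = TRUE)   # assigned fertilizer level
d <- as.matrix(dist(grid))                   # distances between plots

# Simulated data-generating process: spillover decays as exp(-distance / rho),
# and the outcome depends on the effective dose each plot ends up receiving
rho_true <- 1.5
beta_true <- 2
kernel <- exp(-d / rho_true)
diag(kernel) <- 0
dose_true <- z + kernel %*% z                # own assignment plus spillover
y <- as.numeric(10 + beta_true * dose_true + rnorm(n, sd = 1))

# Estimate the decay parameter and the fertilizer effect together,
# here by profiling the least-squares fit over rho
rss_given_rho <- function(rho) {
  k <- exp(-d / rho)
  diag(k) <- 0
  dose <- z + k %*% z
  sum(resid(lm(y ~ dose))^2)
}
rho_hat <- optimize(rss_given_rho, interval = c(0.1, 10))$minimum
k_hat <- exp(-d / rho_hat)
diag(k_hat) <- 0
summary(lm(y ~ I(z + k_hat %*% z)))          # estimated effect of effective dose

In a real analysis I'd fit the decay parameter and the effect jointly (in Stan, say) and propagate the uncertainty in rho, but the point is just that once you write down the spillover mechanism, the estimation is ordinary model fitting.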

The discussion got me thinking about the way in which we use the term “causal inference” in statistics.

Consider some familiar applications of “causal inference”:
– Clinical trials of drugs
– Observational studies of policies
– Survey experiments in psychology.

And consider some examples of problems that are not traditionally labeled as “causal” but do actually involve the estimation of effects, in the sense of predicting outcomes under different initial conditions that can be set by the experimenter:
– Dosing in pharmacology
– Reconstructing climate from tree rings
– Item response and ideal-point models in psychometrics.

So here's my thought: statisticians use the term "causal inference" when we're not trying to model the process. Causal inference is for black boxes. Once we have a mechanistic model, it just feels like "modeling," not like "causal inference." Issues of causal identification still matter, and selection bias can still kill you, but typically once we have the model for the diffusion of fertilizer or whatever, we just fit the model, and it doesn't seem like a causal inference problem; it's just an inference problem. To put it another way, causal inference is all about the aggregation of individual effects into average effects, and if you have a direct model for individual effects, then you just fit it directly.

This post should have no effect on how we do any particular statistical analysis; it's just a way to help us structure our thinking on these problems.

P.S. Just to clarify: In my view, all the examples above are causal inference problems. The point of this post is that only the first set of examples are typically labeled as “causal.” For example, I consider dosing models in pharmacology to be causal, but I don’t think this sort of problem is typically included in the “causal inference” category in the statistics or econometrics literature.

Jamaican me crazy: the return of the overestimated effect of early childhood intervention

A colleague sent me an email with the above title and the following content:

We were talking about Jamaica childhood intervention study. The Science paper on returns to the intervention 20 years later found a 25% increase but an earlier draft had reported a 42% increase. See here.

Well, it turns out the same authors are back in the stratosphere! In a Sept 2021 preprint, they report a 43% increase, but now 30 rather than 20 years after the intervention (see abstract). It seems to be the same dataset and they again appear to have a p-value right around the threshold (I think this is the 0.04 in the first row of Table 1 but I did not check super carefully).

Of course, no mention I could find of selection effects, the statistical significance filter, Type M errors, the winner’s curse or whatever term you want to use for it…

From the abstract of the new article:

We find large and statistically significant effects on income and schooling; the treatment group had 43% higher hourly wages and 37% higher earnings than the control group.

The usual focus of economists is earnings, not wages, so let’s go with that 37% number. It’s there in the second row of Table 1 of the new paper: the estimate is 0.37 with a standard error of . . . ummmm, it doesn’t give a standard error but it gives a t statistic of 1.68—

What????

B-b-b-but in the abstract it says this difference is “statistically significant”! I always thought that to be statistically significant the estimate had to be at least 2 standard errors from zero . . .

They have some discussion of complicated nonparametric tests that they do, but if your headline number is only 1.68 standard errors away from zero, asymptotic theory is the least of your problems, buddy. Going through page 9 of the paper, it's kind of amazing how much high-tech statistics and econometrics they're throwing at this simple comparison.

Anyway, their estimate is 0.37 with standard error 0.37/1.68 = 0.22, so the 95% confidence interval is [0.37 +/- 2*0.22] = [-0.07, 0.81]. But it’s “statistically significant” cos the 1-sided p-value is 0.05. Whatever. I don’t really care about statistical significance anyway. It’s just kinda funny that, after all that effort, they had to punt on the p-value like that.
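
If you want to check the arithmetic yourself, here it is in R, taking the reported point estimate and t statistic at face value and using a normal approximation (the authors' actual reference distribution may differ):

est <- 0.37
t_stat <- 1.68
se <- est / t_stat                      # about 0.22

ci_95 <- est + c(-2, 2) * se            # about [-0.07, 0.81]
p_two_sided <- 2 * pnorm(-abs(t_stat))  # about 0.09
p_one_sided <- pnorm(-abs(t_stat))      # about 0.05

round(c(se = se, lower = ci_95[1], upper = ci_95[2],
        p_two_sided = p_two_sided, p_one_sided = p_one_sided), 2)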

Going back to the 2014 paper, I came across this bit:

I guess that any p-value less than 0.10 is statistically significant. That’s fine; they should just get on the horn with the ORBITA stents people, because their study, when analyzed appropriately, ended up with p = 0.09, and that wasn’t considered statistically significant at all; it was considered evidence of no effect.

I guess the rule is that, if you're lucky enough to get a p-value between 0.05 and 0.10, you get to pick the conclusion based on what you want to say: if you want to emphasize the result, call it statistically significant; if not, call it non-significant. Or you can always fudge it by using a term like "suggestive." In the above snippet they said the treatment "may have" improved skills and that treatment "is associated with" migration. I wonder if that phrasing was a concession to the fat p-value of 0.09. If the p-value had been at a more conventionally attractive level, 0.05 or below, maybe they would've felt free to break out the causal language.

But . . . it’s kind of funny for me to be riffing on p-values and statistical significance, given that I don’t even like p-values and statistical significance. I’m on record as saying that everything should be published and there should be no significance threshold. And I would not want to “threshold” any of this work either. Publish it all!

There are two places where I would diverge from these authors. The first is in their air of certainty. Rather than saying a “large and statistically significant effect” of 37%, I’d say an estimate of 37% with a standard error of 22%, or just give the 95% interval like they do in public health studies. JAMA would never let you get away with just giving the point estimate like that! Seeing this uncertainty tells you a few things: (a) the data are compatible (to use Sander Greenland’s term) with a null effect, (b) if the effect is positive, it could be all over the place, so it’s misleading as fiddlesticks to call it “a substantial increase” over a previous estimate of 25%, and (c) it’s empty to call this a “large” effect: with this big of a standard error, it would have to be “large” or it would not be “statistically significant.” To put it another way, instead of the impressive-sounding adjective “large” (which is clearly not necessarily the case, given that the confidence interval includes zero), it would be more accurate to use the less-impressive-sounding adjective “noisy.” Similarly, their statement, “Our results confirm large economic returns . . .”, seems a bit irresponsible given that their data are consistent with small or zero economic returns.

The second place I’d diverge from the authors is in the point estimate. They use a data summary of 37%. This is fine as a descriptive data summary, but if we’re talking policy, I’d like some estimate of treatment effect, which means I’d like to do some partial pooling with respect to some prior, and just about any reasonable prior will partially pool this estimate toward 0.
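
Here's a toy version of that partial pooling in R, just to show the direction and rough size of the shrinkage. The normal-normal setup and the particular prior scale are my assumptions, not anything the authors would necessarily sign off on:

est <- 0.37          # reported estimate of the earnings difference
se  <- 0.22          # implied standard error

prior_mean <- 0
prior_sd   <- 0.10   # a skeptical prior: effects much bigger than 20% seem unlikely

post_prec <- 1 / prior_sd^2 + 1 / se^2
post_mean <- (prior_mean / prior_sd^2 + est / se^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

round(c(post_mean = post_mean, post_sd = post_sd), 2)   # about 0.06 and 0.09

Change the prior scale and the numbers move around, but any reasonable prior drags that 37% a long way toward zero, which is the point.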

Ok, look. Lots of people don’t like Bayesian inference, and if you don’t want to use a prior, I can’t make you do it. But then you have to recognize that reporting the simple comparison, conditional on statistical significance (however you define it) will give you a biased estimate, as discussed on pages 17-18 of this article. Unfortunately, that article appeared in a psychology journal so you can’t expect a bunch of economists to have heard about it, but, hey, I’ve been blogging about this for years, nearly a decade, actually (see more here). Other people have written about this winner’s curse thing too. And I’ve sent a couple emails to the first author of the paper pointing out this bias issue. Anyway, my preference would be to give a Bayesian or regularized treatment effect estimator, but if you don’t want to do that, then at least report some estimate of the bias of the estimator that you are using. The good news is, the looser your significance threshold, the lower your bias!

But . . . it’s early childhood intervention! Don’t you care about the children???? you may ask. My response: I do care about the children, and early childhood intervention could be a great idea. It could be great even if it doesn’t raise adult earnings at all, or if it raises adult earnings by an amount that’s undetectable by this noisy study.

Think about it this way. Suppose the intervention has a true effect of raising earnings by an average of 10%. That's a big deal, maybe not so much for an individual, but an average effect of 10% is a lot. Consider that some people won't be helped at all—that's just how things go—so an average of 10% implies that some people would be helped a whole lot. Anyway, this is a study where the standard deviation of the estimated effect is 0.22, that is, 22%. If the average effect is 10% and the standard error is 22%, then the study has very low power, and it's unlikely that a preregistered analysis would result in statistical significance, even at the 0.1 or 0.2 level or whatever it is that these folks are using. But, in this hypothetical world, the treatment would be awesome.
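
A quick calculation along those lines, again using a normal approximation for the estimator (my own back-of-the-envelope, not anything from the paper):

true_effect <- 0.10   # hypothetical true average effect
se <- 0.22            # standard error of the estimate in this study

# Probability of a "statistically significant" result at various two-sided
# thresholds when the estimator is roughly N(true_effect, se^2)
power_at <- function(alpha) {
  z <- qnorm(1 - alpha / 2)
  pnorm(-z + true_effect / se) + pnorm(-z - true_effect / se)
}
round(sapply(c(0.05, 0.10, 0.20), power_at), 2)   # about 0.07, 0.13, 0.25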

My point is, there's no shame in admitting uncertainty! The point estimate is positive; that's great. There's a lot of uncertainty, and the data are consistent with a small, tiny, zero, or even negative effect. That's just the way things go when you have noisy data. As quantitative social scientists, we can (a) care about the kids, (b) recognize that this evaluation leaves us with lots of uncertainty, and (c) give this information to policymakers and let them take it from there. I feel no moral obligation to overstate the evidence, overestimate the effect size, and understate my uncertainty.

It’s so frustrating, how many prominent academics just can’t handle criticism. I guess they feel that they’re in the right and that all this stat stuff is just a bunch of paperwork. And in this case they’re doing the Lord’s work, saving the children, so anything goes. It’s the Armstrong principle over and over again.

And in this particular case, as my colleague points out, it's not just that they haven't acknowledged or dealt with criticism of the prior paper; they're actively repeating the very same error with the very same study and data, after having been made aware of it on more than one occasion, and in this new paper they don't acknowledge the issue at all. Makes me want to scream.

P.S. When asked whether I could share my colleague’s name, my colleague replied:

Regarding quoting me, do recall that I live in the midwest and have to walk across parking lots from time to time. So please do so anonymously.

Fair enough. I don’t want to get anybody hurt.