Relating t-statistics and the relative width of confidence intervals

How much does a statistically significant estimate tell us quantitatively? If you have an estimate that’s statistically distinguishable from zero with some t-statistic, what does that say about your confidence interval?

Perhaps most simply, with a t-statistic of 2, your 95% confidence intervals will nearly touch 0. That is, they’re just about 100% wide in each direction. So they cover everything from nothing (0%) to around double your estimate (200%).

More generally, for a 95% confidence interval (CI), 1.96/t — or let’s say 2/t — gives the relative half-width of the CI. So for an estimate with t = 4, everything from half your estimate to 150% of your estimate is in the 95% CI.
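Here’s a quick sketch of the arithmetic (Python; the function name is mine):

```python
from scipy import stats

def relative_ci(t, coverage=0.95):
    """Relative CI for an estimate with t-statistic t, as fractions of the estimate."""
    z = stats.norm.ppf(1 - (1 - coverage) / 2)  # ~1.96 for 95% coverage
    half_width = z / abs(t)                     # half-width on the scale of the estimate
    return 1 - half_width, 1 + half_width

for t in [1.96, 3, 4, 5, 6]:
    lo, hi = relative_ci(t)
    print(f"t = {t}: 95% CI covers {lo:.0%} to {hi:.0%} of the estimate")
```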

For other commonly-used nominal coverage rates, the confidence intervals have a width that is less conducive to a rule of thumb, since the critical value isn’t something nice like ~2. (For example, with 99% CIs, the Gaussian critical value is 2.58.) Let’s look at 90, 95, and 99% confidence intervals for t = 1.96, 3, 4, 5, and 6:

Confidence intervals on the scale of the estimate

You can see, for example, that even at t=5, the halved point estimate is still inside the 99% CI. Perhaps this helpfully highlights how much more precision you need to confidently state the size of an effect than just to reject the null.

These “relative” confidence intervals are just this smooth function of t (and thus the p-value), as displayed here:

Confidence intervals on the scale of the estimate by p-value and t-statistic

It is only when the statistical evidence against the null is overwhelming — “six sigma” overwhelming or more — that you’re also getting tight confidence intervals in relative terms. Among other things, this highlights that if you need to use your estimates quantitatively, rather than just to reject the null, default power analysis is going to be overoptimistic.

A caveat: All of this just considers standard confidence intervals based on normal theory, labeled by their nominal coverage. Of course, many p < 0.05 estimates may have been arrived at by wandering through a garden of forking paths, or reported precisely because they passed a statistical significance filter. Conditional on that selection, these CIs are not going to have their advertised coverage.

Getting the first stage wrong

Sometimes when you conduct (or read) a study you learn you’re wrong in interesting ways. Other times, maybe you’re wrong for less interesting reasons.

Being wrong about the “first stage” can be an example of the latter. Maybe you thought you had a neat natural experiment. Or you tried a randomized encouragement to an endogenous behavior of interest, but things didn’t go as you expected. I think there are some simple, uncontroversial cases here of being wrong in uninteresting ways, but also some trickier ones.

Not enough compliers

Perhaps the standard way to be wrong about the first stage is to think there is one when there more or less isn’t — when the thing that’s supposed to produce some random or as-good-as-random variation in a “treatment” (considered broadly) doesn’t actually do much of that.

Here’s an example from my own work. Some collaborators and I were interested in how setting fitness goals might affect physical activity and perhaps interact with other factors (e.g., social influence). We were working with a fitness tracker app, and we ran a randomized experiment where we sent new notifications to randomly assigned existing users’ phones encouraging them to set a goal. If you tapped the notification, it would take you to the flow for creating a goal.

One problem: Not many people interacted with the notifications and so there weren’t many “compliers” — people who created a goal when they wouldn’t have otherwise. So we were going to have a hopelessly weak first stage. (Note that this wasn’t necessarily weak in the sense of the “weak instruments” literature, which is generally concerned about a high-variance first stage producing bias and resulting inference problems. Rather, even if we knew exactly who the compliers were — compliers are a latent stratum — it was a small enough set of people that we’d have very low power for any of the plausible second-stage effects.)

So we dropped this project direction. Maybe there would have been a better way to encourage people to set goals, but we didn’t readily have one. Now this “file drawer” might mislead people about how much you can get people to act on push notifications, or the total effect of push notifications on our planned outcomes (e.g., fitness activities logged). But it isn’t really so misleading about the effect of goal setting on our planned outcomes. We just quit because we’d been wrong about the first stage — which, to a large extent, was a nuisance parameter here, and perhaps of interest to a smaller (or at least different, less academic) set of people.

We were wrong in a not-super-interesting way. Here’s another example from James Druckman:

A collaborator and I hoped to causally assess whether animus toward the other party affects issue opinions; we sought to do so by manipulating participants’ levels of contempt for the other party (e.g., making Democrats dislike Republicans more) to see if increased contempt led partisans to follow party cues more on issues. We piloted nine treatments we thought could prime out-party animus and every one failed (perhaps due to a ceiling effect). We concluded an experiment would not work for this test and instead kept searching for other possibilities…

Similarly, here the idea is that the randomized treatments weren’t themselves of primary interest, but were necessary for the experiment to be informative.

Now, I should note that, at least with a single instrument and a single endogenous variable, pre-testing for instrument strength in the same sample that would be used for estimation introduces bias. But it is also hard to imagine how empirical researchers are supposed to allocate their efforts if they don’t give up when there’s really not much of a first stage. (And some of these cases here are cases where the pre-testing is happening on a separate pilot sample. And, again, the relevant pre-testing here is not necessarily a test for bias due to “weak instruments”.)

Forecasting reduced form results vs. effect ratios

This summer I tried to forecast the results of the newly published randomized experiments conducted on Facebook and Instagram during the 2020 elections. One of these interventions, which I’ll focus on here, replaced the status quo ranking of content in users’ feeds with chronological ranking. I stated my forecasts for a kind of “reduced form” or intent-to-treat analysis. For example, I guessed what the effect of this ranking change would be on a survey measure of news knowledge. I said the effect would be to reduce Facebook respondents’ news knowledge by 0.02 standard deviations. The experiment ended up yielding a 95% CI of [-0.061, -0.008] SDs. Good for me.

On the other hand, I also predicted that dropping the optimized feed for a chronological one would substantially reduce Facebook use. I guessed it would reduce time spent by 8%. Here I was wrong: the reduction was more than double that, with what I roughly calculate to be a [-23%, -19%] CI.

OK, so you win some, you lose some, right? I could even self-servingly say, hey, the more important questions here were about news knowledge, polarization, etc., not exactly how much time people spend on Facebook.

It is a bit more complex than that because these two predictions were linked in my head: one was a kind of “first stage” for the other, and it was the first stage I got wrong.

Part of how I made that prediction for news knowledge was by reasoning that we have some existing evidence that using Facebook increases people’s news knowledge. For example, Allcott, Braghieri, Eichmeyer & Gentzkow (2020) paid people to deactivate Facebook for four weeks before the 2018 midterms. They estimate a somewhat noisy local average treatment effect of -0.12 SDs (SE: 0.05) on news knowledge. Then I figured that my predicted 8% reduction, probably concentrated especially in “consumption” time (rather than time posting and interacting around one’s own posts), would translate into a much smaller 0.02 SD effect. I made various informal adjustments, such as a bit of “Bayesian-ish” shrinkage towards zero.

So while maybe I got the ITT right, perhaps this is partially because I seemingly got something else wrong: the effect ratio of news knowledge over time spent (some people might call this an elasticity or semi-elasticity). Now I think it turns out here that the CI for news knowledge is pretty wide (especially if one adjusts for multiple comparisons), so even if, given the “first stage” effect, I should have predicted an effect over twice as large, the CI includes that too.

Effect ratios, without all the IV assumptions

Over a decade ago, Andrew wrote about “how to think about instrumental variables when you get confused”. I think there is some wisdom here. One of the key ideas is to focus on the first stage (FS) and what sometimes is called the reduced form or the ITT: the regression of the outcome on the instrument. This sidelines the ratio of the two, ITT/FS — the ratio that is the most basic IV estimator (i.e. the Wald estimator).
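To make that concrete, here’s a minimal sketch (simulated data, coefficients made up) of the first stage, the ITT, and their ratio under random assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.integers(0, 2, n)                # random assignment (the "instrument")
u = rng.normal(size=n)                   # unobserved confounder
d = 0.5 * z + u + rng.normal(size=n)     # endogenous exposure; first-stage effect is 0.5
y = 0.3 * d + u + rng.normal(size=n)     # outcome; exclusion holds by construction here

fs = d[z == 1].mean() - d[z == 0].mean()   # first stage
itt = y[z == 1].mean() - y[z == 0].mean()  # reduced form / ITT
print(fs, itt, itt / fs)                   # the ratio recovers ~0.3 (the Wald estimator)
```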

So why am I suggesting thinking about the effect ratio, aka the IV estimand? And I’m suggesting thinking about it in a setting where the exclusion restriction (i.e. complete mediation, whereby the randomized intervention only affects the outcome via the endogenous variable) is pretty implausible. In the example above, it is implausible that the only effect of changing feed ranking is to reduce time spent on Facebook, as if that were a homogeneous bundle. Other results show that the switch to a chronological feed increased, for example, the fraction of subjects’ feeds that was political content, political news, and untrustworthy sources:

Figure 2 of Guess et al. showing effects on feed composition

Without those assumptions, this ratio can’t be interpreted as the effect of the endogenous exposure (assuming homogeneous effects) or a local average treatment effect. It’s just a ratio of two different effects of the random assignment. Sometimes in the causal inference literature there is discussion of this more agnostic parameter, labeled an “effect ratio” as I have done.

Does it make sense to focus on the effect ratio even when the exclusion restriction isn’t true?

Well, in the case above, perhaps it makes sense because I used something like this ratio to produce my predictions. (Though whether that was a sensible way to make predictions is another question.)

Second, even if the exclusion restriction isn’t true, it can be that the effect ratio is more stable across the relevant interventions. It might be that the types of interventions being tried work via two intermediate exposures (A and B). If the interventions often affect them to somewhat similar degrees (perhaps we think about the differences among interventions being described by a first principal component that is approximately “strength”), then the ratio of the effect on the outcome and the effect on A can still be much more stable across interventions than the total effect on Y (which should vary a lot with that first principal component). A related idea is explored in the work on invariant prediction and anchor regression by Peter Bühlmann, Nicolai Meinshausen, Jonas Peters, and Dominik Rothenhäusler. That work encourages us to think about the goal of predicting outcomes under interventions somewhat like those we already have data on. This can be a reason to look at these effect ratios, even when we don’t believe we have complete mediation.

[This post is by Dean Eckles. Because this post touches on research on social media, I want to note that I have previously worked for Facebook and Twitter and received funding for research on COVID-19 and misinformation from Facebook/Meta. See my full disclosures here.]

Partisan assortativity in media consumption: Aggregation

How much do people consume news media that is mainly consumed by their co-partisans? And how do new media, including social media and their “dangerous algorithms”, contribute to this?

One way of measuring the degree of this partisan co-consumption of news is a measure used in the literature on segregation and, more recently, in media economics. Gentzkow & Shapiro (2011) used this isolation index to measure “ideological segregation”:
S_m = Σ_{j ∈ m} (cons_j / cons_m) · (cons_j / visits_j) − Σ_{j ∈ m} (lib_j / lib_m) · (cons_j / visits_j)
where j indexes what’s being consumed (e.g., Fox News, a particular news article) and m indexes the medium, for comparing, e.g., TV and radio; cons_m and lib_m are the total visits by conservatives and liberals on medium m. The second term in each summation (cons_j / visits_j) is the fraction of the visits to item j made by conservatives, or conservative share. Then you can think about what the average conservative share is for an individual, or for a group, such as all liberals or conservatives. This isolation index (which, with my network science glasses on, I might call a measure of partisan assortativity) then measures the difference in the average conservative share between conservatives and liberals.
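Here’s a small sketch of that calculation (hypothetical visit counts; the function is mine):

```python
import numpy as np

def isolation_index(cons_visits, lib_visits):
    """Visit-weighted average conservative share among conservatives minus
    the same average among liberals, following Gentzkow & Shapiro (2011)."""
    cons_visits = np.asarray(cons_visits, dtype=float)
    lib_visits = np.asarray(lib_visits, dtype=float)
    share = cons_visits / (cons_visits + lib_visits)  # conservative share of each item j
    cons_avg = np.sum(cons_visits / cons_visits.sum() * share)
    lib_avg = np.sum(lib_visits / lib_visits.sum() * share)
    return cons_avg - lib_avg

print(isolation_index([90, 10], [10, 90]))  # audiences mostly split by party: 0.64
print(isolation_index([50, 50], [50, 50]))  # identical consumption patterns: 0.0
```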

Using this measure and domain-level definitions of what is being consumed (e.g., nytimes.com), Gentzkow & Shapiro (2011) wrote that:

We find that ideological segregation of online news consumption is low in absolute terms, higher than the segregation of most offline news consumption, and significantly lower than the segregation of face-to-face interactions with neighbors, co-workers, or family members. We find no evidence that the Internet is becoming more segregated over time.

Aggregation questions

With this and many similar studies of news consumption, we might worry that partisans consume content from common outlets, but consume different content there. Wealthy NYC conservatives might read the New York Times for Bret Stephens and regional news, while liberals read it for different articles. And particularly when the same domain hosts a wide range of content that is not really under the same editorial banner, it might be hard to find a scalable, consistent way to choose what that “j” index should designate. Gentzkow & Shapiro already recognized some version of this problem, which led them to remove, e.g., blogspot.com (remember Blogger?) from their analysis.

This summer there were four new papers from large teams of academics and Meta researchers. I briefly discussed the three with randomized interventions, but neglected the fourth, “Asymmetric ideological segregation in exposure to political news on Facebook”. This paper presents new estimates of the isolation index above — and looks at how this index varies as you consider different kinds of relationships individuals could have to the news. What’s the isolation index for the news users could see on Facebook given who they’re friends with and what groups they’ve joined? What’s the isolation index for the news users see in their feeds? And what’s the isolation index for what they actually interact with? This follows the waterfall in Bakshy, Messing & Adamic (2015). This new paper concludes that “ideological segregation [on Facebook] is high and increases as we shift from potential exposure to actual exposure to engagement”, consistent with what has sometimes been called a “filter bubble”.

In a letter to Science, Solomon Messing has some helpful comments on this new work, which prompted me to think about the aggregation issue I mention above, and which is perhaps one of the more common, persistent problems in social science. (We even just had a post about micro vs. macro quantities last week.) In a longer blog post, Solomon writes:

The issue is that while domain-level analysis suggests feed-ranking increases ideological segregation, URL-level analysis shows no difference in ideological segregation before and after feed-ranking. And we should strongly prefer their URL-level analysis. Domain-level analysis effectively mislabels highly partisan content as “moderate/mixed,” especially on websites like YouTube, Reddit, and Twitter.

This is pretty much all present in Figure 2. We can see (in panel A) how absolute levels of the isolation (segregation) index are much higher for URLs than domains. Then when we step through the funnel of what content people could see, do see, and interact with, there are big, qualitative differences between doing this at the level of domains (panel B) or URLs (panel C):

Figure 2A-C from González-Bailón et al. (2023) showing the segregation (or isolation) index.

Figure 2A-C from González-Bailón et al. (2023). There is a large difference in the isolation (segregation) index when measured at the level of individual articles/videos/etc. (URLs) vs. domains (panel A). This also seems to matter a lot for the differences among potential, exposed, and engaged audiences (compare panels B and C).

In his letter and associated longer blog post, Sol digs into the details a bit more, including highlighting some of the nice further analyses available in the large, information-rich appendices of the paper. That post and the reply by two of the authors (Sandra González-Bailón and David Lazer) both highlight additional heterogeneity in the gap between the “potential” and “exposed” segregation indices for different types of content and users:

Messing acknowledges that there is evidence of increased algorithmic segregation in content shared by users … Messing describes the size of these effects as “trace,” but the differences are substantively and statistically significant, as the confidence intervals around the time trends (based on a local polynomial regression) suggest. Messing states that there is no evidence of algorithmically driven increased segregation for Facebook groups, but the evidence suggests that algorithmic curation actually drives a very large and statistically significant reduction, rather than an increase, in segregation levels (figure S14C).

That is, depending on whether we are talking about URLs that could appear in users’ feeds because they were broadcast by users or pages (such as those representing businesses and publishers) or posted by users into groups (which can, e.g., be topically or regionally focused) — the difference in this isolation index between potential and exposed changes sign:

Figure S14 of González-Bailón et al.

Figure S14 of González-Bailón et al. (2023). These plots reproduce the analysis of Figure 2C above for content shared in different ways. Note that the ordering of potential and exposed reverses between A and B vs. C.

So the small differences in the segregation index between potential and exposed in the main analysis apparently mask larger differences in opposite directions.

González-Bailón and Lazer also argue we should attend to other measures of segregation in the original paper:

[Messing] overlooks Figure 2F. This panel shows that polarization (i.e., the extent to which the distribution of ideology scores is bimodal and far away from zero) goes up after algorithmic curation. In particular, the size of the homogeneous “bubble” on the ideological right grows when shifting from potential to exposed audiences. This is true both for URL- and domain- level analyses (Figure 2, E and F).

González-Bailón et al. Figure 2F

Figure 2F from González-Bailón et al. (2023). The distribution of the favorability scores of URLs for potential, exposed, and engaged audiences. The favorability score for a URL is (C−L)/(C+L), where C and L are the counts of conservatives and liberals in the audience.

There do seem to be some substantial differences in these distributions, which are presumably quite precisely estimated. I have a hard time telling what the differences in cumulative distributions might look like here. And it isn’t obvious to me what summaries of this distribution — capturing it being “bimodal and far away from zero” — are most relevant. This is one motivation for using preregistered quantities like the isolation index.

Is less aggregated better in every way?

More generally, I think it is relevant to ask whether we should always prefer the disaggregated analyses.

There are perhaps two separate questions here. First, ignoring estimation error, is what we want to know the “fully” (whatever exactly that means) disaggregated quantity? Second, in practice, working with finite data, when should we aggregate a bit to perhaps get better estimates?

Let’s take the estimation case first. Imagine we aren’t working with the view of things Meta has. We might observe a small sample of people. Then if we try to keep track of what fraction of the viewers of each article are liberals or conservatives, we are going to get a lot of 0s and 1s. So just a lot of noise. This is not classical measurement error and could add substantial bias. This might suggest aggregating a bit. (Of course, one “nice” thing about this bias is that perhaps we can readily characterize it analytically and correct for it, unlike the problems with ecological inference.) Maybe there is even a nice solution here involving a multilevel model, with, e.g., URLs nested within domains, with domain-specific means and variances.
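Here’s a quick simulation of that concern (all numbers made up): even if every URL truly has a 50/50 audience, estimating each URL’s conservative share from only a handful of panelist visits inflates the isolation index, and the inflation shrinks as visits per URL grow.

```python
import numpy as np

rng = np.random.default_rng(1)

def isolation_index(cons_visits, lib_visits):
    cons_visits = np.asarray(cons_visits, float)
    lib_visits = np.asarray(lib_visits, float)
    share = cons_visits / (cons_visits + lib_visits)
    return (cons_visits / cons_visits.sum() * share).sum() - (lib_visits / lib_visits.sum() * share).sum()

n_items = 2_000
for visits_per_item in [2, 10, 100, 10_000]:
    cons = rng.binomial(visits_per_item, 0.5, n_items)  # every item truly has a 50/50 audience
    lib = visits_per_item - cons
    print(visits_per_item, round(isolation_index(cons, lib), 3))  # roughly 0.5, 0.1, 0.01, ~0
```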

OK, now ignoring estimation with finite data: what quantity do we want to know in the first place? We can think of how partisans may consume different parts of the same content: Democrats and Republicans might read or watch different bits of the same State of the Union address. We could even pick this up in some data. For example, some consumer panels could allow us to measure exactly which parts of a YouTube video someone watches. (And, URLs can even point to specific parts of the same video — or, I’d assume more commonly, to a different YouTube video that is a clip of the same longer video.) One reason to choose to keep doing analysis at the video/URL level would be that perhaps many other relevant things are happening at the level of videos or above: Ads may be quite homogeneously targeted within sections of the same video, and the revenue is similarly shared (or not) with the channel owner. Thus, one motivation for choosing some more aggregated analysis would be addressing questions about the economics of journalism, competition in Internet services and media, etc.

In this setting, I find the URL-level analysis more compelling for addressing questions about “filter bubbles” etc. — though there are still threats to relevance or validity here too. But with less data or more concern about media economics, we might want to attend to more aggregated analyses.

[This post is by Dean Eckles. I want to note that I was an employee or contractor of Facebook (now Meta) from 2010 through 2017. I have received funding for other research from Meta, Meta has sponsored a conference I organize, and I have coauthored with Meta employees as recently as this summer. I was also recently a consultant to Twitter, ending shortly after the Musk acquisition. You can find all my disclosures here.]

 

thefacebook and mental health trends: Harvard and Suffolk County Community College

Multiple available measures indicate worsening mental health among US teenagers. Prominent researchers, commentators, and news sources have attributed this to effects of information and communication technologies (while not always being consistent on exactly which technologies or uses thereof). For example, John Burn-Murdoch at the Financial Times argues that the evidence “mounts” and he (or at least his headline writer) says that “evidence of the catastrophic effects of increased screen-time is now overwhelming”. I couldn’t help but be reminded of Andrew’s comments (e.g.) on how Daniel Kahneman once summarized the evidence about social priming in his book Thinking, Fast and Slow: “[D]isbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Like the social priming literature, much of the evidence here is similarly weak, but mainly in different (perhaps more obvious?) ways. There is frequent use of plots of aggregate time series with a vertical line indicating when some technology was introduced (or maybe just became widely-enough used in some ad hoc sense). Much of the more quantitative evidence is cross-sectional analysis of surveys, with hopeless confounding and many forking paths.

Especially against the backdrop of the poor methodological quality of much of the headline-grabbing work in this area, there are a few studies that stand out as having research designs that may permit useful causal inferences. These do indeed deserve our attention. One of these is the ambitiously-titled “Social media and mental health” by Luca Braghieri, Ro’ee Levy, and Alexey Makarin. Among other things, this paper was cited by the US Surgeon General’s advisory about social media and youth mental health.

Here “social media” is thefacebook (as Facebook was known until August 2006), a service for college students that had some familiar features of current social media (e.g., profiles, friending) but lacked many other familiar features (e.g., a feed of content, general photo sharing). The study cleverly links the rollout of thefacebook across college campuses in the US with data from a long running survey of college students (ACHA’s National College Health Assessment) that includes a number of questions related to mental health. One can then compare changes in survey respondents’ answers during the same period across schools where thefacebook is introduced at different times. Because thefacebook was rapidly adopted and initially only had within-school functionality, perhaps this study can address the challenging social spillovers ostensibly involved in effects of social media.

Staggered rollout and diff-in-diff

This is commonly called a differences-in-differences (diff-in-diff, DID) approach because in the simplest cases (with just two time periods) one is computing differences between units (those that get treated and those that don’t) in differences between time periods. Maybe staggered adoption (or staggered introduction or rollout) is a better term, as it describes the actual design (how units come to be treated), rather than a specific parametric analysis.

Diff-in-diff analyses are typically justified by assuming “parallel trends” — that the additive changes in the mean outcomes would have been the same across all groups defined by when they actually got treatment.

This is not an assumption about the design, though it could follow from one — such as the obviously very strong assumption that units are randomized to treatment timing — but rather directly about the outcomes. If the assumption is true for untransformed outcomes, it typically won’t be true for, say, log-transformed outcomes, or some dichotomization of the outcome. That is, we’ve assumed that the time-invariant unobservables enter additively (parallel trends). Paul Rosenbaum emphasizes this point when writing about these setups, describing them as uses of “non-equivalent controls” (consistent with a longer tradition, e.g., Cook & Campbell).
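To make the scale dependence concrete, here’s a toy two-period example (numbers invented): both groups grow by 20% absent treatment, so trends are parallel in logs but not additively, and the two versions of the diff-in-diff disagree.

```python
import numpy as np

# Group means: control 10 -> 12, treated 20 -> 24 (both +20%, no true effect).
control = {"pre": 10.0, "post": 12.0}
treated = {"pre": 20.0, "post": 24.0}

def did(treated, control, f=lambda x: x):
    return (f(treated["post"]) - f(treated["pre"])) - (f(control["post"]) - f(control["pre"]))

print(did(treated, control))          # 2.0: an apparent "effect" under additive parallel trends
print(did(treated, control, np.log))  # 0.0: no effect if parallel trends hold on the log scale
```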

Consider the following different variations on the simple two-period case, where some units get treated in the second period:

Three stylized differences-in-differences scenarios

Assume for a moment that traditional standard errors are tiny. In which of these situations can we most credibly say the treatment caused an increase in the outcomes?

From the perspective of a DID analysis, they basically all look the same, since we assume we can subtract off baseline differences. But, with Rosenbaum, I think it is reasonable to think that credibility is decreasing from left to right, or at least the left panel is the most credible. There we have a control group that pre-rollout looks quite similar, at least in the mean outcome, to the group that goes on to be treated. We are precisely not leaning on the double differencing — not as obviously leaning on the additivity assumption. On the other hand, if the baseline levels of the outcome are quite different, it is perhaps more of a leap to assume that we can account for this by simply subtracting off this difference. If the groups already look different, why should they change so similarly? Or maybe there is some sense in which they are changing similarly, but perhaps they are changing similarly in, e.g., a multiplicative rather than additive way. Ending up with a treatment effect estimate on the same order as the baseline difference should perhaps be humbling.

How does this relate to Braghieri, Levy & Makarin’s study of thefacebook?

Strategic rollout of thefacebook

The rollout of thefacebook started with Harvard and then moved to other Ivy League and elite universities. It continued with other colleges and eventually became available to students at numerous colleges and community colleges.

This rollout was strategic in multiple ways. First, why not launch everywhere at once? There was some school-specific work to be done. But perhaps more importantly, the leading social network service, Friendster, had spent much of the prior year being overwhelmed by traffic to the point of being unusable. Facebook co-founder Dustin Moskovitz said, “We were really worried we would be another Friendster.”

Second, the rollout worked through existing hierarchies and competitive strategy. The idea that campus facebooks (physical directories with photos distributed to students) should be digital was in the air in the Ivy League in 2003, so competition was likely to emerge, especially after thefacebook’s early success. My understanding is that thefacebook prioritized launching wherever they got wind of possible competition. Later, as this became routinized and after an infusion of cash from Peter Thiel and others, thefacebook was able to launch at many more schools.

Let’s look at the dates of the introduction of thefacebook used in this study:

Here the colors indicate the different semesters used to distinguish the four “expansion groups” in the study. There are so many schools with simultaneous launches, especially later on, that I’ve only plotted every 12th school with a larger point and its name. While there is a lot of within-semester variation in the rollout timing, unfortunately the authors cannot use that because of school-level privacy concerns from ACHA. So the comparisons are based on comparing subsets of these four groups.

Reliance on comparisons of students at elite universities and community colleges

Do these four groups seem importantly different? Certainly they are very different institutions with quite different mixes of students. They differ in more than age, gender, race, and being an international student, which many of the analyses use regression to adjust for. Do the differences among these groups of students matter for assessing effects of thefacebook on mental health?

As the authors note, there are baseline differences between them (Table A.2), including in the key mental health index. The first expansion group in particular looks quite different, with already higher levels of poor mental health. This baseline difference is not small — it is around the same size as the authors’ preferred estimate of treatment effects:

Comparison of baseline differences between expansion groups and the preferred estimates of treatment effects

This plot compares the relative magnitude of the baseline differences (versus the last expansion group) to the estimated treatment effects (the authors’ preferred estimate of 0.085). The first-versus-fourth comparison in particular stands out. I don’t think this is post hoc data dredging on my part, knowing what we do about these institutions and this rollout: these are students we ex ante expect to be most different; these groups also differ on various characteristics besides the outcome. This comparison is particularly important because it should yield two semesters of data where one group has been treated and the other hasn’t, whereas, e.g., comparing groups 2 and 3 basically just gives you comparisons during fall 2004, during which there is also a bunch of measurement error in whether thefacebook has really rolled out yet or not. So much of the “clean” exposed vs. not-yet-exposed comparisons rely on including these first and last groups.

It turns out that one needs both the first and the last (fourth) expansion groups in the analysis to find statistically significant estimates for effects on mental health. In Table A.13, the authors helpfully report their preferred analysis dropping one group at a time. Dropping either group 1 or 4 means the estimate does not reach conventional levels for statistical significance. Dropping group 1 lowers the point estimate to 0.059 (SE of 0.040), though my guess is that a Wu–Hausman-style analysis would retain the null that these two regressions estimate the same quantity (a guess the authors concurred with). (Here we’re all watching out for not presuming that the difference between stat. sig. and not is itself stat. sig.)

One way of putting this is that this study has to rely on comparisons between survey respondents at schools like Harvard and Duke, on the one hand, and a range of community colleges on the other — while maintaining the assumption that in the absence of thefacebook’s launch they would have the same additive changes in this mental health index over this period. Meanwhile, we know that the students at, e.g., Harvard and Duke have higher baseline levels of this index of poor mental health. This may reflect overall differences in baseline risks of mental illness, which then we would expect to continue to evolve in different ways (i.e., not necessarily in parallel, additively). We also can expect they were getting various other time-varying exposures, including greater adoption of other Internet services.

Summing up

I don’t find it implausible that thefacebook or present-day social media could affect mental health. But I am not particularly convinced that the analyses discussed here provide strong evidence about the effects of thefacebook (or social media in general) on mental health. This is for the reasons I’ve given — they rely on pooling data from very different schools and students who substantially differ in the outcome already in 2000–2003 — and others that maybe I’ll return to.

However, this study represents a comparatively promising general approach to studying effects of social media, particularly in comparison to much of the broader literature. For example, by studying this rollout among dense groups of eventual adopters, it can account for spillovers of peers’ use in ways neglected in other studies.

I hope it is clear that I take this study seriously and think the authors have made some impressive efforts here. And my ability to offer some of these specific criticisms depends on the rich set of tables they have provided, even if I wish we got more plots of the raw trends broken out by expansion group and student demographics.

I also want to note there is another family of analyses in the paper (looking at students within the same schools who have been exposed to different numbers of semesters of thefacebook being present) that I haven’t addressed and which corresponds to a somewhat different research design — one that aims to avoid some of the threats to validity I’ve highlighted, though it has others. This is a less typical research design, and it is not featured prominently in the paper. Perhaps this will be worth returning to.

P.S. In response to a draft version of this post, Luca Braghieri, Ro’ee Levy, and Alexey Makarin noted that excluding the first expansion group could also lead to downward bias in estimation of average effects, as (a) some of their analysis suggests larger effects for students with demographic characteristics indicating higher baseline risk of mental illness, and (b) the effects may be increasing with exposure duration (as some analyses suggest), and the first group gets more exposure. If the goal is estimating a particular, externally valid quantity, I could agree with this. But my concern is more over the internal validity of these causal inferences (really we would be happy with a credible estimate of the causal effects for pretty much any convenient subset of these schools). There, if we think the first group has higher baseline risk, we should be more worried about the parallel trends assumption.

[This post is by Dean Eckles. Thanks to the authors (Luca Braghieri, Ro’ee Levy, and Alexey Makarin), Tom Cunningham, Andrey Fradkin, Solomon Messing, and Johan Ugander for their comments on a draft of this post. Thanks to Jonathan Roth for a comment that led me to edit “not [as obviously] leaning on the additivity assumption” above to clarify unit-level additivity assumptions may still be needed to justify diff-in-diff even when baseline means match. Because this post is about social media, I want to note that I have previously worked for Facebook and Twitter and received funding for research on COVID-19 and misinformation from Facebook/Meta. See my full disclosures here.]

Confusions about inference, prediction, and “probability of superiority”

People sometimes confuse certainty about summary statistics with certainty about draws from the distributions they summarize. Saying that we are quite confident that the average outcome for one group is higher than the average for the other can be taken as a claim about the full distributions of the outcomes. And intuitions people might have about the relationship between the two from settings they know well are quickly broken when considering other settings (e.g., much larger sample sizes, outcomes measured with greater coarseness).

Sam Zhang, Patrick Heck, Michelle Meyer, Christopher Chabris, Daniel Goldstein, and Jake Hofman study this confusion in their recently published paper. Among other things, by conducting these studies with samples of data scientists, faculty, and medical professionals, this new work highlights that this is a confusion that experts seem to make as well, thereby building on prior work with laypeople by an overlapping set of authors (Jake Hofman, Dan Goldstein, and Jessica Hullman), which has been discussed here previously. They also encourage, as often comes up here, plotting the data:

[T]he pervasive focus on inferential uncertainty in scientific data visualizations can mislead even experts about the size and importance of scientific findings, leaving them with the impression that effects are larger than they actually are. … Fortunately, we have identified a straightforward solution to this problem: when possible, visually display both outcome variability and inferential uncertainty by plotting individual data points alongside statistical estimates.

One way of plotting the data (alongside means and associated inference) is shown in their Figure 1:

Figure 1 from Zhang et al. showing different visualizations of two synthetic data sets.

So I like the admonition here to plot the data, or at least the distribution. Perhaps this also functions as a nice encouragement for researchers to look at these distributions, which apparently is not as common as one might think. Overall, I agree that there is clear evidence that people, including experts, mistake inferential certainty (about means and mean effects) for predictive certainty.

Here what I want to probe is one of their measures of predictive uncertainty, which raises questions about exactly what quantifying predictive uncertainty in the context of causal inference and decision making is good for.

“Probability of superiority”

These studies quantify predictive certainty in multiple ways. One of them is having participants specify a histogram of outcomes for patients in the different groups, using an implementation of Distribution Builder. But these also have participants estimate the “probability of superiority”. The first paper describes this as: “the probability that undergoing the treatment provides a benefit over a control condition” (Hofman, Goldstein, Hullman, 2020, p. 3).

This isn’t quite right as a description of this quantity, except under additional strong assumptions — assumptions that are certainly false in major ways in all the behavioral and social science applications that come to mind. (I want to note here also that this misuse appears to be common, so this is not at all an error specific to this earlier work by these authors; see below for more examples.)

“Probability of superiority” (or PoS; in some other work given other names as well, like “common language effect size”) is defined as the probability that a sample from one distribution is larger than a sample from another, usually treating ties as broken randomly (so counted as 0.5). It is a label for a scaled version of the U-statistic of the Mann–Whitney U test / Wilcoxon rank-sum test. So it is correct to say that it is the probability that a random patient assigned to treatment has a better outcome than a random patient assigned to control. But this may tell us precious little about the distribution of treatment effects, which is related to what is often called the fundamental problem of causal inference.

First, even in a very simple setup, it is possible to get values of PoS arbitrarily close to 1/2 (that is, indicating essentially no effect) while in fact everyone (100%) is benefitting from treatment. To see this and other points here, it is useful to think in terms of potential outcomes, where Yi(0) and Yi(1) are the outcomes for unit i if they were assigned to control and treatment, respectively. Simply posit a constant treatment effect τ, so that Yi(1) = Yi(0) + τ. Then if τ > 0, everyone benefits from treatment. However, it is possible to have a PoS arbitrarily close to 1/2 by making the distribution of Yi(0) sufficiently dispersed. Now PoS still does say something about the distributions of Yi(1) and Yi(0), but not much about their joint distribution. Even short of this exactly additive treatment effect model, we usually think that there is a lot of common variation, such that Yi(0) and Yi(1) are positively correlated (even if not perfectly so, as with a homogeneous additive effect).
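A tiny simulation of that first point (numbers made up): with a constant benefit of 1 for every single unit, PoS drifts toward 1/2 as the spread of the outcome grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
tau = 1.0                                  # constant effect: every unit benefits by exactly tau

for sd in [0.5, 5, 50]:
    y0 = rng.normal(0, sd, n)              # potential outcomes under control
    y1 = y0 + tau                          # potential outcomes under treatment
    benefit = (y1 > y0).mean()             # within-unit comparison: always 1.0
    # PoS compares independent draws from the two marginal distributions,
    # like a random treated patient vs. a random control patient.
    pos = (rng.choice(y1, n) > rng.choice(y0, n)).mean()
    print(f"sd = {sd}: share benefitting = {benefit:.0%}, PoS = {pos:.3f}")
```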

I think some of the confusion here arises from thinking of PoS as Pr(Yi(1) > Yi(0)), when really one needs to drop the indices or treat them differently, decoupling them. Maybe it is helpful to remember that PoS is just a function of the two marginal distributions of Yi(0) and Yi(1).

These problems can get more severe, including allowing reversals, if there are heterogeneous effects of treatment. Hand (1992) points out that Pr(Yi(1) > Yi(0)) can be very different from Pr(Yj(1) > Yk(0)), presenting this simple artificial example. Let (Yi(0), Yi(1)) have equal probability on (5, 0), (1, 2), and (3, 4). Pr(Yj(1) > Yk(0)) = 1/3, so PoS says we should prefer control. But Pr(Yi(1) > Yi(0)) = 2/3: the majority of units have positive treatment effects.
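Spelling out the arithmetic of Hand’s example:

```python
from itertools import product

# Each (Y(0), Y(1)) pair occurs with probability 1/3.
pairs = [(5, 0), (1, 2), (3, 4)]

# Within-unit: how often does the same unit do better under treatment?
within = sum(y1 > y0 for y0, y1 in pairs) / len(pairs)

# Across units (PoS): a random treated unit vs. an independent random control unit.
across = sum(b1 > a0 for (a0, _), (_, b1) in product(pairs, pairs)) / len(pairs) ** 2

print(within, across)  # 2/3 vs. 1/3
```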

So PoS can be quite a poor guide to decisions. Fun, trickier problems can also arise, as PoS is also intransitive.

To some degree, the problem here is just that PoS can appear to offer something that is basically impossible: a totally nonparametric way to quantify effect sizes for decision-making. Thomas Lumley explains:

Suppose you have a treatment that makes some people better and other people worse, and you can’t work out in advance which people will benefit. Is this a good treatment? The answer has to depend on the tradeoffs: how much worse and how much better, not just on how many people are in each group.

If you have a way of making the decision that doesn’t explicitly evaluate the tradeoffs, it can’t possibly be right. The rank tests make the tradeoffs in a way that changes depending on what treatment you’re comparing to, and one extreme consequence is that they can be non-transitive. Much more often, though, they can just be misleading.

It’s possible to prove that every transitive test reduces each sample to a single number and then compares those numbers [equivalent to Debreu’s theorem in utility theory]. That is, if you want an internally consistent ordering over all possible results of your experiment, you can’t escape assigning numerical scores to each observation.

Overall, this leads to my conclusion that, at least for most purposes related to evaluating treatments, PoS is not recommended. In their new paper, Zhang et al. do continue using PoS, but they also no longer give it the definition above, at least explicitly avoiding this misunderstanding. It is interesting to think about how to recast the general phenomenon they are studying in a way that more forcefully avoids this potential confusion. It is not obvious to me that a standard paradigm of treatment choice or willingness-to-pay for treatment involves the need to account for this predictive uncertainty.

PoS and AUC

Does PoS have some sensible uses here?

I want to highlight one point that Dan Goldstein made last year: “Teachers, principals, small town mayors are reading about treatments with tiny effect sizes and thinking they’ll have a visible effect in their organizations”.

Dan intended this as a comment on the need for intuitions about statistical power in planning field studies, but here’s what it made me think: Sometimes people are deciding whether to implement some intervention. It might be costly, including that they are in some sense spending social capital. They might also be deciding how prominently to announce their decision. It is then going to be important for them whether their unit’s outcome will be better than the outcomes of some comparison units (e.g., nearby classrooms, schools — or recent classroom–years or school–years) where it was absent. Maybe PoS tells them something about this. They aren’t trying to do power calculations exactly, but they are trying to answer the question: If I do this (and perhaps advertise I’m doing this thing), are my outcomes going to look comparatively good?

This also fits with the artificial choice setting the first paper gave participants, where participants are giving their willingness to pay for something that could improve their time in a race, but they should only care about winning the race. (Of course, one still might worry that, in a race, there is shared variance from, e.g., wind, so a PoS computed from unpaired outcomes will be misleading. Similarly, there are common factors affecting two classrooms in the same school.)

But maybe PoS is useful in that kind of a setting. This makes sense given that PoS is just the area under the curve (AUC) for a sequence of classifiers that threshold the outcomes to guess the label (in our examples, treatment or control). This highlights that PoS is perhaps most useful in the opposite direction of the main way it is promoted under that label (as opposed to the AUC label): You want to say something about how much treatment observations stand out compared with control observations. Perhaps only rarely (e.g., the example in the previous paragraph) does this provide the information you want to choose treatments, but it is useful in other ways.
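A quick numerical check of that equivalence (simulated data; using scikit-learn’s AUC only for comparison):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
control = rng.normal(0.0, 1.0, 5_000)
treated = rng.normal(0.3, 1.0, 5_000)

# PoS by brute force: fraction of (treated, control) pairs where the treated outcome is larger.
pos = (treated[:, None] > control[None, :]).mean()

# AUC of "classifying" treatment status from the outcome alone.
labels = np.r_[np.ones_like(treated), np.zeros_like(control)]
auc = roc_auc_score(labels, np.r_[treated, control])

print(round(pos, 4), round(auc, 4))  # the two match (up to tie handling)
```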

Inference, then perhaps prediction

One interesting observation is that in the central example used in the studies with data scientists and faculty, the real-world inference is itself quite uncertain. The task is adapted from a study that ostensibly provided evidence that exposure to violent video games causes aggressive behavior in a subsequent reaction time task (in particular, subjecting others to louder/longer noise, after they have done the same). The original result in that paper is:

Most importantly, participants who had played Wolfenstein 3D delivered significantly longer noise blasts after lose trials than those who had played the nonviolent game Myst (Ms = 6.81 and 6.65), F(1, 187) = 4.82, p < .05, MSE = .27. In other words, playing a violent video game increased the aggressiveness of participants after they had been provoked by their opponent’s noise blast.

bar graph of results

Figure 6 of Anderson & Dill (2000): “Main effects of video game and trait irritability on aggression (log duration) after “Lose” trials, Study 2.”

Hmm what is that p-value there? Ah p = 0.03. Particularly given forking paths (there were no stat. sig. effects for noise loudness, and this result is only for some trials) and research practices in psychology over 20 years ago, I think it is reasonable to wonder whether there is much evidence here at all. (Here is some discussion of this broader body of evidence by Joe Hilgard.)

As for that plot, I can, with Zhang et al., agree maybe that some other way of visualizing these results might have better conveyed the (various) sources of uncertainty we have here.

Researchers and other readers of the empirical literature are often in the situation of trying to understand whether there is much basis for inference about treatment effects at all. In this case, we barely have enough data to possibly conclude there’s some weak evidence of any difference between treatment and control. We’re going to have a hard time saying anything really about the scale of this effect, whether measured in the difference in means or PoS.

Maybe things are changing. There are “changing standards within experimental psychology around statistical power and sample sizes” (SI). So perhaps there is room, given greater inferential certainty, for measures of predictability of outcomes to become more relevant in the context of randomized experiments. However, I would caution that rote use of quantities like PoS — which really has a very weak relationship with anything relevant to, e.g., willingness-to-pay for a treatment — may spawn new, or newly widespread, confusions.

What uses for PoS in understanding treatment effects and making decisions have I missed?


[This post is by Dean Eckles. Thanks to Jake Hofman and Dan Goldstein for responding with helpful comments to a draft.]

P.S.: Other examples of confusion about PoS in the literature

Here’s an example from a paper directly about PoS and advocating its use:

An estimate of [PoS] may be easier to understand than d or r, especially for those with little or no statistical expertise… For example, rather than estimating a health benefit in within-group SD units or as a correlation with group membership, one can estimate the probability of better health with treatment than without it. (Ruscio & Mullen, 2012)

In other cases, things are written with a fundamental ambiguity:

For example, when one is comparing a treatment group with a control group, [PoS] estimates the probability that someone who receives the treatment would fare better than someone who does not. (Ruscio, 2008)

Faculty position in computation & politics at MIT

We have this tenure-track Assistant Professor position open at MIT. It is an unusual opportunity in being a shared position between the Department of Political Science and the College of Computing. (I say “unusual” compared with typical faculty lines, but by now MIT has hired faculty into several such shared positions.)

So we’re definitely inviting applications not just from social science PhDs, but also from, e.g., statisticians, mathematicians, and computer scientists:

We seek candidates whose research involves development and/or intensive use of computational and/or statistical methodologies, aimed at addressing substantive questions in political science.

Beyond advertising this specific position, perhaps this is an interesting example of the institutional forms that interdisciplinary hiring can take. Here the appointment would be in the Department of Political Science and then also within one of the relevant units of the College of Computing. And there are two search committees working together, one from the Department and one from the College. I am serving on the latter, which includes experts from all parts of the College.

[This post is by Dean Eckles.]

Mundane corrections to the dishonesty literature

There is a good deal of coverage of the more shocking reasons that papers on the psychology of dishonesty by Dan Ariely and Francesca Gino need to be corrected or retracted. I thought I’d share a more mundane example — in this same literature, and in fact in the very same series of papers.

There is no allegation of further fraud here; the errors are mundane. But maybe this is relevant to challenges in correcting the scientific record, etc.

Back in August 2021, Data Colada published the initial evidence of fraud in the field experiment in Shu, Mazar, Gino, Ariely & Bazerman (2012). They were able to do this because Kristal, Whillans, Bazerman, Gino, Shu, Mazar & Ariely (2020), which primarily reported failures to replicate the original lab experimental results, also reported some problems with the field experimental data (covariate imbalance inconsistent with randomization) and shared the spreadsheet with this data.

So I clicked through to the newer (2020) paper to check out the results. I came across this paragraph, reporting the main results from the preregistered direct replication (Study 6):

We failed to detect an effect of signing first on all three preregistered outcomes (percent of people cheating per condition, t[1,232.8] = −1.50, P = 0.8942, d = −0.07 95% confidence interval [CI] [−1.96, 0.976]; amount of cheating per condition, t[1,229.3] = −0.717, P = 0.7633, d = −0.04 95% CI[−1.96, 0.976]; and amount of expenses reported, t[1,208.9] = −1.099, P = 0.864, d = −0.06 95% CI[−1.96, 0.976]). The Bayes factors for these three outcome measures were between 7.7 and 12.5, revealing substantial support for the null hypothesis (6). This laboratory experiment provides the strongest evidence to date that signing first does not encourage honest reporting.

A couple things jumped out here. First, this text says the point estimate for the effect of signing at the top on amount of cheating is d = −0.04, but Figure 1 in the paper says it is d = 0.04:

Figure 1 of Kristal et al. (2020)

Figure 1 of Kristal et al. (2020), where Study 6 is the pre-registered direct replication. [Update: This apparently has the wrong sign for all of the estimates here.]

So somehow the sign got switched somewhere.

Second, if you look at that paragraph again, there are some unusual things going on with the confidence intervals. They are all the same and aren’t really on the right scale or centered anywhere near the point estimates. In fact, it seems like a critical value (which would be ±1.96 for a z-test) and a cumulative fraction (which would be .025 and .975) got accidentally reported as the lower and upper ends of the 95% intervals. I imagine this could happen if doing these calculations in a spreadsheet.
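For what it’s worth, here’s a rough reconstruction (my assumption: roughly equal group sizes, so d ≈ 2t/√df) of what a sensible 95% CI would look like for one of these estimates:

```python
import numpy as np

t, df = -0.717, 1229.3        # as reported for "amount of cheating" in Study 6
d = 2 * t / np.sqrt(df)       # roughly -0.04, matching the reported point estimate
se = 2 / np.sqrt(df)          # approximate SE of d under the equal-groups assumption
print(round(d, 3), round(d - 1.96 * se, 3), round(d + 1.96 * se, 3))
# Roughly -0.04 [-0.15, 0.07]: centered near the estimate, unlike the reported
# [-1.96, 0.976], which look like a z critical value and a cumulative probability.
```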

So in August 2021 I emailed the first author and Francesco Gino to report that something was wrong here, concluding by saying: “Seems like this is just a reporting error, but I can imagine this might create even more confusion if not corrected.”

Professor Gino thanked me for bringing this to their attention. I followed up in October 2021 to provide more detail about my concerns about the CIs and ask:

This line of work came up the other day, and this prompted me to check on this and noticed there hasn’t been a correction issued, at least that I saw. Is that in the works?

First author Ariella Kristal helpfully immediately responded with [see update below] their understanding of the errors at the time (that the correct point estimate is positive, d = 0.04), and said a correction had not yet been submitted but they were “hoping to issue the correction ASAP”. OK, these things can take a little time — obviously important to make these corrections with care!

But still I was a bit disappointed when, in February 2022, I noticed that there was not yet any correction to the paper. So I emailed the editorial team at PPNAS, where this paper was published, writing in part:

I notified the authors of these problems in August.

I’m wondering if there is any progress on getting this article corrected? Have the authors requested it be corrected? (Their earlier response to me was somewhat ambiguous about whether PNAS had been contacted by them yet.)

I’m a bit surprised nothing visible has happened despite the passage of six months.

Staff confirmed then that a correction had been requested in October, but that the matter was still under review. (In retrospect, I can now wonder whether perhaps by this point this had become tied up in broader concerns about papers by Gino.)

In September 2022, with over a year passed since my initial email to the authors, I thought I should at least post a comment on PubPeer, so other readers might find some documentation of this issue.

As of writing this post, there is still no public notice of any existing or pending correction to “Signing at the beginning versus at the end does not decrease dishonesty”.

Of course, maybe this doesn’t really matter so much. The main result of the paper really is still a null result, and nothing key turns on whether the point estimate is 0.04 or −0.04 (I had thought it was the former, but now gather that it is the latter). And there is open data for this paper, so anyone who really wants to dig into it could figure out what the correct calculation is.

But maybe it is worth reflecting on just how slowly this is being corrected. I don’t know whether any of my emails after the first one helped move this along; maybe everything beyond that first, easy-to-write email did nothing. Perhaps my lesson here should be to post publicly (e.g., on PubPeer) with less of a delay.

Update: After posting the above, Ariella Kristal, the first author of this study, contacted me to share this document, which details the corrections to the paper. As a result, I’ve edited my statements above about what the correct numbers are, as the correct value is apparently d = –0.04 after all. She also emphasized that she contacted the journal about this matter several times as well.

[This post is by Dean Eckles.]

New research on social media during the 2020 election, and my predictions

Back in 2020, leading academics and researchers at the company now known as Meta put together a large project to study social media and the 2020 US elections — particularly the roles of Instagram and Facebook. As Sinan Aral and I had written about how many paths for understanding effects of social media in elections could require new interventions and/or platform cooperation, this seemed like an important development. Originally the idea was for this work to be published in 2021, but there have been some delays, including simply because some of the data collection was extended as what one might call “election-related events” continued beyond November and into 2021. As of 2pm Eastern today, the news embargo for this work has been lifted on the first group of research papers.

I had heard about this project a long time ago and, frankly, had largely forgotten about it. But this past Saturday, I was participating in the SSRC Workshop on the Economics of Social Media and one session was dedicated to results-free presentations about this project, including the setup of the institutions involved and the design of the research. The organizers informally polled us with qualitative questions about some of the results. This intrigued me. I had recently reviewed an unrelated paper that included survey data from experts and laypeople about their expectations about the effects estimated in a field experiment, and I thought this data was helpful for contextualizing what “we” learned from that study.

So I thought it might be useful, at least for myself, to spend some time eliciting my own expectations about the quantities I understood would be reported in these papers. I’ve mainly kept up with the academic and grey literature, I’d previously worked in the industry, and I’d reviewed some of this for my Senate testimony back in 2021. Along the way, I tried to articulate where my expectations and remaining uncertainty were coming from. I composed many of my thoughts on my phone Monday while taking the subway to and from the storage unit I was revisiting and then emptying in Brooklyn. I got a few comments from Solomon Messing and Tom Cunningham, and then uploaded my notes to OSF and posted a cheeky tweet.

Since then, starting yesterday, I’ve spoken with journalists and gotten to view the main text of papers for two of the randomized interventions for which I made predictions. These evaluated effects of (a) switching Facebook and Instagram users to a (reverse) chronological feed, (b) removing “reshares” from Facebook users’ feeds, and (c) downranking content by “like-minded” users, Pages, and Groups.

My guesses

My main expectations for those three interventions could be summed up as follows. These interventions, especially chronological ranking, would each reduce engagement with Facebook or Instagram. This makes sense if you think the status quo is somewhat-well optimized for showing engaging and relevant content. So some of the rest of the effects — on, e.g., polarization, news knowledge, and voter turnout — could be partially inferred from that decrease in use. This would point to reductions in news knowledge, issue polarization (or coherence/consistency), and small decreases in turnout, especially for chronological ranking. This is because people get some hard news and political commentary they wouldn’t have otherwise from social media. These reduced-engagement-driven effects should be weakest for the “soft” intervention of downranking some sources, since content predicted to be particularly relevant will still make it into users’ feeds.

Besides just reducing Facebook use (and everything that goes with that), I also expected swapping out feed ranking for reverse chron would expose users to more content from non-friends via, e.g., Groups, including large increases in untrustworthy content that would normally rank poorly. I expected some of the same would happen from removing reshares, which I expected would make up over 20% of views under the status quo, and so would be filled in by more Groups content. For downranking sources with the same estimated ideology, I expected this would reduce exposure to political content, as much of the non-same-ideology posts will be by sources with estimated ideology in the middle of the range, i.e. [0.4, 0.6], which are less likely to be posting politics and hard news. I’ll also note that much of my uncertainty about how chronological ranking would perform was because there were a lot of unknown but important “details” about implementation, such as exactly how much of the ranking system really gets turned off (e.g., how much likely spam/scam content still gets filtered out in an early stage?).

How’d I do?

Here’s a quick summary of my guesses and the results in these three papers:

Table of predictions about effects of feed interventions and the results

It looks like I was wrong in that the reductions in engagement were larger than I predicted: e.g., chronological ranking reduced time spent on Facebook by 21%, rather than the 8% I guessed, which was based on my background knowledge, a leaked report on a Facebook experiment, and this published experiment from Twitter.

Ex post, I hypothesize that this is because the duration of these experiments allowed for continual declines in use over months, with various feedback loops (e.g., users with chronological feed log in less, so they post less, so they get fewer likes and comments, so they log in even less and post even less). As I dig into the 100s of pages of supplementary materials, I’ll be looking to understand what these declines looked like at earlier points in the experiment, such as by election day.

My estimates for the survey-based outcomes of primary interest, such as polarization, were mainly covered by the 95% confidence intervals, with the exception of two outcomes from the “no reshares” intervention.

One thing to note is that all these papers report weighted estimates for a broader population of US users (population average treatment effects, PATEs), which are less precise than the unweighted (sample average treatment effect, SATE) results. Here I focus mainly on the unweighted results, as I did not know there was going to be any weighting, and these also give the narrower, and thus riskier, CIs for me. (There seems to have been some mismatch between the outcomes listed in the talk I saw and what’s in the papers, so I didn’t make predictions for some reported primary outcomes, and some outcomes I made predictions for don’t seem to be reported, or I haven’t found them in the supplements yet.)

Now is a good time to note that I basically predicted what psychologists armed with Jacob Cohen’s rules of thumb might, extrapolating downward, call “minuscule” effect sizes. All my predictions for survey-based outcomes were 0.02 standard deviations or smaller. (Recall Cohen’s rules of thumb say 0.2 is small, 0.5 medium, and 0.8 large.)

Nearly all the results for these outcomes in these two papers were indistinguishable from the null (p > 0.05), with standard errors for survey outcomes at 0.01 SDs or more. This is consistent with my ex ante expectations that the experiments would face severe power problems, at least for the kind of effects I would expect. Perhaps by revealed preference, a number of other experts had different priors.

A rare p < 0.05 result is that chronological ranking reduced news knowledge by 0.035 SDs with 95% CI [-0.061, -0.008], which includes my guess of -0.02 SDs. Removing reshares may have reduced news knowledge even more than chronological ranking — and by more than I guessed.

Even with so many null results, I was still sticking my neck out a bit compared with just guessing zero everywhere, since in some cases if I had put the opposite sign my estimate wouldn’t have been in the 95% CI. For example, downranking “like-minded” sources produced a CI of [-0.031, 0.013] SDs, which includes my guess of -0.02, but not its negation. On the other hand, I got some of these wrong: I guessed removing reshares would reduce affective polarization, but a -0.02 SD effect is outside the resulting [-0.005, +0.030] interval.

It was actually quite a bit of work to compare my predictions to the results, because I didn’t really know a lot of key details about exact analyses and reporting choices, which, strikingly, even differ a bit across these three papers. So, with a lot of reading and a bit of arithmetic, I might yet find more places where I was wrong. (Feel free to point these out.)

Further reflections

I hope that this helps to contextualize the present results with expert consensus — or at least my idiosyncratic expectations. I’ll likely write a bit more about these new papers and further work released as part of this project.

It was probably an oversight for me not to make any predictions about the observational paper looking at polarization in exposure and consumption of news media. I felt like I had a better handle on thinking about simple treatment effects than these measures, but perhaps that was all the more reason to make predictions. Furthermore, given the limited precision of the experiments’ estimates, perhaps it would have been more informative (and riskier) to make point predictions about these precisely estimated observational quantities.

[This post is by Dean Eckles. I want to note that I was an employee or contractor of Facebook (now Meta) from 2010 through 2017. I have received funding for other research from Meta, Meta has sponsored a conference I organize, and I have coauthored with Meta employees as recently as earlier this month. I was also recently a consultant to Twitter, ending shortly after the Musk acquisition. You can find all my disclosures here.]

Effect size expectations and common method bias

I think researchers in the social sciences often have unrealistic expectations about effect sizes. This has many causes, including publication bias (and selection bias more generally) and forking paths. Old news here.

Will Hobbs pointed me to his (PPNAS!) paper with Anthony Ong that highlights and examines another cause: common method bias.

Common method bias is the well-known (in some corners at least) phenomenon whereby common variance specific to variables measured through the same method can produce bias. You can come up with many mechanisms for this. Variables measured in the same questionnaire can be correlated because of consistency motivations, a shared tendency to give socially desirable responses, similar uses of similar response scales, etc.

Many of these biases result in inflated correlations. Hobbs writes:

[U]nreasonable effect size priors is one of my main motivations for this line of work.

A lot of researchers seem to consider effect sizes meaningful only if they’re comparable to the huge observational correlations seen among subjective closed-ended survey items.

But often the quantities we really care about — or at least we are planning more ambitious field studies to estimate — are inherently going to be not measured in the same ways. We might assign a treatment and measure a survey outcome. We might measure a survey outcome, use this to target an intervention, and then look at outcomes in administrative data (e.g., income, health insurance data).

On this blog, tiny studies, forking paths, and selection bias probably get the most coverage as causes of inflated effect size expectations. So this is a good reminder that there are plenty of other causes, like common method bias and confounding more generally, even with big samples or pre-registered analysis plans.
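To see how shared method variance can manufacture a correlation out of nothing, here’s a minimal simulation sketch. All the numbers are made up; the “method” factor stands in for something like acquiescence or social desirability:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

trait_x = rng.normal(size=n)     # construct of interest
trait_y = rng.normal(size=n)     # a second construct, truly uncorrelated with the first
method = rng.normal(size=n)      # per-respondent response style shared across survey items

# Both constructs measured in the same questionnaire pick up the shared method factor
x_survey = trait_x + 0.8 * method + rng.normal(size=n)
y_survey = trait_y + 0.8 * method + rng.normal(size=n)

# The same y construct measured a different way (e.g., administrative records)
y_admin = trait_y + rng.normal(size=n)

print(np.corrcoef(x_survey, y_survey)[0, 1])  # about 0.24 here, despite zero true correlation
print(np.corrcoef(x_survey, y_admin)[0, 1])   # about zero
```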

This post is by Dean Eckles.


Postdoctoral position at MIT: privacy, synthetic data, fairness & causal inference

I have appreciated Jessica’s recent coverage of differential privacy and related topics on this blog — especially as I’ve also started working in this general area.

So I thought I’d share this new postdoc position that Manish Raghavan and I have here at MIT, where these topics are an important focus. Here’s some of the description of the broad project area, which this researcher would help shape:

This research program is working to understand and advance techniques for sharing and using data while limiting what is revealed about any individual or organization. We are particularly interested in how privacy-preserving technologies interface with recent developments in high-dimensional statistical machine learning (including foundation models), questions about fairness of downstream decisions, and with causal inference. Applications include some in government and public policy (e.g., related to US Census Bureau data products) and increasing use in multiple industries (e.g., tech companies, finance).

While many people with relevant expertise might be coming from CS, we’re also very happy to get interest from statisticians — who have a lot to add here!

This post is by Dean Eckles.

When plotting all the data can help avoid overinterpretation

Patrick Ruffini and David Weakliem both looked into this plot that’s been making the rounds, which seems to suggest a sudden drop in some traditional values:

Percent who say these values are 'very important' to them

But the survey format changed between 2019 and 2023, both moving online and randomizing the order of response options.

Perhaps one clue that you shouldn’t draw sweeping conclusions specific to these values is that there is a drop in the importance of “self-fulfillment” and “tolerance” too. Weakliem writes that once you collapse a couple response options…

there’s little change–they are almost universally regarded as important at all three times. The results for “self-fulfillment,” which isn’t mentioned in the WSJ article, are particularly interesting–the percent rating it as very important fell from 64% in 2019 to 53% in 2023. That’s hard to square with either the growing selfishness or the social desirability interpretations, but is consistent with my hypothesis. These figures indicate some changes in the last few years, but not the general collapse of values that is being claimed.

If the importance of everything drops at once, that might be a clue that selectively interpreting a few thematically related drops is not justified — whether the across-the-board change is due to the survey format changes or something else (say, something not asked about becoming comparatively more important).

So perhaps this is a good reminder of the benefits of plotting more of the data — even if you want to argue the action is all in a few of the items. (You could even think of this as something like a non-equivalent comparison group or differences-in-differences design.)

Update: Here is a plot I made from the numbers from the Weakliem post. In making this plot, I came up with one guess about why the original plot has this weird x-axis: when making it with a properly scaled x-axis of years, you can easily run into problems with the tick labels running into each other. (Note that I copied the original use of “’23” as a shortening of 2023.)

Small multiples of WSJ/NORC survey data
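If you want to make this kind of small-multiples plot yourself, here’s a minimal matplotlib sketch. The item names and numbers below are placeholders only (see the Weakliem post for the actual WSJ/NORC figures); the point is the properly scaled year axis with short tick labels:

```python
import matplotlib.pyplot as plt

years = [1998, 2019, 2023]
# Placeholder item names and values only -- not the actual WSJ/NORC numbers
items = {
    "Item A": [70, 60, 40],
    "Item B": [60, 50, 40],
    "Item C": [50, 60, 30],
    "Item D": [65, 64, 53],
}

fig, axes = plt.subplots(1, len(items), figsize=(10, 2.5), sharey=True)
for ax, (name, vals) in zip(axes, items.items()):
    ax.plot(years, vals, marker="o")
    ax.set_title(name)
    ax.set_xticks(years)
    ax.set_xticklabels(["'98", "'19", "'23"])  # short labels keep the scaled axis readable
    ax.set_ylim(0, 100)
axes[0].set_ylabel("% 'very important'")
fig.tight_layout()
plt.show()
```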

[This post is by Dean Eckles.]

Successful randomization and covariate “imbalance” in a survey experiment in Nature

Last year I wrote about the value of testing observable consequences of a randomized experiment having occurred as planned. For example, if the randomization was supposedly Bernoulli(1/2), you can check that the number of units in treatment and control in the analytical sample isn’t so inconsistent with that; such tests are quite common in the tech industry. If you have pre-treatment covariates, then it can also make sense to test that they are not wildly inconsistent with randomization having occurred as planned. The point here is that things can go wrong in the treatment assignment itself or in how data is recorded and processed downstream. We are not checking whether our randomization perfectly balanced all of the covariates. We are checking our mundane null hypothesis that, yes, the treatment really was randomized as planned. Even if there is just a small difference in proportion treated or a small imbalance in observable covariates, if this is highly statistically significant (say, p < 1e-5), then we should likely revise our beliefs. We might be able to salvage the experiment if, say, some observations were incorrectly dropped (one can also think of this as harmless attrition not being so harmless after all).
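For the sample-count check, one simple version is just a binomial test of the observed split against the design probability. A minimal sketch with made-up counts:

```python
from scipy import stats

# Hypothetical counts in the analytical sample under a Bernoulli(1/2) design
n_treat, n_control = 50_541, 49_459

result = stats.binomtest(n_treat, n_treat + n_control, p=0.5)
print(result.pvalue)  # ~6e-4 here: small enough that I'd go looking for dropped or duplicated units
```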

The argument against doing or at least prominently reporting these tests is that they can confuse readers and can also motivate “garden of forking paths” analyses with different sets of covariates than planned. I recently encountered some of these challenges in the wild. Because of open peer review, I can give a view into the review process for the paper where this came up.

I was a peer reviewer for this paper, “Communicating doctors’ consensus persistently increases COVID-19 vaccinations”, now published in Nature. It is an impressive experiment embedded in a multi-wave survey in the Czech Republic. The intervention provides accurate information about doctors’ trust in COVID-19 vaccines, which people perceived to be lower than it really was. (This is related to some of our own work on people’s beliefs about others’ vaccination intentions.) The paper presents evidence that this increased vaccination:

Fig. 4

This figure (Figure 4 from the published version of the paper) shows the effects by wave of the survey. Not all respondents participated in each wave, so there is a “full sample”, which includes a varying set of people over time, and a “fixed sample”, which includes only those who are in all waves. More immediately relevant, there are two sets of covariates used here: a pre-registered set and a set selected using L1-penalized regression.

This differs from a prior version of the paper, which actually didn’t report the preregistered set, motivated by concerns about imbalance in covariates that had been left out of that set. In my first peer review report, I wrote:

Contrary to the pre-analysis plan, the main analyses include adjustment for some additional covariates: “a non-pre-specified variable for being vaccinated in Wave0 and Wave0 beliefs about the views of doctors. We added the non-specified variables due to a detected imbalance in randomization.” (SI p. 32)

These indeed seem like relevant covariates to adjust for. However, this kind of data-contingent adjustment is potentially worrying. If there were indeed a problem with randomization, one would want to get to the bottom of that. But I don’t see much evidence than anything was wrong; it is simply the case that there is a marginally significant imbalance (.05 < p < .1) in two covariates and a non-significant (p > .1) imbalance in another — without any correction for multiple hypothesis testing. This kind of data-contingent adjustment can increase error rates (e.g., Mutz et al. 2019), especially if no particular rule is followed, creating a “garden of forking paths” (Gelman & Loken 2014). Thus, unless the authors actually think randomization did not occur as planned (in which case perhaps more investigation is needed), I don’t see why these variables should be adjusted for in all main analyses. (Note also that there is no single obvious way to adjust for these covariates. The beliefs about doctors are often discussed in a dichotomous way, e.g., “Underestimating” vs “Overestimating” trust so one could imagine the adjustment being for that dichotomized version additionally or instead. This helps to create many possible specifications, and only one is reported.) … More generally, I would suggest reporting a joint test of all of these covariates being randomized; presumably this retains the null.

This caused the authors to include the pre-registered analyses (which gave similar results) and to note, based on a joint test, that there weren’t “systematic” differences between treatment and control. Still I remained worried that the way they wrote about the differences in covariates between treatment and control invited misplaced skepticism about the randomization:

Nevertheless, we note that three potentially important but not pre-registered variables are not perfectly balanced. Since these three variables are highly predictive of vaccination take-up, not controlling for them could potentially bias the estimation of treatment effects, as is also indicated by the LASSO procedure, which selects these variables among a set of variables that should be controlled for in our estimates.

In my next report, while recommending acceptance, I wrote:

First, what does “not perfectly balanced” mean here? My guess is that all of the variables are not perfectly balanced, as perfect balance would be having identical numbers of subjects with each value in treatment and control, and would typically only be achieved in the blocked/stratified randomization.

Second, in what sense does this “bias the estimation of treatment effects”? On typical theoretical analyses of randomized experiments, as long as we believe randomization occurred as planned, error due to random differences between groups is not bias; it is *variance* and is correctly accounted for in statistical inference.

This is also related to Reviewer 3’s review [who in the first round wrote “There seems to be an error of randomization on key variables”]. I think it is important for the authors to avoid the incorrect interpretation that something went wrong with their randomization. All indications are that it occurred exactly as planned. However, there can be substantial precision gains from adjusting for covariates, so this provides a reason to prefer the covariate-adjusted estimates.

If I was going to write this paragraph, I would say something like: Nevertheless, because the randomization was not stratified (i.e. blocked) on baseline covariates, there are random imbalances in covariates, as expected. Some of the larger differences are variables that were not specified in the pre-registered set of covariates to use for regression adjustment: (stating the covariates, I might suggest reporting standardized differences, not p-values here).

Of course, the paper is the authors’ to write, but I would just advise that unless they have a reason to believe the randomization did not occur as expected (not just that there were random differences in some covariates), they should avoid giving readers this impression.

I hope this wasn’t too much of a pain for the authors, but I think the final version of the paper is much improved in both (a) reporting the pre-registered analyses (as well as a bit of a multiverse analysis) and (b) not giving readers the incorrect impression there is any substantial evidence that something was wrong in the randomization.

So overall this experience helped me fully appreciate the perspective of Stephen Senn and other methodologists in epidemiology, medicine, and public health that reporting these per-covariate tests can lead to confusion and even worse analytical choices. But I think this is still consistent with what I proposed last time.

I wonder what you all think of this example. It’s also an interesting chance to get other perspectives on how this review and revision process unfolded and on my reviews.

P.S. Just to clarify, it will often make sense to prefer analyses of experiments that adjust for covariates to increase precision. I certainly use those analyses in much of my own work. My point here was more that finding noisy differences in covariates between conditions is not a good reason to change the set of adjusted-for variables. And, even if many readers might reasonably ex ante prefer an analysis that adjusts for more covariates, reporting such an analysis and not reporting the pre-registered analysis is likely to trigger some appropriate skepticism from readers. Furthermore, citing very noisy differences in covariates between conditions is liable to confuse readers and make them think something is wrong with the experiment. Of course, if there is strong evidence against randomization having occurred as planned, that’s notable, but simply adjusting for observables is not a good fix.

[This post is by Dean Eckles.]

Why not look at Y?

In some versions of a “design-based” perspective on causal inference, the idea is to focus on how units are assigned to different treatments (i.e. exposures, actions), rather than focusing on a model for the outcomes. We may even want to prohibit loading, looking at, etc., anything about the outcome (Y) until we have settled on an estimator, which is often something simple like a difference-in-means or a weighted difference-in-means.

Taking a design-based perspective on a natural experiment, then, one would think about how Nature (or some other haphazard process) has caused units to be assigned to (or at least nudged, pushed, or encouraged into) treatments. Taking this seriously, identification, estimation, and inference shouldn’t be based on detailed features of the outcome or the researcher’s preference for, e.g., some parametric model for the outcome. (It is worth noting that common approaches to natural experiments, such as regression discontinuity designs, do in fact make central use of quantitative assumptions about the smoothness of the outcome. For a different approach, see this working paper.)

Taking a design-based perspective on an observational study (without a particular, observed source of random selection into treatments), one then considers whether it is plausible that, conditional on some observed covariates X, units are (at least as-if) randomized into treatments. Say, thinking of the Infant Health and Development Program (IHDP) example used in Regression and Other Stories, if we consider infants with identical zip code, sex, age, mother’s education, and birth weight, perhaps these infants are effectively randomized to treatment. We would assess the plausibility of this assumption — and our ability to employ estimators based on it (by, e.g., checking whether we have a large enough sample size and sufficient overlap to match on all these variables exactly) — without considering the outcome.

This general idea is expressed forcefully in Rubin (2008) “For objective causal inference, design trumps analysis”:

“observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data”

Randomized experiments “are automatically designed without access to any outcome data of any kind; again, a feature not entirely distinct from the previous reasons. In this sense, randomized experiments are ‘prospective.’ When implemented according to a proper protocol, there is no way to obtain an answer that systematically favors treatment over control, or vice versa.”

But why exactly? I think there are multiple somewhat distinct ideas here.

(1) If we are trying to think by analogy to a randomized experiment, we should be able to assess the plausibility of our as-if random assumptions (i.e. selection on observables, conditional unconfoundedness, conditional exogeneity). Supposedly our approach is justified by these assumptions, so we shouldn’t sneak in, e.g., parametric assumptions about the outcome.

(2) We want to bind ourselves to an objective approach that doesn’t choose modeling assumptions to get a preferred result. Even if we aren’t trying to do so (as one might in a somewhat adversarial setting, like statisticians doing expert witness work), we know that once we enter the Garden of Forking Paths, we can’t know (or simply model) how we will adjust our analyses based on what we see from some initial results. (And, even if we only end up doing one analysis, frequentist inference needs to account for all the analyses we might have done had we gotten different results.) Perhaps there is really nothing special about causal inference or a design-based perspective here. Rather, we hope that as long as we don’t condition our choice of estimator on Y, we avoid a bunch of generic problems in data analysis and ensure that our statistical inference is straightforward (e.g., we do a z-test and believe in it).

So if (2) is not special to causal inference, then we just have to particularly watch out for (1).

But we often find we can’t match exactly on X. In one simple case, X might include some continuous variables. Also, we might find conditional unconfoundedness more plausible if we have a high-dimensional X, but this typically makes it unrealistic that we’ll find exact matches, even with a giant data set. So typical approaches relax things a bit. We don’t match exactly on all variables individually. We might match only on propensity scores, maybe splitting strata for many-to-many matching until we reach a stratification where there is no detectable imbalance. Or match after some coarsening, which often starts to look like a way to smuggle in outcome-modeling (even if some methodologists don’t want to call it that).

Thus, sometimes — perhaps in the cases where conditional unconfoundedness is most plausible because we can theoretically condition on a high-dimensional X — we could really use some information about which covariates actually matter for the outcome. (This is because we need to deal with having finite, even if big, data.)

One solution is to use some sample splitting (perhaps with quite-specific pre-analysis plans). We could decide (ex ante) to use 1% of the outcome data to do feature selection, using this to prioritize which covariates to match on exactly (or close to it). For example, MALTS uses a split sample to learn a distance metric for subsequent matching. It seems like this can avoid the problems raised by (2). But nonetheless it involves bringing in quantitative information about the outcome.
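As a minimal sketch of the split-sample idea (not MALTS itself, just the simpler version described above, with simulated data): use a small outcome-bearing split only to decide which covariates to prioritize, then do the exact matching and estimation on the rest.

```python
import collections
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 50_000, 20
X = rng.binomial(1, 0.5, size=(n, p))        # binary covariates, so exact matching is feasible
t = rng.binomial(1, 0.5, n)                  # treatment
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + t + rng.normal(0, 1, n)  # only a few covariates matter

# Use a small, pre-specified split of the outcome data for feature prioritization only
X_sel, X_est, t_sel, t_est, y_sel, y_est = train_test_split(
    X, t, y, test_size=0.95, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_sel, y_sel)
top = np.argsort(rf.feature_importances_)[::-1][:3]   # covariates to match on exactly

# Exact matching on the selected covariates in the held-out estimation sample
cells = collections.defaultdict(lambda: {"t": [], "c": []})
for row, ti, yi in zip(X_est[:, top], t_est, y_est):
    cells[tuple(row)]["t" if ti == 1 else "c"].append(yi)

diffs, weights = [], []
for cell in cells.values():
    if cell["t"] and cell["c"]:                       # keep only cells with overlap
        diffs.append(np.mean(cell["t"]) - np.mean(cell["c"]))
        weights.append(len(cell["t"]) + len(cell["c"]))
print("Stratified difference-in-means:", np.average(diffs, weights=weights))
```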

Thus, while I like MALTS-style solutions (and we used MALTS in one of three studies of prosocial incentives in fitness tracking), it does seem like an important departure from a fully design-based “don’t make assumptions about the outcomes” perspective. But perhaps such a perspective is often misplaced in observational studies anyway — if we don’t have knowledge of what specific information was used by decision-makers in selection into treatments. And practically, with finite data, we have to make some kind of bias–variance tradeoff — and looking at Y can help us a bit with that.

[This post is by Dean Eckles.]

I had big plans for that four-fifths of a penny: False precision and fraud

Andrew likes to discourage false precision through reporting too many digits for estimates. I think this is good advice, especially for abstracts, summaries, and the primary outputs of much research.

I thought this was a particularly striking example of meaningless precision from the world of finance:

Specifically, for the AllianzGI Structured Alpha 1000 LLC fund, in one instance, Bond-Nelson materially reduced losses in a stress scenario from -22.0557450847078% to -12.0557450847078%.

My attention was brought to this by Matt Levine writing at Bloomberg, who further comments:

What. If you had invested one trillion dollars in that fund, in the hypothetical stress scenario you would lose $220,557,450,847.078, according to the “real” calculation. “Bummer,” you would say; “I can’t believe I lost two hundred twenty billion, five hundred fifty-seven million, four hundred fifty thousand, eight hundred forty-seven dollars and seven and four-fifths cents. I had big plans for that four-fifths of a penny!” But the fund’s managers nefariously sent you a risk report saying that in that stress scenario you would only lose 12.0557450847078% of your money, and not the more accurate 22.0557450847078% reported by their true stress tests.

This is perhaps the purest case I’ve seen of using extra digits to convey, as Levine says, that one did “Some Real Math” to get the number. It is like the digits are supposed to be a costly, and thus more honest, signal about the calculation.

In some cases, reporting many digits can indeed be a costly signal — in that if they aren’t based on the stated calculations, it may be possible to figure out that they are impossible (e.g., via a granularity-related inconsistency of means, aka GRIM, test). This is perhaps one argument for at least reporting excess digits in tables (though certainly not in abstracts and press releases!). Perhaps this argument is somewhat outdated if data and analysis code are provided in addition to the results in a paper or report itself, though that is still not always the case.
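For what it’s worth, a GRIM-style check is only a few lines: given a sample size and a mean reported to a couple of decimal places, you ask whether any integer total is consistent with it. A minimal sketch (the function name and example values are mine, purely illustrative):

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Could a mean reported to `decimals` places arise from n integer-valued responses?"""
    target = reported_mean * n
    candidates = range(int(target) - 1, int(target) + 2)  # integer totals near the implied sum
    return any(round(total / n, decimals) == round(reported_mean, decimals)
               for total in candidates)

print(grim_consistent(3.48, 25))  # True: 87/25 = 3.48 exactly
print(grim_consistent(3.49, 25))  # False: no integer total over 25 responses rounds to 3.49
```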

Here it also seems that perhaps the evidence that this was just fraud is clearer because some real calculations were done to get all those digits, but then either leading digits were changed or the numbers were just divided by two. Or maybe the fraud would have been clear anyway, but we can definitely imagine how these extra digits can provide some evidence about exactly what happened.

For those following the intersection of error, fraud, and Excel spreadsheets, there are some more details in the complaint about some of the specific labor-intensive (!) steps for how to manipulate a bunch of spreadsheets.

[This post is by Dean Eckles.]

How different are causal estimation and decision-making?

Decision theory plays a prominent role in many texts and courses on theoretical statistics. However, the “decisions” being made are often as simple as using a particular estimator and then producing a point estimate — the point estimate, say, of an average treatment effect (ATE) of some intervention is the “decision”. That is, these decisions are often substantially removed from the kinds of actions that policymakers, managers, doctors, and others actually have to make. (This is a point frequently made in decision theory texts, which often bemoan use of default loss functions; here I think of Berger’s Statistical Decision Theory and Bayesian Analysis and Robert’s The Bayesian Choice.)

These decision-makers are often doing things like allocating units to two or more different treatments: they have to, for a given unit, put them in treatment or control or perhaps one of a much higher-dimensional space of treatments. When possible, there can be substantial benefits from incorporating knowledge of this actual decision problem into data collection and analysis.

In a new review paper, Carlos Fernandez-Loria and Foster Provost explore how this kind of decision-making importantly differs from estimation of causal effects, highlighting that even highly confounded observational data can be useful for learning policies for targeting treatments. Much of the argument is that the objective functions in decision-making are different and this has important consequences (e.g., if a biased estimate still yields the same treat-or-not decision, no loss is incurred). The paper is worth reading, and it points to a bunch of relevant recent — and less recent — literature. (For example, it made me aware that the expression of policy learning as a cost-sensitive classification problem originated with Bianca Zadrozny in her dissertation and some related papers.)

Here I want to spell out related but distinct reasons underlying their contrast between causal estimation and decision-making. These are multiple uses of estimates, bias–variance tradeoffs, and the loss function.

First, a lot of causal inference is done with multiple uses in mind. The same estimates and confidence intervals might be used to test a theory, inform a directly related decision (e.g., whether to expand an experimental program), inform less related decisions (e.g., in a different country, market, etc., a different implementation), and as inputs to a meta-analysis conducted years later. That is, these are often “multipurpose” estimates and models. So sometimes the choice of analysis (and how it is reported, such as reporting point estimates in tables) can be partially justified by the fact that the authors want to make their work reusable in multiple ways. This can also be true in industry — not just academic work — such as when an A/B test is both used to make an immediate decision (launch the treatment or not?) and also informs resource allocation (should we assign more engineers to this general area?).

More explicitly linking this “multipurpose” property to loss functions, meta-analysis (or less formal reviews of the literature) can be one reason to value reporting (perhaps not as the only analysis) estimates that are (nominally) unbiased. Aronow & Middleton (2013) write:

Unbiasedness may not be the statistical property that analysts are most interested in. For example, analysts may choose an estimator with lower root mean squared error (RMSE) over one that is unbiased. However, in the realm of randomized experiments, where many small experiments may be performed over time, unbiasedness is particularly important. Results from unbiased but relatively inefficient estimators may be preferable when researchers seek to aggregate knowledge from many studies, as reported estimates may be systematically biased in one direction.

So the fact that others are going to use the estimates in some not-entirely-anticipated ways can motivate preferring unbiasedness (even at the cost of higher variance and thus higher squared error). [Update: Andrew points out in the comments that, of course, conditional on seeing some results from an experiment, the estimates are not typically unbiased! I think this is true in many settings, though there can be exceptions, such as when all experiments are run through a single common system or process.]

This avoids the bias–variance tradeoff by at least pretending to have a lexicographic preference for one over the other. But as long as our loss function is something like MSE, we will want to use potentially confounded observational data to improve our estimates. In some cases, it might even be inadmissible to neglect such big, “bad” data.

Lastly, as Fernandez-Loria and Provost note, the loss function (or, alternatively, the objective function as they put it) can substantially change a problem. If our only decision is whether to launch the treatment to everyone or not, then error and uncertainty that don’t result in us making the wrong (binary) decision don’t incur any loss. The same is not true for MSE. We make a related point in our paper on learning targeting policies using surrogate outcomes, where we use historical observational data to fit a model that imputes long-term outcomes using only short-run surrogates. If one is trying to impute long-term outcomes to estimate an ATE or conditional ATEs (CATEs for various subgroups), then it is pretty easy for violations of the assumptions to result in error. However, if one is only making binary decisions, then these violations have to be enough to flip some signs before you incur loss. So using a surrogacy model can be justified under weaker assumptions if you are “just” doing causal decision-making. However, when does this make a difference? If many true CATEs are near zero (the decision boundary), then just a little error in estimating them (perhaps due to the surrogacy assumptions being violated) will still result in loss. So how much this difference in loss functions matters may depend on the true distribution of treatment effects.
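Here’s a minimal simulation sketch of that last point (all numbers made up): the same estimator quality, as measured by MSE, translates into very different rates of flipped treat-or-not decisions depending on how much of the true CATE distribution sits near the zero boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
bias, noise_sd = 0.05, 0.05   # hypothetical bias (e.g., from surrogacy violations) and noise

for cate_sd in (0.02, 0.20):  # true effects clustered near the decision boundary vs. spread out
    tau = rng.normal(0.0, cate_sd, n)                     # true CATEs
    tau_hat = tau + bias + rng.normal(0.0, noise_sd, n)   # biased, noisy estimates
    mse = np.mean((tau_hat - tau) ** 2)
    flipped = np.mean((tau_hat > 0) != (tau > 0))         # treat-or-not differs from the oracle decision
    print(f"sd(CATE)={cate_sd:.2f}  MSE={mse:.4f}  share of decisions flipped={flipped:.2f}")
```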

Thus, I agree that causal decision-making is often different than causal estimation and inference. However, some of this is because of particular, contingent choices (e.g., to value unbiasedness above reducing MSE) that make a lot of sense when estimates are reused, but may not make sense in some applied settings. So I perhaps wouldn’t attribute so much of the difference to the often binary or categorical nature of decisions to assign units to treatments, but instead I would pin this to “single-purpose” vs. “multi-purpose” differences between what we typically think of as decision-making and estimation.

I’d be interested to hear from readers about how much this all matches your experiences in academic research or in applied work in industry, government, etc.

[This post is by Dean Eckles.]

Update: I adapted this post into a commentary on Fernandez-Loria and Provost’s paper. Both are now published at INFORMS Journal on Data Science.

Does the “Table 1 fallacy” apply if it is Table S1 instead?

In a randomized experiment (i.e. RCT, A/B test, etc.) units are randomly assigned to treatments (i.e. conditions, variants, etc.). Let’s focus on Bernoulli randomized experiments for now, where each unit is independently assigned to treatment with probability q and to control otherwise.

Thomas Aquinas argued that God’s knowledge of the world upon creation of it is a kind of practical knowledge: knowing something is the case because you made it so. One might think that in randomized experiments we have a kind of practical knowledge: we know that treatment was randomized because we randomized it. But unlike Aquinas’s God, we are not infallible, we often delegate, and often we are in the position of consuming reports on other people’s experiments.

So it is common to perform and report some tests of the null hypothesis that this process did indeed generate the data. For example, one can test that the sample sizes in treatment and control aren’t inconsistent with this. This is common at least in the Internet industry (see, e.g., Kohavi, Tang & Xu on “sample ratio mismatch”), where it is often particularly easy to automate. Perhaps more widespread is testing whether the means of pre-treatment covariates in treatment and control are distinguishable; these are often called balance tests. One can do per-covariate tests, but if there are a lot of covariates then this can generate confusing false positives. So often one might use some test for all the covariates jointly at once.
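One simple version of a joint test (a sketch with simulated data; for clustered or stratified designs you would want to adapt it to the actual randomization): regress the treatment indicator on all the pre-treatment covariates and use the overall likelihood-ratio test.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 5_000, 8
X = rng.normal(size=(n, p))       # pre-treatment covariates (simulated)
t = rng.binomial(1, 0.5, n)       # an assignment that really is Bernoulli(1/2) here

# Logit of assignment on all covariates; the LR test against the intercept-only model
# is a joint test of the null that none of the covariates predict assignment.
fit = sm.Logit(t, sm.add_constant(X)).fit(disp=0)
print(fit.llr_pvalue)   # uniform under true randomization; a tiny value is what would worry me
```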

Some experimentation systems in industry automate various of these tests and, if they reject at, say, p < 0.001, show prominent errors or even watermark results so that they are difficult to share with others without being warned. If we’re good Bayesians, we probably shouldn’t give up on our prior belief that treatment was indeed randomized just because some p-value is less than 0.05. But if we’ve got p < 1e-6, then — for all but the most dogmatic prior beliefs that randomization occurred as planned — we’re going to be doubtful that everything is alright and move to investigate.

In my own digital field and survey experiments, we indeed run these tests. Some of my papers report the results, but I know there’s at least one that doesn’t (though we did the tests) and another where we just state they were all not significant (and this can be verified with the replication materials). My sense is that reporting balance tests of covariate means is becoming even more of a norm in some areas, such as applied microeconomics and related areas. And I think that’s a good thing.

Interestingly, it seems that not everyone feels this way.

In particular, methodologists working in epidemiology, medicine, and public health sometimes refer to a “Table 1 fallacy” and advocate against performing and/or reporting these statistical tests. Sometimes the argument is specifically about clinical trials, but often it is more generally randomized experiments.

Stephen Senn argues in this influential 1994 paper:

Indeed the practice [of statistical testing for baseline balance] can accord neither with the logic of significance tests nor with that of hypothesis tests for the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized, and hence either that the trialist has practised deception and has dishonestly manipulated the allocation or that some incompetence, such as not accounting for all patients, has occurred.

In my opinion this is not the usual reason why such tests are carried out (I believe the reason is to make a statement about the observed allocation itself) and I suspect that the practice has originated through confused and false analogies with significance and hypothesis tests in general.

This highlights precisely where my view diverges: indeed, the reason I think such tests should be performed is that they could lead to the conclusion that “the treatment groups have not been randomized”. I wouldn’t say this always rises to the level of “incompetence” or “deception”, at least in the applications I’m familiar with. (Maybe I’ll write about some of these reasons at another time — some involve interference, some are analogous to differential attrition.)

It seems that experimenters and methodologists in social science and the Internet industry think that broken randomization is more likely, while methodologists mainly working on clinical trials put a very, very small prior probability on such events. Maybe this largely reflects the real probabilities in these areas, for various reasons. If so, part of the disagreement simply comes from cross-disciplinary diffusion of advice and overgeneralization. However, even some of the same researchers are sometimes involved in randomized experiments that aren’t subject to all the same processes as clinical trials.

Even if there is a small prior probability of broken randomization, if it is very easy to test for it, we still should. One nice feature of balance tests compared with other ways of auditing a randomization and data collection process is that they are pretty easy to take in as a reader.

But maybe there are other costs of conducting and reporting balance tests?

Indeed this gets at other reasons some methodologists oppose balance testing. For example, they argue that it fits into an often vague process of choosing estimators in a data-dependent way: researchers run the balance tests and make decisions about how to estimate treatment effects as a result.

This is articulated in a paper in The American Statistician by Mutz, Pemantle & Pham, which highlights how discretion here creates a garden of forking paths. In my interpretation, the most considered and formalized arguments are that conducting balance tests, and then using them to determine which covariates to include in the subsequent analysis of treatment effects in randomized experiments, has bad properties and shouldn’t be done. Here the idea is that when these tests provide some evidence against the null of randomization for some covariate, researchers sometimes then adjust for that covariate (when they wouldn’t have otherwise); and when everything looks balanced, researchers use this as a justification for using simple unadjusted estimators of treatment effects. I agree with this, and typically one should already specify adjusting for relevant pre-treatment covariates in the pre-analysis plan. Including them will increase precision.

I’ve also heard the idea that these balance tests in Table 1 confuse readers, who see a single p < 0.05 — often uncorrected for multiple tests — and get worried that the trial isn’t valid. More generally, we might think that Table 1 of a paper in a widely read medical journal isn’t the right place for such information. This seems right to me. There are important ingredients to good research that don’t need to be presented prominently in a paper, though it is important to provide information about them somewhere readily inspectable in the package for both pre- and post-publication peer review.

In light of all this, here is a proposal:

  1. Papers on randomized experiments should report tests of the null hypothesis that treatment was randomized as specified. These will often include balance tests, but of course there are others.
  2. These tests should follow the maxim “analyze as you randomize”, both accounting for any clustering or blocking/stratification in the randomization and any particularly important subsetting of the data (e.g., removing units without outcome data).
  3. Given a typically high prior belief that randomization occurred as planned, authors, reviewers, and readers should certainly not use p < 0.05 as a decision criterion here.
  4. If there is evidence against randomization, authors should investigate, and may often be able to fully or partially fix the problem long prior to peer review (e.g., by including improperly discarded data) or in the paper (e.g., by identifying that the problem only affected some units’ assignments and bounding the possible bias).
  5. While it makes sense to mention them in the main text, there is typically little reason — if they don’t reject with a tiny p-value — for them to appear in Table 1 or some other prominent position in the main text, particularly of a short article. Rather, they should typically appear in a supplement or appendix — perhaps as Table S1 or Table A1.

This recognizes both the value of checking implications of one of the most important assumptions in randomized experiments and that most of the time this test shouldn’t cause us to update our beliefs about randomization much. I wonder if any of this remains controversial and why.

[This post is by Dean Eckles. This is my first post here. Because this post discusses practices in the Internet industry, I note that my disclosures include related financial interests and that I’ve been involved in designing and building some of those experimentation systems.]