Some experiments are just too noisy to tell us much of anything at all: Political science edition

Sointu Leikas pointed us to this published research article, “Exposure to inequality affects support for redistribution.” Leikas writes that “it seems to be a really apt example of ‘researcher degrees of freedom.’”

Here’s the abstract of the paper:

As the world’s population grows more urban, encounters between members of different socioeconomic groups occur with greater frequency. I provide real-world experimental evidence that exposure to inequality shapes decision-making. By randomly assigning microenvironments of inequality, this study builds on observational research linking the salience of inequality to antisocial behavior, as well as survey experimental evidence connecting perceived inequality to diminished generosity. Specifically, I show that exposure to socioeconomic inequality in an everyday setting negatively affects willingness to publicly support a redistributive economic policy. This study advances our understanding of how environmental factors, such as exposure to racial and economic outgroups, affect human behavior in consequential ways.

I agree with Leikas about researcher degrees of freedom. For example, here’s a bit from the article in question:

Based on column 1, subjects are, on average, 4.4 percentage points (pp) less likely to support the redistributive policy in the presence of a poor person (P < 0.10), pooling across confederate race conditions. The specification estimated in column 2 allows the treatment effect to vary by confederate race. Here, the estimated coefficient on poor actor nearly doubles in magnitude when accounting for the interaction between the poverty and race treatments.

There are a lot of possible analyses here, and the results are noisy enough that I find it inappropriate that the abstract presents claims such as “in a real-world-setting exposure to inequality decreases affluent individuals’ willingness to redistribute. The finding that exposure to inequality begets inequality has fundamental implications . . .” without acknowledging any uncertainty in these claims.

Here’s the relevant table:

The main result is “p less than 0.1.” That’s fine, that’s what happens when you have noisy data; the point is that such a result is consistent with no effect, or a highly variable effect that might be positive in some settings and negative in others. It’s not much evidence of anything at all, and I think it represents a failure of the research process that this result would be treated as strong evidence, and a failure of the reviewing process that this was published as such.
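To make this concrete, here’s a back-of-the-envelope sketch (mine, not the paper’s analysis), assuming the two-sided p-value for the pooled 4.4 percentage-point estimate is right around 0.10; the implied 95% interval comfortably includes zero as well as effects much larger than the point estimate:

```python
# Back-of-the-envelope reconstruction under a normal approximation.
# Assumption: the pooled estimate is -4.4 percentage points with a
# two-sided p-value of roughly 0.10 (the paper only says P < 0.10).
from scipy.stats import norm

estimate = -4.4                          # reported effect, percentage points
p_two_sided = 0.10                       # assumed, not reported exactly

z = norm.ppf(1 - p_two_sided / 2)        # ~1.64
se = abs(estimate) / z                   # implied standard error, ~2.7 pp

ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"implied SE ~ {se:.1f} pp")
print(f"approx. 95% CI: ({ci_low:.1f}, {ci_high:.1f}) pp")
# -> roughly (-9.6, +0.8) pp: consistent with no effect at all and with
#    effects twice the size of the point estimate.
```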

Some more forking paths:

Exposure to white poverty in an affluent setting decreases support for a redistributive policy by 8.2pp (P < 0.05), a substantively and statistically significant decline.

When data are noisy, any difference that’s statistically significant will be substantively significant. So that isn’t telling us anything more than that our data are noisy. And of course given all the many possible comparisons here, one can easily see some p-values less than 0.05 by chance alone, even in the presence of no effects at all.
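As a stylized illustration of that last point (a simulation I wrote for this post, not the paper’s analysis; it assumes 20 independent comparisons, whereas the comparisons in the paper are correlated, so take the exact number with a grain of salt):

```python
# If you examine 20 independent comparisons when every true effect is
# zero, how often does at least one come out "significant" at p < 0.05?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_comparisons, n_per_group = 2000, 20, 100

hits = 0
for _ in range(n_sims):
    pvals = []
    for _ in range(n_comparisons):
        a = rng.normal(size=n_per_group)   # both groups drawn from the
        b = rng.normal(size=n_per_group)   # same distribution: pure null
        pvals.append(ttest_ind(a, b).pvalue)
    hits += min(pvals) < 0.05

print(f"P(at least one p < 0.05 | all nulls true) ~ {hits / n_sims:.2f}")
# -> about 0.64, i.e. 1 - 0.95**20
```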

I’m not saying I think that the effects are zero; rather, I’m saying that these p-values are consistent with null effects. More to the point, these p-values are consistent with effects that are unpredictable, situation-dependent, and not possible to accurately measure using this sort of small experiment.
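Here’s a retrodesign-style simulation in the spirit of type M (magnitude) and type S (sign) errors, using assumed numbers rather than anything estimated from the paper: a modest true effect of 2 percentage points and a standard error of about 2.7 points, roughly what the pooled estimate would imply:

```python
# Type M / type S sketch with assumed numbers: true effect -2 pp,
# standard error 2.7 pp. What do the "significant" estimates look like?
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = -2.0, 2.7              # assumptions for illustration

est = rng.normal(true_effect, se, size=100_000)
sig = np.abs(est) > 1.96 * se            # estimates reaching p < 0.05

power = sig.mean()
exaggeration = np.abs(est[sig]).mean() / abs(true_effect)
wrong_sign = (np.sign(est[sig]) != np.sign(true_effect)).mean()

print(f"power ~ {power:.2f}")                                          # ~0.11
print(f"avg |significant estimate| / |true effect| ~ {exaggeration:.1f}x")  # ~3x
print(f"P(wrong sign | significant) ~ {wrong_sign:.2f}")                # a few percent
# Statistically significant results from a design this noisy overstate
# the effect substantially and occasionally get the sign wrong.
```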

To put it another way, “forking paths” is another way of saying that these experimental data can be analyzed in so many ways that just about any data pattern can be declared a win.

And more:

Meanwhile, subjects are not significantly more or less likely to support reducing the use of plastic bags under any of the conditions, based on columns 3 and 4 of Table 2. These coefficient estimates are relatively small in magnitude and are statistically indistinguishable from zero.

When data are noisy, the fact that a comparison is not statistically significant should not be taken as evidence of a negligible or small effect. Rather, it just tells us that our data are noisy and we can’t say much; the true differences here could be large and just difficult to measure in this experiment.
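A quick illustration with made-up numbers (the excerpt above doesn’t give the standard errors for the plastic-bag outcome): an estimate near zero with a large standard error still leaves plenty of room for sizable effects in either direction.

```python
# Hypothetical numbers for illustration only: a "null" estimate with a
# large standard error is compatible with effects as big as the paper's
# headline result, in either direction.
estimate, se = 1.5, 3.0                           # percentage points, assumed
ci = (estimate - 1.96 * se, estimate + 1.96 * se)
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f}) pp")   # -> (-4.4, +7.4)
```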

And this:

The findings are highly similar; subjects are 8pp less likely to support the redistributive policy in the presence of a poor person, net of baseline response rates (P < 0.05), and this effect is driven exclusively by the poor white condition.

A statement such as “this effect is driven exclusively by the poor white condition” represents a two-way or three-way interaction. Its estimate will be noisy and there’s no reason at all to think that this sort of pattern represents any underlying truth or that it would replicate in a future study. It’s a description of the data, which is fine—but the point of this sort of study is not the data, the point is what we can learn about the real world.
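There’s also a basic design reason to expect interaction estimates to be noisy: in a balanced 2×2 experiment, the interaction (a difference of differences) has twice the standard error of a main effect computed from the same data. A quick simulation check with generic numbers (not the paper’s data):

```python
# In a balanced 2x2 design, the interaction has twice the standard error
# of a main effect. Simulation check with generic numbers.
import numpy as np

rng = np.random.default_rng(2)
n_per_cell, sigma, n_sims = 200, 1.0, 10_000

main_effects, interactions = [], []
for _ in range(n_sims):
    # four cells of the 2x2, all with the same true mean (null case)
    a, b, c, d = (rng.normal(0, sigma, n_per_cell) for _ in range(4))
    # main effect of factor 1: mean of (a, b) minus mean of (c, d)
    main_effects.append((a.mean() + b.mean()) / 2 - (c.mean() + d.mean()) / 2)
    # interaction: difference of differences
    interactions.append((a.mean() - b.mean()) - (c.mean() - d.mean()))

print(f"SE of main effect ~ {np.std(main_effects):.3f}")   # ~ sigma / sqrt(n_per_cell)
print(f"SE of interaction ~ {np.std(interactions):.3f}")   # about twice as large
```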

And this:

The preregistered hypotheses anticipated a positive relationship between exposure to inequality and support for redistribution. However, recent research published on the topic suggests a negative relationship (e.g., refs. 4 and 10). My findings are consistent with the latter.

Whoa! If I’m reading this correctly, the author is saying that the preregistered comparison was (a) in the opposite direction from expected, and (b) not statistically significant (that one-starred p < 0.1, remember?). If your preregistered hypothesis doesn't pan out, and then you have to sift through about 20 comparisons in order to find something at p less than 0.05, isn't that usually when you'd throw in the towel? (Or perhaps do a hierarchical Bayesian analysis, partially pooling the estimates way toward zero, and then report that the experiment was too noisy to tell us much?) But that's not what we get. From the conclusion:

This study uses a randomized placebo-controlled field experiment to establish the causal effect of exposure to inequality on support for redistribution.

That’s right—the study didn’t just “estimate” “a” causal effect; it “established” “the” causal effect. Two big assumptions here: first, that a noisy estimate has established something; second, that whatever is being estimated is “the” causal effect of exposure to inequality, etc.

Later:

Our understanding of the relationship between inequality and redistribution at the individual level is advanced.

That I’ll accept. First, it seems that the preregistered hypothesis was not borne out by the data. So I guess our understanding has advanced in that now we’re less likely to believe that initial hypothesis. Second, one might have thought that a small experiment of this sort would be enough to allow us to estimate the effect of exposure to inequality on support for redistribution. But these data are so noisy: we’ve learned that this experiment was not sufficient to carry the burden that has been placed on it.

P.S. As usual in such cases, I have no problem with these data being published; my problem is that the paper (implicitly endorsed by the journal) makes strong claims that are not supported by the data.

Given the summaries in the paper, the data are consistent with zero effects or, more relevantly, are consistent with effects that are highly variable and unpredictable. The posited interactions are possible and are consistent with the data, but opposite signs of those interactions are also possible and consistent with the data—and in general I think that when data are too noisy to pick up main effects, it will be close to impossible to find interactions. Also, I encourage researchers in this area to follow up with within-person comparisons wherever possible.

These sorts of effects are difficult to study, and I would not necessarily have said ahead of time that this experiment was doomed to be noisy. Or, for that matter, that future experiments in this area are doomed in any way. Who knows—maybe the experiment was a good idea, a priori. But after the noisy data come out, that’s the time to reflect, and consider that this particular measurement idea didn’t happen to work out.

P.P.S. Let me also deal with a couple of issues that sometimes arise:

First, this is not a slam on the people who did this study. We all have some hits and some misses; that’s the way research goes. This particular experiment was a potentially good idea that didn’t work. The problem is not with the study itself but with the way it was mistakenly presented as a success.

Second, why write about this at all? I write about this not because I get any joy in discussing statistics errors—quite the contrary, it makes me sad—but because I think the topics being studied here, including political polarization and attitudes toward redistribution, are important. Getting lost amid patterns in noise isn’t going to help, so sometimes we have to go through the awkward steps of pointing out problems in published work.

There remains a big problem in science in that researchers are encouraged, pressured even, to declare successes from equivocal data. And then you get selection bias: had these results been reported straight (for example, first saying that the preregistered hypothesis did not work out, then following up by analyzing interactions using a multilevel model which would’ve partially pooled everything to near zero, then concluding by saying that the results showed that the data were too noisy to learn anything useful), then I’m pretty sure the paper would not have been published in PNAS, it wouldn’t have received media attention, and it wouldn’t have won two awards. Even in an environment where everyone’s doing their best, it can be the misinterpretations of data that get attention. In this case I think it’s unfortunate that the data came from a randomized experiment, as that can give an inappropriate air of rigor to the resulting claims.
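To make concrete what “partially pooled everything to near zero” would look like, here is a minimal sketch with hypothetical subgroup estimates and standard errors (not the paper’s numbers), shrinking each estimate toward zero under a skeptical normal prior:

```python
# Minimal partial-pooling sketch: hypothetical subgroup estimates (in
# percentage points) and standard errors, NOT the paper's numbers,
# shrunk toward zero under an assumed normal(0, tau^2) prior.
import numpy as np

est = np.array([-8.2, -1.0, 3.5, -4.4])   # hypothetical subgroup effects
se  = np.array([ 3.9,  4.0, 3.8,  2.7])   # hypothetical standard errors
tau = 2.0                                  # assumed prior sd for true effects

shrink = tau**2 / (tau**2 + se**2)         # weight kept by the raw estimate
pooled = shrink * est                      # posterior mean under the prior

for raw, w, post in zip(est, shrink, pooled):
    print(f"raw {raw:+5.1f} pp  (weight {w:.2f})  ->  pooled {post:+5.1f} pp")
# With standard errors this large, even an 8 pp raw difference pools
# back to under 2 pp.
```

In a full Bayesian analysis, tau would itself be estimated from the data, though with a handful of noisy subgroup estimates it would be poorly identified; the fixed value here is only to illustrate the direction and rough magnitude of the shrinkage.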

32 thoughts on “Some experiments are just too noisy to tell us much of anything at all: Political science edition”

  1. I think it may be useful to explicitly consider the class of “appropriately underpowered” studies I discussed briefly here – https://medium.com/@davidmanheim/the-good-the-bad-and-the-appropriately-under-powered-82c335652930 It makes, I think, a closely related point, but suggests that in many domains the correct methodological and substantive approach is to analyse a dataset that we know will not be sufficient to allow the types of conclusions we would like, and at best provide suggestive evidence. The intro (half the post) is reproduced below:

    “Many quantitative studies are good — they employ appropriate methodology, have properly specified, empirically valid hypotheses registered before data collection, then collect sufficient data transparently and appropriately. Others fail at one or more of these hurdles. But a third category also exists; the appropriately under-powered. Despite doing everything else right, many properly posed questions cannot be answered with the potentially available data.

    Two examples will illustrate this point. It is difficult to ensure the safety and efficacy of treatments for sufficiently rare diseases in the typical manner, because the total number of cases can be insufficient for a properly powered clinical trial. Similarly, it is difficult to answer a variety of well-posed, empirical questions in political science, because the number of countries to be used as samples is limited.”

    • David:

      Interesting point. The difference is that, in your case, the data, noisy as they are, are directly relevant to some questions of interest such as the safety or efficacy of a drug. In the paper discussed above, the data have no direct connection to anything, so if these data are too noisy, the whole thing is essentially useless (except for informing us that, in retrospect, this approach to studying the problem was not helpful).

      One thing that I find stunning in the above example is how weak the theory is. The researchers went to the trouble of forming a hypothesis, preregistering it, designing and conducting a whole study—and then when the result from this particular small sample happened to go in the opposite direction, they just abandoned their theory entirely and decided that the exact opposite of their preregistered hypothesis must be true. The whole episode shows a similar lack of conviction in the field of political science, that they would embrace this result and give it awards. What happened to that original theory that was previously believed to be so strong that it motivated the experiment? Now, don’t get me wrong, I’m a big believer in updating my priors given data. But in this case there seems to have been a drastic over-updating, in which the priors were entirely discarded and all the data problems were set aside. This is a recipe for a continuing stream of headlines, overreactions, and reassessments.

      • Interesting point about theory-free science – I certainly agree. In many ways, it’s a simple problem – people think statistics are science, instead of understanding the fundamental relationship between hypotheses, data, and how to update your opinions about the former using the latter. In addition to the old standby claim that scientists shouldn’t be allowed to do scientific studies without consulting a competent statistician, perhaps we need to add that statisticians shouldn’t be allowed to do statistics without consulting a competent scientist. (In this case, competent means understanding some very basic principles of how science is supposed to work. You’d think this gets covered in high school, but from personal experience it doesn’t get covered clearly enough, and it doesn’t sink in.)

        • David:

          As far as I know, no statisticians were involved in the project discussed above. It was all political scientists, I think. My guess is that what happened was a mixture of overcommitment and group-think. The overcommitment came when somebody, somewhere, decided that this was a great idea for an experiment. The group-think came when various people signed off on the results because other people signed off on them. The paper appeared in PNAS, which was taken as evidence for giving it awards, and in turn it’s hard for people in the loop to imagine that the paper is fatally flawed. At this point it takes a big jump for the insiders to realize that they’d made a mistake. That’s why, in my comment to Matt Blackwell elsewhere in this thread, I attempted to shake him out of his assumptions by imagining that the paper had not been published but instead was an Arxiv preprint by someone he’d never heard of. My aim was to remove the sunk costs and groupthink and give him a chance to see the paper afresh.

        • P.S. The other problem was the causal identification assured by the randomized design. A lot of social scientists seem to turn off their critical thinking when a randomized experiment is involved. They forget about questions of validity (are you measuring something of interest) and reliability (are your measurements stable enough to allow you to learn anything generalizable from your sample).

        • “A lot of social scientists seem to turn off their critical thinking when a randomized experiment is involved. ”

          +1

        • These controversies in statistics, along with several other controversies, should be covered in high school. They develop conceptual and practical reasoning. But then the question is who is qualified to present these controversies in a cogent and systematic manner. It would take someone with the intellectual ability of, say, the late Jerome Bruner, although Bruner was not a statistician. I would be interested in exploring the statistics curricula used in high schools, if statistics is even taught there. Some wealthier communities may include it.

          Look, for example, at what’s come of the Common Core curricula. I have not heard one word about it, given the debates surrounding its efficacy.

          In reading biographies, I think the quality of intellectual engagement during childhood is decisive in determining who is going to improve upon a particular field or issue. This is JUST an OPINION.

    • David, it’s not simply the limited number of countries used as samples but also the limited range of expertise that has traditionally been brought to bear on the issue. For the most part they are pretty much opinions based on other opinions. Then again I’m not sure specifically which issue you had in mind.

  2. Here’s an idea for a new topic: whenever a new stats textbook, or new edition of a stats textbook, is published, assess how well, or if at all, it covers all the issues covered in this blog: noisy data, failures to replicate, interpretations of p-values, and so forth. Does it discuss the work of Cohen, Meehl, Gigerenzer, Simonsohn, Ioannidis, etc.? Last time I did a quick search on Amazon, I couldn’t find one that did. Most students, and probably their advisors too, learn stats from textbooks. If those aren’t fixed, it’s gonna take a long time to change things.

  3. I have before me:

    1. Introduction to the New Statistics by Geoff Cumming

    2. Beyond Significance Testing by Rex Kline

    Each covers all of them, in different degrees.

    Beyond Significance Testing refers to Meehl and Gigerenzer, but not to Simonsohn and Ioannidis; the latter two’s seminal viewpoints came after 2002. Kline has a very good chapter on what is wrong with significance tests, and an entire chapter on replication, etc.

  4. Psychology is well on its way to reverting to a pseudoscience just like in the good old days of Freud.

    Publish research which supports the politically-desired narrative and you get a pass on such plebeian concerns as correct statistics and inference. You get to publish in PNAS or Psych Science, and be heralded for your scientific expertise and accomplishment. Or even forget about statistics, altogether. There wasn’t a statistically significant difference between white men and white women participants yet the discussion focused on white men. And forget about a detailed discussion, with caveats, alternative explanations, etc. Your data supports the narrative (or at least, your Procrustean bed can make it do so) and that’s all you need.

    Now in the real world, the prior that someone changes his views on income redistribution (or more specifically, a more progressive income tax scheme) based on an unknown person he passed on the street thirty seconds ago is zero, or essentially zero, making all the conclusions from this study about the connection between “exposure to inequality” and “support for income redistribution” quite ridiculous. True, it may have a short-term effect on his motivation to sign a petition supporting a particular policy position, for a variety of possible reasons. That’s the only real conclusion that can be drawn. Who knows? Maybe passing a rich person in the street makes people envious, and provokes in them the short-term desire to punish the rich. That’s a much more plausible explanation than the one advanced by the author.

    And when this study (inevitably) fails to replicate, the author will no doubt be able to claim that the conditions weren’t replicated exactly; no doubt, the replication attempt will have taken place somewhere other than a Boston suburb, and the particular political measure will have been different as well. Ain’t it grand, getting to have your cake and eat it too? You get to claim generalizability from a specific sample, yet dismiss a failure to replicate on the grounds that the replication sample is different.

    • Vince:

      Don’t blame psychology for this one. It’s a political science paper, and it won two awards in political science.

      Also, I don’t think your last paragraph is appropriate. Not fair to blame people for something they haven’t actually done! I think it better to just say that their conclusions went beyond their data, and it’s unfortunate that the various people involved, including journal editors and award committees, didn’t catch these problems.

  5. Right, the preregistration thing is pretty brutal. Not to mention, one study cited for reversing the hypothesis is the “Air Rage” paper, and the other is also something out of PNAS.

  6. Matt Blackwell tweets:

    I don’t understand the criticism here. The experiment is a 2×2 design and the models here are diff-in-means. This is standard experimental design.

    It’s one thing if you think these results are just noise, but it’s a completely different thing to imply that the author took “forking paths” when they analyzed their experiment in a completely standard way.

    • Anon:

      There are two aspects of the analysis:

      1. A preregistered analysis with no forking paths that I noticed. The result of this preregistered analysis was in the wrong direction and not statistically significant at the conventional level, hence would typically be reported as a failure. That’s kind of the point of preregistration, to lay out your hypothesis ahead of time. Given that the observed result went in the wrong direction but was still labeled as a success, that’s a forking path right there. It’s pretty clear that had the result gone in the expected direction, that would’ve been labeled as a success; indeed it would’ve looked like even stronger evidence, as there would’ve been no need to come up with a new theory to overrule the theory involved in the preregistration.

      2. A bunch of interactions and other analyses, some of which had p less than 0.05. There were many many possible things to look at here, hence forking paths.

      • I suppose we differ about how seriously to take p-value cutoffs and preregistration hypotheses. The baseline effect does have a high p-value, yes, and it differs from the preregistration. But this seems like a call for a follow-up study and a replication, not speculation about whether or not this study is finding noise.

        On the preregistration specifically, I don’t want to punish a researcher for making a directional hypothesis when theoretical expectations are murky, at best. If they had simply set a two-sided alternative, then this wouldn’t be considered a failure at all. In general, I think we should probably refrain from making directional hypotheses in preregs unless there is some clear policy reason to do so, as in “we should only allow this drug to be implemented if it has a positive treatment effect on health.” Otherwise, the goal of these experiments is to learn about how people react to situations, and however the evidence comes out, it helps us to further understand the mechanisms involved, if any.

        The interactions between race and poverty were preregistered, but were found to be insignificant. One other set of interactions was included and seemed exploratory to me. None of those analyses are definitive, but I also don’t think they undercut what we learn from the baseline results of the poverty manipulation.

        • Matt:

          1. Anything can be a call for a follow-up study and a replication! But I don’t see evidence that these data are distinguishable from noise. So I recommend that anyone doing a follow-up study think much more seriously about measurement, rather than just setting up a street scene and hoping to get good results.

          2. You talk about “speculation about whether or not this study is finding noise.” I don’t think that’s a useful way to frame things. What we have are some noisy measurements. Any set of data, if analyzed enough, will show patterns. There’s no speculation on my part; the speculation is all in the linked article. All I’m saying is that I don’t find the speculation at all convincing.

          3. Nobody’s talking about “punishing a researcher.” I think it was a mistake for this paper to have been published with such strong statements, and I think it was a mistake for a committee of political scientists to give the paper an award. But not publishing a paper as is, or not giving a paper an award, is not a punishment!

          One helpful heuristic, I think, is to imagine that this paper was not yet published and so was not endowed with that air of authority, but instead was, say, a preprint on Arxiv. If someone shows me an Arxiv preprint of a political science experiment, and the data are really noisy, with results all over the map, my reaction would be to say that it didn’t work out. If anyone wants to replicate the study in this preprint, they can go for it, but I’d recommend rethinking. I certainly wouldn’t recommend that the study get uncritical media coverage.

          I’m supportive of using experiments to learn about the world. It just turns out that this particular experiment gave noisy results, so I think it was an interesting idea that didn’t work out.

        • Matt:

          Put all the lipstick you want on this pig, but it’s still a pig.

          1. If “theoretical expectations are murky, at best” there is little or no prior knowledge, and there is every reason to suspect the (future) data will contain nothing but noise. Failure to achieve significance for the baseline effect maintains the suspicion. Yes, there should definitely be an attempted replication study before it can be held as “shown” that “exposure to inequality negatively impacts support for income redistribution”. That is Andrew’s point (I think).

          2. The “two-sided alternative” would also have been considered a failure, not having met the magical p < 0.05, a threshold which can mysteriously be relaxed (apparently) if it suits the ideological predilections of the editor and/or the reviewers, but is clear and obvious grounds for rejecting the paper otherwise, based on "objective" scientific criteria. And I disagree with your main point anyway. Except for very rare cases, for the authors not to have a direction for their hypothesis is to admit their experiment is a fishing expedition, for hypotheses should be based on prior knowledge, which will point (in the vast majority of cases) to a specific direction.

          3. What exactly did we learn from the "baseline results of the poverty manipulation"? How on earth do you generalize from this specific population, and this specific manipulation? How do you exclude the alternative explanation that exposure to a well-off individual made the study participants temporarily envious, producing in them the short-term desire to punish the rich?

  7. Andrew:

    What you call the “abstract” of the paper is not actually labeled as the abstract in the paper — it is labeled “Significance”, and precedes the “Abstract”, which reads,

    “The distribution of wealth in the United States and countries around the world is highly skewed. How does visible economic inequality affect well-off individuals’ support for redistribution? Using a placebo-controlled field experiment, I randomize the presence of poverty-stricken people in public spaces frequented by the affluent. Passersby were asked to sign a petition calling for greater redistribution through a “millionaire’s tax.” Results from 2,591 solicitations show that in a real-world-setting exposure to inequality decreases affluent individuals’ willingness to redistribute. The finding that exposure to inequality begets inequality has fundamental implications for policymakers and informs our understanding of the effects of poverty, inequality, and economic segregation. Confederate race and socioeconomic status, both of which were randomized, are shown to interact such that treatment effects vary according to the race, as well as gender, of the subject.”

    • Since later you write,

      “I find it inappropriate that the abstract presents claims such as ‘in a real-world-setting exposure to inequality decreases affluent individuals’ willingness to redistribute. The finding that exposure to inequality begets inequality has fundamental implications . . .’”, which does refer to a phrase in the abstract, I am guessing that the error pointed out above is just a copying error.

  8. Andrew:

    I encourage you to take a look at Enos (2014), also published in PNAS. The design in that article appears to have inspired this piece, and the estimates are just as noisy. Although I appreciate the focus on mundane realism, I worry that the high cost of implementing these designs encourages fishing expeditions. Hell, I’d be upset if I devoted days/weeks of my life recruiting and training actors, only to find null effects with large confidence intervals.

    • Andrew, I assume you’ve already seen Enos (2014), but if you haven’t I would also encourage you to check it out. Note though that Peter mischaracterizes the results. The topline findings in Table 1 are not what most people would describe as “noisy”. The confidence intervals in Figure 2 do overlap and I describe this as such in the article. Peter, note that the confederates in Enos (2014) were not actors, but were participants in the double-blind trial, so they weren’t trained at all. But, yes, that experiment did take weeks of my life; however, given how much fishing seems to occur even with data that people simply download from the internet, I’m not sure that time devoted to research is a good predictor of fishing (and none occurred in the production of that article).

      Peter, Andrew and I had an email exchange about the Sands piece a few months ago when somebody emailed it to him. I told him, essentially, what Matt Blackwell wrote above. If you asked Melissa, the author, I’m sure she’d also agree that it should be replicated. In my humble opinion, Andrew is walking a bit of a thin line in his criticism because he’s not saying it shouldn’t be published, nor that the p-values obtained are invalid because of fishing, but simply that he would use different language to describe the results. This is a rather subjective critique because we all have different priors that will lead to different interpretations of the results. In fairness, yes, the author’s interpretations will probably be biased toward seeing something more concrete in the results than will the interpretation of somebody else. Of course, a reader of this blog, after Andrew publishes about an article, will probably tilt toward seeing nothing in the results. But this is the reason research in social science is never closed with a single article and we need to replicate research (including my own).

      The place I’d draw issue with Andrew’s post, as did Matt, is with the characterization of this as a case of “forking paths”. If choosing a certain interpretation of the pre-registered design is a “forking path”, then nearly everything is a “forking path”. I actually think that Andrew’s introduction of that term has been very valuable to social science researchers, so I’d encourage him not to apply it so widely that it loses meaning.

      The other place I’d draw issue is with saying the results are “situation dependent” as a criticism and a pathology of noisy data. It’s not clear to me why smaller p-values would make a relationship not situation dependent. We can’t know this unless we’ve observed the same relationship in many situations. Moreover, situation-dependence is not a bug of research but a fact of the social world. Researchers shouldn’t claim their effects are generalizable without clear evidence, but we also shouldn’t use situation dependence as a criticism of research. (Of course, the difficulty with this philosophy is that if research doesn’t replicate, it is not clear how to distinguish flimsy results obtained from p-hacking from situation dependence, but that’s why it is important that research be well-powered and pre-registered. It’s also why we should seek better theory about how behavior is situation dependent, so that we can systematically predict how behavior will change across situations, rather than using “situation dependence” as a generic critique.)

      • Ryan:

        I think the problem is that the data are just too noisy to learn anything from this experiment with its between-person design. There are lots of indications here: one indication is that the main comparison for which the study was designed was consistent with a zero effect; another indication is many interactions were examined and not much showed up there either; a third indication is that the main comparison, to the extent it was not simply noise, went in the opposite direction as expected and then the theory was changed to make this the desired result. Put it together and this looks to me like hopelessly noisy data plus a flexible theory that could be used to explain anything. Again, performing a replication would be fine, but I’d recommend thinking much more carefully about measurement before doing so, to avoid an infinite loop of noise chasing.

        • Andrew, yes, I’d agree that thinking carefully about measurement and theory is important with replication. If the results were null or really noisy, we shouldn’t update the theory again to fit those results or it really would just be noise chasing, so having clear expectations ex ante is important.

          Flexible theory, of course, is a big problem in much social science. The advantage this research had, at least, was that it was pre-registered so we know that the theory was flexibly applied. We can then all update appropriately. For what it’s worth, after I saw these results, they made sense to me in light of other experiments on the subject – but, yes, that should be taken with a grain of salt because my naive expectation was that the results would be in the other direction. More research is needed…

          By the way, I like the term “noise chasing” – it captures an unfortunate amount of social science research.

          Hey, also, since Peter brought up my 2014 experiment, a group of researchers actually is trying to replicate it in Germany. I’m excited to see the results and it will be interesting to update our overall understanding of intergroup contact when those come in.

      • Ryan:

        I’ll try to say this as politely as I can, but the difference between science and pseudoscience is for the researcher, at least to the extent humanly possible, to be aware of his own biases and how they can color his interpretation of the data, and to properly account for them. There is more to this than just computing p-values.

        And both you and Sands make conclusions unwarranted by the data. As you say, “Researchers shouldn’t claim their effects are generalizable without clear evidence.” Evidence, not prior bias (which means getting representative samples). It’s up to you whether you want to admit this in the name of doing better science, or dig in your heels and take a defensive posture. If it matters, I’ve made plenty of mistakes too. We all do.

        I read your 2014 paper. You do a one-tailed test and don’t correct for multiple comparisons in Table 1. If that’s what it takes to achieve significance, then, yes, I think most people, including myself, would describe the findings as “noisy”. (Granted, two-tailed Bonferroni-corrected significance would be barely obtained for the number-of-immigrants question.) Anyway, the proper conclusion is that the results are consistent with a short-term effect of outgroup exposure on exclusionary attitudes, but future research will be necessary to see if the effect generalizes to different populations, and whether the effect lasts longer than a few days (it may well be the case that long-term exposure to outgroups in fact lessens exclusionary attitudes). The data don’t “support” or “prove” or “show” that outgroup exposure affects exclusionary attitudes.
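        To spell out the arithmetic of the adjustments being described here (with hypothetical p-values, not the ones in Table 1):

```python
# Hypothetical one-tailed p-values, one per survey outcome: converting
# to two-tailed and applying a Bonferroni correction for the number of
# outcomes examined.
one_tailed_p = [0.04, 0.01, 0.03]           # made-up values
m = len(one_tailed_p)                        # number of comparisons

for p1 in one_tailed_p:
    p2 = min(2 * p1, 1.0)                    # two-tailed p (symmetric test)
    p2_bonf = min(p2 * m, 1.0)               # Bonferroni-adjusted
    print(f"one-tailed {p1:.2f} -> two-tailed {p2:.2f} -> corrected {p2_bonf:.2f}")
# A one-tailed p of 0.04 becomes 0.08 two-tailed and 0.24 after
# correcting for three comparisons.
```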

        I’d also like to point out that the question in “experiment” doesn’t match the question in “results”. The question in “experiment” is “Would you favor allowing persons that have immigrated to the United States illegally to remain in the country if they are employed and have no criminal history?” whereas it is described later as “Children of undocumented being allowed to stay?” Perhaps this was a typo, but what was the actual question?

        And again, regarding the Sands paper: It is not merely a question of using “different language” to describe the results, but of making conclusions and generalizations unwarranted by the data. Maybe it is technically not a forking path, but it is definitely HARKing (we got a result in the opposite direction from what we expected, so we’ll change the hypothesis to match.) The paper definitely does not “establish the causal effect of exposure to inequality on support for redistribution” with a result that doesn’t even meet standard p < 0.05 significance.

        I agree that p-hacking and situation dependence are separate issues and should be addressed separately, and situation dependence would still be an issue even if, beginning tomorrow, no researcher p-hacked ever again. But it is still true that both of you used the population in suburban Boston because it was convenient. You don't really know whether your results will generalize to, say, the Midwest, or Europe. All you have is the circular argument that your experiment "proves" the phenomenon you claim it does and that, therefore, your results should generalize.

        • Vince, thank you for your interest in all of this. You may be surprised to learn that I agree with nearly everything you say and I invite you to send me an email if you’d like to discuss further. A couple of points just to clear up though: 1) speaking only for myself, I didn’t choose Boston suburbs purely out of convenience, but because of some components of the design that provided a limited number of places where the experiment could be conducted. 2) I actually don’t know if the results would be the same in the Midwest or Europe and I haven’t ever claimed to know this, so please don’t ascribe a “circular argument” to me that I have not made. I have some thoughts on it because how context affects behavior is something I have studied extensively. I do have some clear predictions about where the same results may or may not be likely to obtain and I’ve written about this elsewhere – my guess is that because Boston has low baseline levels of exclusionary attitudes and is highly segregated, the results are stronger in Boston than in a lot of other places. But, of course, I don’t know for sure. As I noted to Andrew above, some other researchers are trying to replicate the experiment in Germany, so maybe we’ll have a better sense of this in the future.
