Their findings don’t replicate, but they refuse to admit they might’ve messed up. (We’ve seen this movie before.)

Posted on August 20, 2020 9:15 AM by Andrew

Ricardo Vieira writes:

I have been reading the replication efforts by the datacolada team (in particular Leif Nelson and Joe Simmons). You have already mentioned some of their work here and here.

They have just published installment #7 of the series, and I felt it was a good time to summarize the results for myself, especially to test whether my gut feelings were actually in line with reality. I thought you and your readers might like to discuss it.

Most of these articles describe a spectacular failure to replicate the original effects (5-6 out of 7 studies), even with much larger sample sizes. However, what I find most interesting are not the results, but the replies from the original authors when asked to comment on the replication results.

The most obvious point is that nobody seems willing to say: ‘the proposed effect is probably not real’, even in the face of some overwhelming evidence from the replication studies.

Although most authors seem to agree (in a very pondered manner) with the replication results, a substantial portion (4 out of 7) emphasize that they are happy/glad to see a trend (or subtrend) in the same direction as the original study, even when this is clearly not significant (either statistically or practically). A partially overlapping portion (4 out of 7) state a strong belief that their effect is real.

I think the public nature of this work, and the fact that it is released slowly, pose an interesting dilemma for the authors of future failed replications. Will they start to acknowledge the strong possibility that their effects are probably as illusory as the n-1 (give or take) that came before? Or will they increasingly believe that they are the exception to the norm? A third hypothesis is that involved authors will interpret the past collection of replications as generally more positive than I or other people whose work is not on the line do.

A fourth hypothesis is that the results are actually much more positive than I think they are. To keep myself in check, I compiled a short summary of the replication results and authors’ replies that hopefully is not too detached from the source (I welcome any corrections to my classification). Here it is:

#1
Effect: “Consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking”
Replication: Failed
Reply: Accepts results
Explanation: Things have changed
Link: https://datacolada.org/82

#2
Effect: “Independent consumers who choose on behalf of large groups tend to choose more selfishly, whereas everyone else tends to choose less selfishly”
Replication: Succeeded but then failed (after removing typo)
Reply: Effect is real. Happy for directional trend
Explanation: Study conducted in a different time of the year
Link: https://datacolada.org/83

#3
Effect: “Consumers with ‘low self-concept clarity’ are more motivated to keep their identities stable by (1) retaining products that are relevant to their identities, and (2) choosing not to acquire new products that are relevant to their identities”
Replication: Failed
Reply: Happy for directional trend
Explanation: Treatment may have been less effective. Decline in Mturk quality
Link: https://datacolada.org/84

#4
Effect: “Consumers induced to feel curious are more likely to choose the indulgent options (gym)”
Replication: Succeeded but then failed (after removing confound)
Reply: Effect is real. Intrigued by suggested confound
Link: https://datacolada.org/85

#5
Effect: “Consumers are more likely to use a holistic process (vs. an attribute-by-attribute comparison) to judge anthropomorphized products”
Replication: Failed
Reply: Happy for directional trend
Explanation: Things have changed. Decline in Mturk quality
Link: https://datacolada.org/87

#6
Effect: “Scarcity decreases consumers’ tendency to use price to judge product quality”
Replication: Inconclusive (differential attrition)
Reply: Effect is real. No problem in original data
Link: https://datacolada.org/89

#7
Effect: “Presenting multiple product replicates as a group (vs. presenting a single item) increases product efficacy perceptions because it leads consumers to perceive products as more homogenous and unified around a shared goal”
Replication: Failed
Reply: Effect is real. Happy for directional trend
Explanation: Things have changed
Link: https://datacolada.org/90

Along similar lines, Fritz Strack writes:

Attached is a recent paper by Fabrigar, Wegener & Petty (2020) that discusses the “replication crisis” in psychology within the framework of different types of validity (Cook & Campbell, 1979). As it is very critical of the current movement focussing myopically on the statistical variant, I thought you might be interested in commenting on this publication.

My reaction to all this:

1. I’m impressed at how much effort Nelson and Simmons put into each of these replications. They didn’t just push buttons; they looked at each study in detail. This is a lot of work, and they deserve credit for it.

Some leaders of the academic establishment have said that people who do original studies deserve more credit than mere critics, as an original study requires creativity and can advance science, whereas a criticism is at best a footnote on existing work. But I disagree with that stance. Or, I should say, it depends on the original study and it depends on the criticism. Some original studies do advance science, while others are empty cargo-cult exercises that at best waste people’s time and at worst can send entire subfields into blind alleys, as well as burning up millions of dollars and promoting a sort of quick-fix Ted-talk thinking that can distract from real efforts to solve important problems. From the other direction, some critical work is thoughtless formal replication that sidesteps the scientific questions at hand, but others—such as those of Nelson and Simmons linked above—are deeply engaged.

Remember Jordan Anaya’s statement, “I know Wansink’s work better than he does, it’s depressing really.” That’s a not uncommon experience we have when doing science criticism and replication: The original study “worked,” so nobody looked at it very carefully. It’s stunning how many mistakes and un-thought-through decisions can be sitting in published papers. I know—it’s true of some of my published work too.

In the case of the datacolada posts, I’ve not read most of the original articles or the blogs, so I’ll refrain from commenting on the details. But just in general terms, I’ve seen lots of examples where a scientific criticism has more value than the work being criticized.

Sometimes. Not always. And often it’s debatable. For example, is Alexey Guzey’s criticism of Why We Sleep more valuable than Matthew Walker’s book? I don’t know. I really don’t. Yes, Walker makes errors and misrepresents data, and Guzey is contributing a lot by tracking down these details. Any future researcher wanting to follow up on Walker’s work should definitely read Guzey before going on, just to get a sense of what the evidence really is. On the other hand, Walker put together lots of things in one place, and, even though his book is fatally flawed, it arguably is still making an important contribution. Sleep (unlike beauty-and-sex ratio, ovulation and voting, embodied cognition, himmicanes, etc.) is an important topic, and even though Why We Sleep misfires on many occasions, it may be making a real contribution to our understanding.

Anyway, I don’t know that the datacolada work will get “enough credit,” whatever that means, but in any case I appreciate it, and I say that even though I have at times expressed annoyance at their blogging style.

2. The big thing is that I agree with Vieira. At the very least, researchers should admit the possibility that they might have been mistaken in their interpretation of earlier results.

Look at it this way. Sometimes—many times—researchers go into a project strongly believing that their substantive hypothesis is true. In that case, fine, do a small between-person study and it’s very unlikely that the results will actually contradict your hypothesis. In that case, the mistake in the original paper is subtle, it’s the claim of strong evidence when there is no strong evidence. Then when the replication finds no strong evidence, the researchers remain where they started, believing in their original hypothesis. It’s hard for them to pinpoint what they did wrong, because they haven’t been thinking about the distinction between evidence and truth. From their point of view, they’ve broken some arbitrary rule—they’ve “p-hacked,” which is about as silly as the other arbitrary rule of “p less than 0.05” that they had to follow earlier. They see methodologists as like cops (or as our nudgelords would say, Stasi) and they care less about silly statistical rules and more about real science.

Other times researchers are surprised by their results. The data go against the researchers’ initial hypothesis. In this case, learning that the data analysis was flawed and the result doesn’t replicate should cause a rethink. But typically it doesn’t, I think because researchers are all too good at taking an unexpected result and convincing themselves that it makes perfect sense.

This is a particularly insidious chain of reasoning: Data purportedly provide evidence supporting a scientific theory. But the finding doesn’t replicate and the data analysis was flawed. No worries: at this point the theory has already been established as truth. “You have no choice but to accept,” etc.

The theory has climbed a ladder into acceptance. The ladder’s kicked away, but the theory’s still there.

3. The Fabrigar et al. paper seems fine in a general sense, but I don’t think they wrestle enough with the idea that effects and comparisons are much smaller and less consistent than traditionally imagined by many social researchers. To bring up some old examples, it’s a mistake to come into the analysis of an experiment with the expectation that women are 20 percentage points more likely to vote for Barack Obama during a certain time of the month, or that a small intervention on four-year-olds will increase later adult income by 40%. Statistics-based science is quantitative. Effect sizes matter.
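To put rough numbers on why effect sizes matter, here is a minimal sketch of the arithmetic, assuming a hypothetical two-group study with 100 people per group and a baseline rate near 50%. These numbers are illustrative assumptions, not taken from any of the studies discussed above. The sketch computes the standard error of a difference in proportions and the smallest observed difference that would clear the conventional 0.05 threshold.

```python
import math


def se_diff_proportions(p1: float, p2: float, n1: int, n2: int) -> float:
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical design (an assumption for illustration): 100 people per group,
# with both groups near a 50% baseline rate.
n_per_group = 100
se = se_diff_proportions(0.5, 0.5, n_per_group, n_per_group)

# Smallest observed difference that clears the two-sided 0.05 threshold (|z| > 1.96).
min_significant_diff = 1.96 * se

print(f"standard error of the difference: {100 * se:.1f} percentage points")
print(f"smallest 'significant' difference: {100 * min_significant_diff:.1f} percentage points")
```

Under these assumptions the standard error is about 7 percentage points, so only an observed difference of roughly 14 points or more would be declared statistically significant. If the true effect is a point or two, any significant estimate from such a study has to overstate it by a wide margin.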

30 thoughts on “Their findings don’t replicate, but they refuse to admit they might’ve messed up. (We’ve seen this movie before.)”

econotarian on August 20, 2020 at 1:20 pm said:

Replication #6 on scarcity and product quality judgments is an interesting case. The datacolada folks find a statistically significant effect in the replication study, and then do a bunch of non pre-registered ex post analyses in order to turn that statistically significant effect into a statistically nonsignificant one.

This is contrary to their own recommendations to stick only to pre-registered analyses, and seems like upward p-hacking to get p above .05 and to declare a “failed” replication.

econotarian on August 20, 2020 at 1:25 pm said:

To be clear, I am all for looking at the data and learning from it. As Andrew wrote in his post about Guzey
https://statmodeling.stat.columbia.edu/2020/05/26/alexey-guzeys-sleep-deprivation-self-experiment/

> the goal should be to learn, not to test hypotheses, and the false positive probability has nothing to do with anything relevant.

It is just that doing this is contrary to the datacolada team’s own data analysis recommendations, so it is surprising to see them engaging in these practices. Not a good look, especially since their motivation is for failed replications.

- Andrew on August 20, 2020 at 2:54 pm said:
  
  Econ:
  
  It’s too bad that datacolada doesn’t have a comment section and that the datacolada bloggers don’t engage with outside input. I can understand—maybe they don’t want to spend too much of their time on blogging issues—but this has happened before, that they make a controversial statement on their blog and then don’t engage with criticism. It’s frustrating. And I say this recognizing that this approach—to make strong statements and then not engage directly with criticism—is standard practice in academia.
  
- econotarian on August 20, 2020 at 3:22 pm said:
  
  Yes, Andrew, standard practice and not just for the data colada group–a very unfortunate situation.
  
  Another thing about the data colada motivation for failed replications. The studies they focus on seem to come from articles with multiple studies of the same underlying idea. It would seem the right way to assess replicability in this context would be to pool all of the original studies as well as the data colada replication studies and other replication studies.
  
  Taking only a single study from a multi-study article and comparing it to only a single replication study as they do seems like a weak way to assess replication that is biased in favor of “failure.”
  
  Also, “the difference between ‘statistically significant’ and ‘not statistically significant’ is not itself statistically significant” (Gelman and Stern, 2006). That is, a statistically significant study and a statistically nonsignificant study are often not in conflict (as claimed in the data colada replications) in the sense that the difference between the two studies will often itself have a high p-value. I wonder how often this is the case with the difference between the original study and the data colada replication study. Ignoring this seems like another way they bias things in the direction of “failure.”
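The comparison described in this comment can be sketched in a few lines. This is a minimal illustration, assuming two independent estimates with made-up values (an "original" estimate of 25 and a "replication" estimate of 10, each with standard error 10). The numbers are hypothetical and do not come from any Data Colada replication.

```python
import math


def z_score(estimate: float, std_error: float) -> float:
    """z-score of a single estimate against a null value of zero."""
    return estimate / std_error


def z_for_difference(est_a: float, se_a: float, est_b: float, se_b: float) -> float:
    """z-score for the difference between two independent estimates."""
    return (est_a - est_b) / math.sqrt(se_a**2 + se_b**2)


# Hypothetical values chosen only for illustration.
z_original = z_score(25, 10)                # clears the usual 1.96 cutoff
z_replication = z_score(10, 10)             # does not clear the cutoff
z_diff = z_for_difference(25, 10, 10, 10)   # difference of 15 with SE of about 14.1

for label, z in [("original", z_original),
                 ("replication", z_replication),
                 ("difference", z_diff)]:
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{label:12s} z = {z:.2f} ({verdict} at the 0.05 level)")
```

With these made-up inputs, one estimate is "significant" and the other is not, yet the difference between them has a z-score of about 1.06, well short of the threshold. Whether that pattern holds for a given original study and its replication depends on the actual estimates and standard errors.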
  
  - John Bullock on August 22, 2020 at 10:03 am said:
    
    The studies they [Data Colada] focus on seem to come from articles with multiple studies of the same underlying idea. It would seem the right way to assess replicability in this context would be to pool all of the original studies as well as the data colada replication studies and other replication studies. Taking only a single study from a multi-study article and comparing it to only a single replication study as they do seems like a weak way to assess replication that is biased in favor of “failure.”
    
    Joe Simmons’ reply is a good one, but I don’t think that it speaks directly to this passage. I see two objections in the passage. First, the Data Colada people should pool results from many studies: pool the original studies and the replication studies, and analyze the pooled data. Second, the Data Colada people shouldn’t attempt to replicate just one study from any large set of studies.
    
    These objections seem reasonable to me, but I tend to think that they shouldn’t carry much weight. To begin, the sort of pooling suggested here (“pool all of the original studies as well as the data colada replication studies and other replication studies”) would almost certainly lead to objections from the original authors. They would be able to dismiss the analysis on the grounds that dissimilar studies were pooled. I am not saying that these objections would always be legitimate; I am saying that they would arise even when illegitimate, and that they would be a diversion.
    
    Note too that even when Simmons and his colleagues run only a single replication study, its sample size sometimes approaches or exceeds the combined sample size of all of the original authors’ studies. (Perhaps it always does. I haven’t thoroughly checked this point.) Pooling would therefore not weight the original studies as much as one might imagine.
    
    As far as replicating more studies is concerned, it’s important to recognize that the Data Colada people focus on studies that were done via Mechanical Turk. And they attempt to replicate those studies via Mechanical Turk. Given constraints on the money and time of the Data Colada people, this approach seems wise. Drawing subjects from the same source that the original authors used helps to pre-empt one of the most common criticisms of replication studies: “Your subjects came from X, but my subjects came from Y.” It would be difficult to pre-empt this objection if the Data Colada people were trying to replicate lab studies that were conducted with different kinds of people in different parts of the country.
    
    Choosing to replicate online studies, as the Data Colada people do, is also smart because it’s easier to maintain fidelity to the original authors’ designs when working with online studies. Even if you always use the original authors’ materials, there are inevitably more “researcher degrees of freedom” in lab studies than in most online studies.
    
    - econotarian on August 22, 2020 at 3:42 pm said:
      
      I am glad you find my objections reasonable. You bring up some practical considerations that show how truly limited what the data colada folks do is (and which in the end do not undermine those objections). Thanks for raising these and making me aware.
- Joe Simmons on August 21, 2020 at 1:35 pm said:
  
  Thanks for the feedback on Data Colada. A few points.
  
  First, our motivation is not simply to report failed replications, and we sincerely wish that more of these had succeeded. Personally, I am looking forward to, and in fact desperate for, the day when we can confidently declare that something replicated on Data Replicada. It is so much more fun when surprising things replicate than when they don’t.
  
  Second, our take on Replicada #6 is (predictably) a bit different. It is true that we initially replicated the original results. But at that point we wanted to figure out why those results replicated, and we discovered that (1) there was massive differential attrition, and (2) it seemed to be responsible for the results. If we would have failed to report this problem with the data, it would have been misleading at best and dishonest at worst. You are right that those analyses were ex post, but that’s why we ran two additional studies. In both of them we replicated the differential attrition result that we found in the initial replication. So we are confident that we didn’t just find differential attrition because we were looking to find any kind of problem. We spent tons of time and a fair bit of money ensuring that that problem is a replicable problem (and that the findings are not-so-replicable). We were also careful to emphasize that our efforts were inconclusive with respect to the original study, as it is unclear whether that study suffered from the same problems (and to the same degree) that we encountered.
  
  Third, the goal of Data Replicada is not just to assess whether original findings are replicable. It is also to show how doing replications can teach us many things about how social science is really done, and how to do science better going forward. Sometimes you learn that the key test has an inflated false-positive rate (Replicada #2) or that the original study contained a hidden confound (emphasizing again that materials should always be made available; Replicada #4) or that differential attrition can be a massive problem when you manipulate the kind of writing task you ask people to do (Replicada #6). These are all lessons that I appreciate much more now than I did before, and I hope they are lessons that help other researchers/editors/reviewers going forward. Undoubtedly, as we continue to do this, we will learn more things that help us to better conduct and evaluate research going forward.
  
  Finally, I’m sorry we don’t allow comments, but we think the costs outweigh the benefits. We do (uniquely) always ensure that authors have ample time to give us feedback on our posts in advance, so we don’t say anything out of line or unambiguously inaccurate. And we are always happy to engage with those who email us about our posts. I know this won’t be fully satisfying for everyone.
  
  Thanks for reading.
  
  - Andrew on August 21, 2020 at 2:18 pm said:
    
    Joe:
    
    Thanks for the comment.
    
    I agree especially with your point 3, “the goal of Data Replicada is not just to assess whether original findings are replicable. It is also to show how doing replications can teach us many things about how social science is really done, and how to do science better going forward.” The whole thing of “is it real or is it not,” “did it replicate or did it not,” etc., is a trap, because it attempts to compress scientific understanding into a single noisy bit of information. Literally one bit! I appreciate that in your replications you went beyond that attitude.
    
    Regarding feedback on your posts . . . sometimes yes, sometimes no. For example, last year you ran a post and asked for feedback. I gave you feedback but you said you didn’t want to post that feedback on your blog. So I posted something here. On your blog you did not acknowledge or link to these comments, nor did you or your co-bloggers respond on the thread here. So that’s a case where you did not engage. I’m not saying you have a duty to engage with critics—it’s your call how you want to budget your time—but from my perspective it was frustrating that you asked for feedback and then didn’t do anything with the feedback that was given to you.
    
    - Joe Simmons on August 21, 2020 at 4:36 pm said:
      
      Thanks Andrew.
      
      We have thought long and hard about our Feedback Policy. At the time of that post, our policy (from October 2017 to November 2019) was to allow authors to give feedback to us, but not to give all authors a chance to write a response. This is because we noticed that some authors were not giving us honest feedback ahead of publication about, say, an inaccuracy, but were instead withholding that feedback so that they could write about it in their responses. In other words, someone would decline to point out a (fixable) mistake in our post so that they could write about that mistake in their response. That was our rationale then for not linking to responses. And over email we were very open about that being our rationale.
      
      Nevertheless, and at least partially motivated by that exchange, we have recently (in November 2019) revised our feedback policy and we now do allow authors to write a post. Our newer feedback policy is better (https://datacolada.org/feedback_policy), and here’s the relevant quote:
      
      “In some circumstances, authors may also want to write a response to our post. We will link to any response at the end of our post, either when the post goes live or, if we do not receive it within 7 days, whenever the authors do provide it.
      
      If authors’ written responses reveal something we wished they had provided as feedback ahead of time, that is, anything inaccurate, misleading, confusing, snarky, or poorly worded in our post, we may change the post to address these problems. We hope this motivates authors to give us feedback about any such problems before they write a response.”
      
      I’m sorry this policy wasn’t in place for you back in April 2019, when our post about heterogeneity became public. We are not perfect but we do try to improve, and we believe the new policy is better.
    - Andrew on August 21, 2020 at 4:43 pm said:
      
      Good to know!
    - Andrew on August 22, 2020 at 8:24 am said:
      
      P.S. It’s still not too late for you to add a link in that post to our discussion here. It still seems odd to me that you contacted me asking for feedback, I gave you feedback, and then you didn’t even link to it. Again, it’s your call but it seems like a strange thing to do.
  - econotarian on August 21, 2020 at 7:03 pm said:
    
    Joe says his motivation is not to report failed replications, but this is belied by his comment
    
    > It is true that we initially replicated the original results. But at that point we wanted to figure out why those results replicated…You are right that those analyses were ex post.
    
    Joe and his data colada colleagues never try to figure out why results do _not_ replicate. If they did, they might figure out why and then learn the effect was replicable in the first place thereby undermining their aim.
    
    This asymmetric treatment of results that do versus do not replicate is hypocritical and shows their true motivations.
    
    And his comment that they do not allow comments because “we think the costs outweigh the benefits” is a nonresponse. Obviously it is their blog and they can do what they want, but that does not exemplify the claimed happiness to engage.
    
    - Joe Simmons on August 21, 2020 at 11:39 pm said:
      
      1. We do everything in our power to make sure that failed replications are informative. If they fail, it is not because our sample is too small or because we did not faithfully replicate the methods/sample of the original study, since we effectively collaborate with original authors to design these replications (and in many cases use their exact materials). We do all of the hard work to ensure that a failure indicates that the original study is not (currently) replicable at the (much larger) sample size that we have used, using the same methods/sample that the original authors used and/or told us to use. Given this, to further chase down exactly *why* a replication failed would in many cases be impossible or at least prohibitively expensive or intrusive, as you can’t prove that an effect is zero (or, without tons of money/subjects, extremely close to zero) and you cannot observe either what an original author might have (perhaps unwittingly) done to capitalize on chance or whether s/he just got “lucky”. Mostly what we can say is exactly what we do say in almost every case of failure: If the effect exists it seems like you can’t reliably detect it using the sample size of the original study. We would love for an original author to say, “You did X wrong. If you instead run the study under Condition Y, the effect will emerge.” So far as we have understood, none have said that.
      
      2. Nevertheless, if you read our Data Replicada series in detail, you will notice that we often do not stop after a single failure. For example, in Replicada #5, we failed to replicate the original result, but then thought that maybe it might replicate if we use a Prolific sample. So we tried that, and the results were the same. Similarly, in Replicada #7, we tried *three* times to get the effect. Whenever there is any sort of plausible and testable reason why a replication might have failed or succeeded, we usually try to chase it down (if we can afford to do so). We did exactly that in Replicada #4. Our goal is to get it right, with the provision that we do not have infinite time and resources to endlessly explore the ever-elusive cause of every failure.
      
      3. When we do replicate a result (as in Replicada #6), it is very important to assess whether the authors’ theory is correct or whether there is some other explanation for it. It would be extremely irresponsible of us to have just said, “It replicates! You can believe their paper” when actually the cause of the effect in our replication was due to differential attrition. This is not just something I do when I observe a successful replication of others’ work. It is also something I do whenever my own study yields a positive result, especially when I have never observed that result before. For example, I now am trained to try to make sure that any positive result I see in my own research is not caused by differential attrition. And I’ve long been trained to just directly replicate any surprising result that I observe, and to make sure that the effect is not driven by some arbitrary procedural detail or selection of stimuli. An effect is worthy of reporting only when it holds up in the face of attempts to eradicate or trivialize it. When it is battle tested. All researchers should try to poke holes in their own positive results, preferably early in a project, before you get too attached. A battle-tested effect is an effect worth studying.
      
      4. If you, econotarian, have a testable hypothesis for why any of our Replicadas have failed, we would love to hear your *very specific feedback* about how to proceed. I suggest you email us ([email protected]) so we can have the kind of civil, in-good-faith, non-anonymous discussion that has a chance to be productive, rather than the kind of discussions that are more typical of the comments sections of blog posts.
      
      5. Finally, I would like to call your attention to an excellent Data Colada post written by Uri Simonsohn a few years ago, about how to engage in civil and productive scientific discourse: https://datacolada.org/52. His “Idea 2: Don’t speculate about motives” is a good one.
    - econotarian on August 22, 2020 at 3:34 pm said:
      
      Thanks for taking the time to respond to a lowly reg monkey like myself, Joe. There is a lot to like in what you write.
      
      However, your insistence on taking this conversation private inhibits the kind of civil and good faith discussion you claim to want. As does not linking to Andrew’s feedback that you solicited as he points out in the comment directly above.
      
      Anyway, I’m sorry for speculating about your motives and agree with Uri’s comment about that. However, since you now raise this point, I will not speculate but will simply ask:
      
      Are you saying that you and your data colada coauthors are not in fact motivated to find failed replications?
      
      My impression as an outsider was the whole JDM stuff was not really gaining much traction, and you guys really made a name for yourselves and became notorious for pointing to failed replications and emphasizing “false positives.” It thus wouldn’t look so good for you if you found that time and time again effects replicated.
    - Joe Simmons on August 25, 2020 at 11:09 am said:
      
      Thanks for the apology.
      
      I *prefer* to have these exchanges over email not because they are private, but because they are usually more polite, honest, and productive. But I am not *insisting* that this exchange be private…
      
      As for your question about whether we “want” the replications to fail, it’s complicated…
      
      First, do I want the field of marketing to have a replicability problem? No, absolutely not. If I could snap my fingers and make all of the replicability problems go away, I would do so without hesitation. A field without replicability problems is what we’ve been fighting for: a more just field, a field that rewards good true careful science more than sexy false slipshod science. People who only know me for my methods work might say, “yeah, but wouldn’t you be harmed by this kind of development?” No. It would be better for my students, better for my mental health, and no worse for my career. I’d be able to focus on why I got into this business in the first place – the discovery and interpretation of facts about human beings – rather than this strange unplanned detour that, although I acknowledge being one of the luckiest people on the planet, has itself imposed far more personal costs than benefits. I am very proud of the methods work I have done these past 10 years in collaboration with Leif and Uri, but I would breathe a huge sigh of relief if everything were suddenly fine.
      
      But the field is almost certainly not fine. And if it isn’t fine, I want to find out and reveal that, so that the field is motivated to improve. So the real answer is something you may not find to be satisfying: I want replicable effects to replicate and non-replicable effects to not replicate. If we are replicating effects that do not exist (or are too small to study at our sample sizes), I want our results to be consistent with that truth. If we are replicating effects that do exist, I want our results to be consistent with *that* truth.
      
      I suppose that one could accuse us of trying to replicate studies that are unlikely to replicate, though in a healthy field that should be impossible to successfully achieve. Well, on the one hand, we do avoid choosing studies which are likely to replicate because their results are potentially caused by artifacts, demand, or (visible) confounds. But otherwise, we set out to choose (online) studies that are representative of the field’s contributions. We have laid out our criteria for study selection here: https://datacolada.org/81. Two points are worth emphasizing. First, by far the most important criterion we rely on to select a study is whether a replication would be informative both if it succeeds and if it fails. For example, we recently pulled the plug on an attempt because we decided a failed replication of that study would not be informative in the context of the pandemic. Second, within a paper, we try to select the online study with the strongest evidence. Obviously, that’s a bias that should lead to more successes.
      
      Another way to approach this question is to focus on our personal incentives in conducting these replications. Which mistake is a worse one for us to make (from a self-interested perspective): to falsely conclude that a study is replicable when it isn’t or to falsely conclude that a study is not replicable when it is? Clearly, for us, the second error is much worse. We are aware that being critical makes us less popular. Despite what it might seem, we don’t enjoy that at all. We try to mitigate our unpopularity by always striving to be polite, by giving the last word to authors whose work we are trying to replicate, and by going to the greatest lengths possible to be accurate. We know that people are ready to pounce on any error we make, especially if we were to err on the side of saying that someone is wrong when they are not. This means that after we observe a failure, we are very motivated to consider why we might be mistaken. That’s a big reason why we have often attempted additional replications even after the first one failed. That’s also a big reason why we are very careful about how we frame our conclusions. We need to be right.
      
      But we are not merely selfish actors. We are human beings who have empathy for other human beings. And in all of our Data Replicada experiences thus far, we have had a good rapport with the original authors, and I almost always find myself rooting for the opportunity to tell those authors that we are going to report a successful replication. It almost always feels bad to email these nice folks with a draft of a post that indicates that their finding did not replicate. And so it creates this conflicting motivation, where on the whole we do want to reveal that a truly problematic field is indeed problematic, but we are never excited to find that an individual study does not replicate.
      
      And this gets to the last thing I want to say, and this is a thing that gets me in trouble with some of those who are more zealous about all of this than I am. I think that most of the folks who publish non-replicable findings in the field of marketing are *victims* of a reward structure and a set of journal policies and operations that push them to engage in the subtle, personally justifiable, but ultimately harmful research practices that produce false-positive findings. Try being a behavioral marketing professor who pre-registers your studies, discloses all of your methodological details, and posts your materials and data. You are at a meaningful disadvantage, because you are not going to be able to routinely generate the kinds of novel findings and complex theories that the two most important journals are eager to publish. The people working within this system have every incentive to follow (and believe in) the methodological norms that everyone else follows (and that they were trained to follow).
      
      So failing to replicate individual studies often feels shitty, because I don’t largely/usually blame the authors, but rather the journal editors and editorial boards, who actually have the power to put rules in place that incentivize everyone to put in the hard, careful work required to produce true, often-non-earth-shattering findings that advance our understanding of human behavior. But what are we supposed to do with a field controlled by people who blindly assume or insist that all of its non-obvious findings are replicable? Are we supposed to just believe them, or doesn’t someone have an obligation to actually test whether those findings are replicable? As Simine Vazire and others frequently point out, science is not *self*-correcting. Someone actually needs to do the correcting. For better or for worse, I can’t help myself.
    - Martha (Smith) on August 22, 2020 5:09 PM at 5:09 pm said:
      
      Joe said,
      “All researchers should try to poke holes in their own positive results, preferably early in a project, before you get too attached. A battle-tested effect is an effect worth studying.”
      
      Yes, yes, and yes!
    - Martha (Smith) on August 22, 2020 5:12 PM at 5:12 pm said:
      
      See also here: https://statmodeling.stat.columbia.edu/2020/08/22/david-spiegelhalter-wants-a-checklist-for-quality-control-of-statistical-models/#comment-1419542
Anoneuoid on August 20, 2020 2:55 PM at 2:55 pm said:

Every study needs to be replicated by default, and this has nothing to do with whether an effect is “real”. It has to do with learning all the important conditions of the experimental situation that must be communicated to get a reproducible method.

- [email protected] on August 20, 2020 2:59 PM at 2:59 pm said:
  
  Excellent Excellent Excellent
  
Adede on August 20, 2020 4:48 PM at 4:48 pm said:

What is the definition of a “failed” replication? If there are 6 studies, each with a p=.06 effect in the same direction as the original, then collectively they point to the original results being true (but maybe less strong than initially suspected).

- Paul Owen on August 20, 2020 6:11 PM at 6:11 pm said:
  
  Excellent point. Using Fisher’s method on six p = 0.06 studies, we get ChiSq(df=2*6) = 6 * [-2 log(0.06)] = 33.76 or p = 0.0007. Seems like really strong evidence and that is still ignoring that provided by the original study(ies)!
  
  - Nick Adams on August 20, 2020 6:22 PM at 6:22 pm said:
    
    Sure, it’s strong evidence for an effect in the same direction but replication also requires confirmation of a similar effect size. A 2-sided p-value against a null of no effect is a test of existence that tells you nothing about the magnitude of the effect.
    
    - Paul on August 20, 2020 7:27 PM at 7:27 pm said:
      
      Absolutely: if effect sizes vary substantially from study to study, the effect does not seem very replicable even if the sign is consistent. In fact, speaking of _the_ effect seems kind of nonsensical in this scenario.
    - Adede on August 21, 2020 2:18 PM at 2:18 pm said:
      
      Of course, but where do you draw the line? If a replication finds an effect size N% as big as the original study, you’d call it a success, but finding an effect size (N-1)% as big gets the study stamped with a big red “FAIL”? Ultimately, measuring the effect size doesn’t fit well into a dichotomous replicated/not-replicated framework.
    - Paul Owen on August 21, 2020 6:50 PM at 6:50 pm said:
      
      Yes, exactly — it’s not a dichotomous matter!
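
The Fisher's-method arithmetic quoted in Paul Owen's comment above can be checked with a few lines of Python. This is a minimal sketch, not anything posted in the thread: the function name `fisher_combined` is hypothetical, the log is the natural log, and the calculation assumes the six studies are independent. It also inherits the caveat raised in the replies: the combined p-value speaks to whether there is any effect at all, not to its size.

```python
import math


def fisher_combined(p_values):
    """Combine independent p-values with Fisher's method.

    Returns the chi-square statistic and the combined p-value.
    (Hypothetical helper for illustration only.)
    """
    k = len(p_values)
    # Under the null, -2 * sum(log p_i) follows a chi-square with 2k df.
    stat = -2.0 * sum(math.log(p) for p in p_values)

    # For an even number of degrees of freedom (2k), the chi-square upper
    # tail has a closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


stat, p = fisher_combined([0.06] * 6)
print(f"chi-square = {stat:.2f} on {2 * 6} df, combined p = {p:.4f}")
# prints: chi-square = 33.76 on 12 df, combined p = 0.0007
```

The result matches the figures in the comment (33.76 and roughly 0.0007). The closed-form tail is used here only to avoid a dependency on a statistics library.
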
Renzo Alves on August 20, 2020 11:29 PM at 11:29 pm said:

As someone who spent 21 years in the field of Personality & Social Psych., I can tell you that there is strong pressure to present positive and “striking” results, while replications or failures to replicate are explicitly devalued (as publishable articles in the better journals, at least). Replications and failed replications should be, if published at all, bundled into a multi-study paper with “positive striking results.”

- Martha (Smith) on August 21, 2020 5:06 PM at 5:06 pm said:
  
  Not pleasant to hear, but worthwhile to communicate. I hope things are at least beginning to change.
  
  - dl on August 21, 2020 6:03 PM at 6:03 pm said:
    
    Political scientist here: they are not.
    
A.P. Salverda on August 21, 2020 5:15 PM at 5:15 pm said:

It’s amusing to see that many authors of the original studies are “happy to see a trend in the same direction”. I wonder how many of them understand that if there is no true effect, chances that a replication study will find an effect “in the same direction” as the original study are 50%. Blessed is the researcher whose happiness is derived from a coin flip coming up heads.
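
The 50% figure can be illustrated with a small simulation. This is only a sketch under assumed settings that are not from the thread: two groups of 20 participants per study, standard normal outcomes, and no true difference between groups. The function names are hypothetical.

```python
import random


def simulated_effect(rng, n_per_group):
    """Difference in group means when both groups share one distribution."""
    treated = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
    control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
    return treated - control


def same_direction_rate(n_pairs=5000, n_per_group=20, seed=0):
    """Share of original/replication pairs whose estimates have the same sign."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_pairs):
        original = simulated_effect(rng, n_per_group)
        replication = simulated_effect(rng, n_per_group)
        if (original > 0) == (replication > 0):
            matches += 1
    return matches / n_pairs


rate = same_direction_rate()
print(f"same-direction rate under the null: {rate:.3f}")
```

With no true effect, the sign of the replication estimate is independent of the sign of the original, so the rate should land close to 0.5 regardless of the sample size chosen.
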

I've sat in psychological experiments too much on August 24, 2020 3:12 AM at 3:12 am said:

I just performed a conceptual replication (N = 2). This was a within-subjects style study in which the participants had to judge whether the product seemed “more effective” in either of the ads. The participants were naïve to the results of the original studies.

Both came to the conclusion that the product seemed more effective in the ad with a single bottle. Rationale was along the lines that the ad with a single bottle “seems kind of more professional, in the ad with multiple bottles it seems like it is just some sort of bulk product”.

Anyway, I think it’s important to remember that calling a variable “perceived effectiveness” doesn’t mean that it measures “perceived effectiveness”. This is a common fallacy in psychological research, or in this case, consumer research. When pressed with a question such as “whaddaya think of these ads?” it’s easy to come up with all sorts of pseudo-narratives that help answer the question, as happened in my humorous “replication”: sure, when thinking about differences between the ads, I can come up with something like “the one with single bottle seems more professional” and choose that ad as the one in which the product “seems more effective”, but that does not really mean that my “perceived effectiveness” of the product would be different. Perceived effectiveness does not exist.

Also, I don’t think this would in any realistic way affect my real-world decisions regarding which cleaning product I’d buy, which, I guess, is the main motivation behind this sort of study.

Going back to the “pseudo-narratives”, in my experience it is actually psychologically almost necessary to come up with those in this kind of study. I’ve wasted quite a bit of my time in all sorts of psychological experiments in which I’ve had to give my opinion on things that I really don’t care about. The only way to survive that is to come up with a story and kind of use that as a heuristic for the responses. This does not mean that there wouldn’t be a difference in the variables of interest; I can say that in my case some kind of effect WAS replicated, it was in the other direction, but still. The problem just is that this doesn’t mean anything beyond the point that I can come up with some yadda-yadda idea about a single bottle seeming more professional and picking that as the one that seems more effective.

It never ceases to amaze me how stuck researchers are in the age-old “input-output” model of psychology. All models are wrong, yes, but they should also be useful, as that old saying continues. I think this kind of research, in which participants are fed an input of stimuli (in this case ads with differing numbers of bottles) and researchers expect to get some sort of constant output that would tell us anything useful about the cognitive processes of interest or, even more desperately, real-world consumption, is so wrong that it is not useful in any sense.


Statistical Modeling, Causal Inference, and Social Science


