Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise.

Posted on December 13, 2018 9:40 AM by Andrew

I have a sad story for you today.

Jason Collins tells it:

In The (Honest) Truth About Dishonesty, Dan Ariely describes an experiment to determine how much people cheat . . . The question then becomes how to reduce cheating. Ariely describes one idea:

We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever.

Sounds pretty impressive! But these things all sound impressive when described at some distance from the data.

Anyway, Collins continues:

This experiment has now been subject to a multi-lab replication by Verschuere and friends. The abstract of the paper:

. . . Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory . . . moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. . . .

And what happened? It’s in the graph above (from Verschuere et al., via Collins). The average estimated effect was tiny, it was not conventionally “statistically significant” (that is, the 95% interval included zero), and it “was numerically in the opposite direction of the original study.”

As is typically the case, I’m not gonna stand here and say I think the treatment had no effect. Rather, I’m guessing it has an effect which is sometimes positive and sometimes negative; it will depend on person and situation. There doesn’t seem to be any large and consistent effect, that’s for sure. Which maybe shouldn’t surprise us. After all, if the original finding was truly a surprise, then we should be able to return to our original state of mind, when we did not expect this very small intervention to have such a large and consistent effect.

I promised you a sad story. But, so far, this is just one more story of a hyped claim that didn’t stand up to the rigors of science. And I can’t hold it against the researchers that they hyped it: if the claim had held up, it would’ve been an interesting and perhaps important finding, well worth hyping.

No, the sad part comes next.

Collins reports:

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.

That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.

You can read the response and judge for yourself. I think Collins’s report is accurate, and that’s what made me sad. These people care enough about this topic to conduct a study, write it up in a research article and then in a book—but they don’t seem to care enough to seriously entertain the possibility they were mistaken. It saddens me. Really, what’s the point of doing all this work if you’re not going to be open to learning?

(See this comment for further elaboration of these points.)

And there’s no need to think anything done in the first study was unethical at the time. Remember Clarke’s Law.

Another way of putting it is: Ariely’s book is called “The Honest Truth . . .” I assume Ariely was honest when writing this book; that is, he was expressing sincerely-held views. But honesty (and even transparency) are not enough. Honesty and transparency supply the conditions under which we can do good science, but we still need to perform good measurements and study consistent effects. The above-discussed study failed in part because of the old, old problem that they were using a between-person design to study within-person effects; see here and here. (See also this discussion from Thomas Lumley on a related issue.)
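
To make the between-person point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the function name `simulate_between_person`, the effect and noise scales, the sample size); none of it comes from the Mazar et al. data or from the replication. Each simulated person has their own treatment effect, positive for some and negative for others, and the between-person comparison only ever sees one condition per person.

```python
import random
import statistics


def simulate_between_person(n_people=5000, effect_sd=1.0, noise_sd=2.0, seed=0):
    """Hypothetical between-person experiment with person-specific effects.

    Each person's effect is drawn from a distribution centered at zero, so the
    treatment helps some people and hurts others. Only one condition is
    observed per person, which is what a between-person design gives you.
    """
    rng = random.Random(seed)
    treated, control, effects = [], [], []
    for _ in range(n_people):
        baseline = rng.gauss(5.0, 1.0)        # person's untreated outcome
        effect = rng.gauss(0.0, effect_sd)    # person-specific treatment effect
        effects.append(effect)
        if rng.random() < 0.5:                # randomize to one condition only
            treated.append(baseline + effect + rng.gauss(0.0, noise_sd))
        else:
            control.append(baseline + rng.gauss(0.0, noise_sd))
    estimate = statistics.mean(treated) - statistics.mean(control)
    share_positive = sum(e > 0 for e in effects) / n_people
    mean_abs_effect = statistics.mean(abs(e) for e in effects)
    return estimate, share_positive, mean_abs_effect


estimate, share_positive, mean_abs_effect = simulate_between_person()
print("between-person estimate of the average effect:", round(estimate, 3))
print("share of people with a positive effect:", round(share_positive, 3))
print("average size of the individual effects:", round(mean_abs_effect, 3))
```

Under these made-up settings the individual effects are not small, but the difference in group means estimates only their average, which is close to zero. That is one way an effect can be real for individuals and still show nothing large and consistent in this kind of comparison.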

P.S. Collins links to the original article by Mazar, Amir, and Ariely. I guess that if I’d read it in 2008 when it appeared, I’d’ve believed all its claims too. A quick scan shows no obvious problems with the data or analyses. But there can be lots of forking paths and unwittingly opportunistic behavior in data processing and analysis; recall the 50 Shades of Gray paper (in which the researchers performed their own replication and learned that their original finding was not real) and its funhouse parody 64 Shades of Gray paper, whose authors appeared to take their data-driven hypothesizing all too seriously. The point is: it can look good, but don’t trust yourself; do the damn replication.

P.P.S. This link also includes some discussions, including this from Scott Rick and George Loewenstein:

In our opinion, the main limitation of Mazar, Amir, and Ariely’s article is not in the perspective it presents but rather in what it leaves out. Although it is important to understand the psychology of rationalization, the other factor that Mazar, Amir, and Ariely recognize but then largely ignore—namely, the motivation to behave dishonestly—is arguably the more important side of the dishonesty equation. . . .

A closer examination of many of the acts of dishonesty in the real world reveals a striking pattern: Many, if not most, appear to be motivated by the desire to avoid (or recoup) losses rather than the simple desire for gain. . . .

The feeling of being in a hole not only originates from nonshareable unethical behavior but also can arise, more prosaically, from overly ambitious goals . . . Academia is a domain in which reference points are particularly likely to be defined in terms of the attainments of others. Academia is becoming increasingly competitive . . . With standards ratcheting upward, there is a kind of “arms race” in which academics at all levels must produce more to achieve the same career gains. . . .

An unfortunate implication of hypermotivation is that as competition within a domain increases, dishonesty also tends to increase in response. Goodstein (1996) feared as much over a decade ago:

. . . What had always previously been a purely intellectual competition has now become an intense competition for scarce resources. This change, which is permanent and irreversible, is likely to have an undesirable effect in the long run on ethical behavior among scientists. Instances of scientific fraud are almost sure to become more common.

Rick and Loewenstein were ahead of their time to be talking about all that, back in 2008. Also this:

The economist Andrei Shleifer (2004) explicitly argues against our perspective in an article titled “Does Competition Destroy Ethical Behavior?” Although he endorses the premise that competitive situations are more likely to elicit unethical behavior, and indeed offers several examples other than those provided here, he argues against a psychological perspective and instead attempts to show that “conduct described as unethical and blamed on ‘greed’ is sometimes a consequence of market competition” . . .

Shleifer (2004) concludes optimistically, arguing that competition will lead to economic growth and that wealth tends to promote high ethical standards. . . .

Wait—Andrei Shleifer—wasn’t he involved in some scandal? Oh yeah:

During the early 1990s, Andrei Shleifer headed a Harvard project under the auspices of the Harvard Institute for International Development (HIID) that invested U.S. government funds in the development of Russia’s economy. Shleifer was also a direct advisor to Anatoly Chubais, then vice-premier of Russia . . . In 1997, the U.S. Agency for International Development (USAID) canceled most of its funding for the Harvard project after investigations showed that top HIID officials Andrei Shleifer and Jonathan Hay had used their positions and insider information to profit from investments in the Russian securities markets. . . . In August 2005, Harvard University, Shleifer and the Department of Justice reached an agreement under which the university paid $26.5 million to settle the five-year-old lawsuit. Shleifer was also responsible for paying $2 million worth of damages, though he did not admit any wrongdoing.

In the above quote, Shleifer refers to “conduct described as unethical” and puts “greed” in scare quotes. No way Shleifer could’ve been motivated by greed, right? After all, he was already rich, and rich people are never greedy, or so I’ve heard.

Anyway, that last bit is off topic; still, it’s interesting to see all these connections. Cheating’s an interesting topic, even though (or especially because) it doesn’t seem that it can be turned on and off using simple behavioral interventions.

44 thoughts on “Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise.”

Garnett on December 13, 2018 at 10:58 am said:

“The above-discussed study failed in part because of the old, old problem that they were using a between-person design to study within-person effects….”

I wonder if the veracity of this statement depends on the objective of the study.

Suppose you have a room full of cheaters and you want to apply a “treatment” to reduce cheating. Isn’t the idea to pick the “treatment” (e.g. recall books or recall ten commandments) that offers the best predicted outcome (however defined)? If I read Gelman and Hill p.173 correctly, then the randomized design offers the opportunity to estimate the average causal effect so that, presumably, one can predict the difference in outcomes in a new patient population under either treatment regimen.

I don’t think the goal is to take a room full of people who recall either ten commandments or ten books and see how their cheating might change with a change in treatment regimen.

Ed Hagen on December 13, 2018 at 11:31 am said:

I think the replication crisis has a pretty simple cause: we academics have based our career success on sexy outcomes, but we can’t control our outcomes, so we p-hack. More here:

https://grasshoppermouse.github.io/2017/12/05/academic-success-is-either-a-crapshoot-or-a-scam/

We need to change the system so academic researchers are rewarded for running high quality studies, regardless of outcome. That would align researcher incentives with scientific progress. Under the current system where our professional standing is based on outcomes, it’s no wonder folks defend their outcomes to the death.

- Andrew on December 13, 2018 at 11:37 am said:
  
  +1
  
- Sameera Daniels on December 13, 2018 at 2:23 pm said:
  
  I’m not sure how sexy these outcomes are to begin with. In some cases they verge on silly.
  
  - Martha (Smith) on December 13, 2018 at 10:16 pm said:
    
    +1
    
  - JFA on December 14, 2018 at 10:13 am said:
    
    You have to control for the underlying average sexiness of results in academia. Then it all makes sense!
    
  - Terry on December 14, 2018 at 10:19 am said:
    
    I had that suspicion while flipping through the Mazar et al. paper.
    
    Maybe psychology is gaining profound insights by meticulously exploring the intricate workings of the human psyche, or maybe psychology is just finding verbose ways to say obvious things. (It is easy for an ignorant outsider to mock a field’s jargon. … That’s why I do it so often. I can feel intellectually superior without any real expenditure of effort.)
    
    Here is a paragraph from the paper that seems to be breathlessly discovering the concept ignorant outsiders call “conscience.”
    
    Applied to the context of (dis)honesty, we propose that one major way the internal reward system exerts control over behavior is by influencing people’s self-concept—that is, the way people view and perceive themselves (Aronson 1969; Baumeister 1998; Bem 1972). Indeed, it has been shown that people typically value honesty (i.e., honesty is part of their internal reward system), that they have strong beliefs in their own morality, and that they want to maintain this aspect of their self-concept (Greenwald 1980; Griffin and Ross 1991; Josephson Institute of Ethics 2006; Sanitioso, Kunda, and Fong 1990). This means that if a person fails to comply with his or her internal standards for honesty, he or she will need to negatively update his or her self-concept, which is aversive. Conversely, if a person complies with his or her internal standards, he or she avoids such negative updating and maintains his or her positive self-view in terms of being an honest person. Notably, this perspective suggests that to maintain their positive self-concepts, people will comply with their internal standards even when doing so involves investments of effort or sacrificing financial gains (e.g., Aronson and Carlsmith 1962; Harris, Mussen, and Rutherford 1976; Sullivan 1953). In our gas station example, this perspective suggests that people who pass by a gas station will be influenced not only by the expected amount of cash they stand to gain from robbing the place, the probability of being caught, and the magnitude of punishment if caught but also by the way the act of robbing the store might make them perceive themselves
    
    - Jonathan (another one) on December 14, 2018 at 3:27 pm said:
      
      You always get extra credit here for citing Bem.
    - Andrew on December 14, 2018 at 3:30 pm said:
      
      Jonathan:
      
      I think Bem wrote that paper in 1972 because he knew that it would be referred to in that way 45 years later.
- Mark White on December 13, 2018 at 10:02 pm said:
  
  Success being “crapshoot or scam” is a really good way to put it. It is exactly how I used to feel. When I was in a PhD program, I ran a lot of larger N studies on MTurk because I wasn’t going to p-hack, so I knew I had to run a lot of experiments—and conceptual replications of these experiments—quickly to get enough publications to get a job. Eventually, I got tired of it. It was too stressful, ruined the fun of research, and so I didn’t go into academia after defending my dissertation (along with other issues I had with academic jobs).
  
- Anonymous on December 14, 2018 at 4:43 pm said:
  
  See here for a recent blog post about “the incentive narrative” that has been brought forward by some folks to try and explain the problematic issues in science. I fear “the incentive narrative” might only be half of the story at best, and could result in a whole new set of “incentives” that aren’t even good for science:
  
  https://www.talyarkoni.org/blog/2018/10/02/no-its-not-the-incentives-its-you/
  
  To try and make my point clearer, and using your post, what if “sexy outcomes” (which we have to possibly p-hack for to get) are merely a symptom of a “bad system” but not the real problematic issue. Perhaps “sexy findings” increase your chances of getting published in an “important” journal, which in turn could get you and your institution (media) attention, which in turn could increase your chances of getting that big grant, which in turn could increase the chances of you receiving all kinds of benefits (e.g. tenure) from your institution because they get a nice piece of that big grant you just received.
  
  Possible solutions for this “bad system” should then, for instance, not involve coming up with all kinds of solutions that cost lots of money. Or possible solutions should then perhaps not involve talking to the media about all these super solutions to help “change the incentives”. If you would do such things, you would effectively be doing the same things that possibly caused this whole mess in the 1st place. The “sexy study” of the past, will now simply be the “sexy improvement of science project”…
  
  I fear awareness of the real issues behind the possibly “bad system” has not been achieved yet. At least not by a large portion of scientists. And i therefore fear any real solutions have currently not yet been implemented.
  
  Here is Binswanger’s (2014) “Excellence by nonsense: the competition for publications in modern science”:
  
  https://link.springer.com/chapter/10.1007/978-3-319-00026-8_3
  
- Anonymous on December 14, 2018 at 7:06 pm said:
  
  1) Quote from above: “We need to change the system so academic researchers are rewarded for running high quality studies, regardless of outcome. That would align researcher incentives with scientific progress.”
  
  I think we should be very, very careful to not make things worse, and be very mindful of what could help to improve matters, and how and why exactly. General terms like “changing the incentives” and “rewarding researchers who perform high quality research” need to be worked out in much more detail when thinking and talking about these matters in my opinion. If not, i think it could be possible that we could end up coming up with a whole new set of “incentives” that may not even be good for science, and could potentially worsen things.
  
  For instance, if i were a director of a firm in the tobacco industry wanting to manipulate research on cigarettes by making clear my cigarettes have all kinds of benefits, i could set up a journal, hire some “special” editors and reviewers, come up with some “collaborative” research effort concerning all the benefits of tobacco but control the data collection and -analyses by having my friends execute them.
  
  Now, i can start writing about how the research system is “broken” because of “the incentives”. And i could write about how we need to “change the incentives” and “reward researchers who perform high quality science”. Luckily, i already have my own journal, my “special” editors and reviewers, and have designed a “collaborative” research project that i can all control and can put forward as a way to “improve” things.
  
  I can now tell everyone to join my project investigating the benefits of tobacco, especially those researchers that want to “improve” science and “help change the incentives”. After some time, i could even start making sure the people that joined my project get a special “badge”, or something like that, to put in their CV so employers could recognize them and possibly give them a job because “we” need to “reward” the researchers that perform “high quality science” and we need to “change the system”.
  
  You get the idea…
  
  2) Quote from above: “Under the current system where our professional standing is based on outcomes, it’s no wonder folks defend their outcomes to the death.”
  
  I find it a wonder, at least sometimes. For instance, why should a tenured researcher “defend their outcomes to the death”? Is their “professional standing based on outcomes” still? I can not understand why a tenured researcher would care about their p-hacked findings anymore.
  
  It seems to me, tenured researchers would have had the majority, if not all, of the benefits of their published “sexy findings” already according to the “publish or perish” and “i needed to publish sexy findings to get a job” narratives. If these narratives are correct, why should they still care about whether or not their “sexy p-hacked findings” replicate for instance?
  
  The only explanations for that behavior that i can come up with is their possible ego, some mental gymnastics because they can’t handle the truth that their findings are “fake”/”untrue”, and/or some other things that could be beneficial to them with regard to their sexy findings. I simply note that if i understood things correctly, all these possibilities are not included in the “publish or perish” and “i needed to publish sexy findings to get a job” narratives…
  
Eric B Rasmusen on December 13, 2018 at 11:44 am said:

Sometimes (usually, in fact) a correcting article, whether about math errors, data errors, or replication, doesn’t need any comment at all from the original authors. The correcting author and the editor should send it to the original author for comment, he should give his comment, if any (usually: “they’re right”), and that should be that. The corrector should thank the original author in his acknowledgements for comment on his correction, usually. I’ve had more than one article with math errors myself.

Each year I send out a list of “Top Ten” clippings with my Christmas cards. See Number 12 (“ten” isn’t literal). It’s an excellent example of a genuine public apology, by Fredrik deBoer, not in research, but for serious false allegations about someone’s behavior.

http://www.rasmusen.org/special/christmas/2018-articles.htm

- Dale Lehman on December 13, 2018 at 12:17 pm said:
  
  Interesting apology, Eric. However, a strange thing about your links is that none of the other links appear to go where they are supposed to – they almost seem like random links. Is anyone else finding that – or do I have gremlins in my browser?
  
  - Eric B Rasmusen on December 13, 2018 at 1:32 pm said:
    
    My fault!
    
    I messed up. It wasn’t the context. It wasn’t differing cultures. It wasn’t my RA. It was me. I fixed it now. Probably. See
    
    http://www.rasmusen.org/special/christmas/2018-articles.htm
    
Elberi on December 13, 2018 at 12:10 pm said:

Indeed a very disturbing non-replication result of a very important paper.

However, I am not sure I am following your argument about the “sad” part.
What exactly is the problem with their response?
To start with, it was not a RESPONSE – the replication results were not available to the authors (Amir, Mazar and Ariely) when they wrote the commentary, as they state clearly:

“As we do not yet know the outcome of the replication attempt, we thought it instructive to describe some of the challenges …, and outline important theoretical and practical considerations to designing experiments in this context”

They then indeed BLUFF and FLUFF about why a replication might not happen (perhaps overly so, but I can understand them), but they DO acknowledge the issues with the method they used (“Our original experiment was run with students, not in a lab but in a large classroom, in one session, as part of a course requirement, in a particular location and at a particular time, as part of a packet that included other studies that we don’t have a record of, and set in a particular culture and country.”). So far – acknowledging limitations in methods we should all learn from, and they state they should have done better.

But, why would you expect them to state that there might be no effect, if they believe (HONESTLY) there is one? This was their hypothesis, you do not question their behavior in the original study (e.g., making up data), and they have not been informed of the replication results.

Would you state, for any paper of yours “perhaps there’s nothing there” just because someone asks to replicate it?

Please elaborate on what is so sad here. I am really not following.

Nell on December 13, 2018 at 12:28 pm said:

Just to note: they say in the first paragraph that they did not know the replication results when they wrote the response… which perhaps makes this not so much “sad” as a response, but instead a reasonable categorization of what might affect the results? I wouldn’t expect them to suggest there is nothing there without having seen evidence to that effect.

“As we do not yet know the outcome of the replication attempt, we thought it instructive to describe some of the challenges involved in setting up the current replication attempt of such magnitude and breadth, and outline important theoretical and practical considerations to designing experiments in this context.”

Dana on December 13, 2018 at 12:44 pm said:

I am not following… Why would they acknowledge “maybe there’s nothing there” before they even saw the replication results?

As you state, they probably didn’t do intentional wrongdoing. They hypothesized what they did – with true intentions and reliance on literature.
They then ran a study which they now acknowledge to suffer from many flaws (not only contextual as you, via Collins, claim they “bluffed”). They list them in their commentary.

So again – what is so sad here?

Andrew on December 13, 2018 at 12:46 pm said:

Elberi, Nell, Dana:

There are a couple options here. First, now that the replication results are known, the original authors could update their reply or write something new in light of these results. Second, even before the replication results are known, I think the authors should consider the very real possibility that their original claims are just artifacts that would not be expected to show up even in an exact replication. Acknowledging limitations is fine, but once you start speculating about reasons for possible non-replication, I think it makes sense to consider the possibility that the original study was too noisy to find anything.

Regarding Elberi’s questions:

1. “But, why would you expect them to state that there might be no effect, if they believe (HONESTLY) there is one?” My answer is that, even if they believe there is an effect, they shouldn’t be so sure that such an effect is detectable in their study. This is the distinction between truth and evidence. If a study does not replicate, this does not necessarily mean the original scientific claim is false; it’s just that the study did not provide strong evidence for the claim. So, I’m not asking them to say “there might be no effect,” I’m asking them to say that their study does not provide the evidence that they were claiming. They can continue to believe in their effect even without strong evidence, just based on theory, or based on some combination of theory and weak evidence.

2. “Would you state, for any paper of yours ‘perhaps there’s nothing there’ just because someone asks to replicate it?” Maybe. It depends on the study. I might say perhaps there’s nothing there. I might say that I really really think that the replication will be successful. What I object to is the original authors giving a bunch of potential explanations for potential non-replication, but not entertaining the possibility that the study might not replicate because the original results were artifactual and noisy. That’s my problem.
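
The distinction between truth and evidence can be illustrated with a minimal sketch. The numbers are assumptions chosen for illustration (a small true effect of 0.1 and a standard error of 0.5; the function name `significance_filter` is also made up), not values taken from the original study or the replication. The sketch simulates many noisy studies of the same small true effect and looks only at the ones that reach conventional statistical significance.

```python
import random
import statistics


def significance_filter(true_effect=0.1, std_error=0.5, n_studies=100_000, seed=1):
    """Simulate noisy studies and keep those with |estimate| > 1.96 * std_error.

    Returns the share of studies that are "significant", how much the
    significant estimates exaggerate the true effect on average, and the
    share of significant estimates that have the wrong sign.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) > 1.96 * std_error]
    share_significant = len(significant) / n_studies
    exaggeration = statistics.mean(abs(e) for e in significant) / true_effect
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return share_significant, exaggeration, wrong_sign


share_significant, exaggeration, wrong_sign = significance_filter()
print("share of studies reaching significance:", round(share_significant, 3))
print("average exaggeration among significant results:", round(exaggeration, 1))
print("share of significant results with the wrong sign:", round(wrong_sign, 3))
```

In this made-up setting the effect is real, yet any estimate that clears the significance threshold must be many times larger than the truth, and a noticeable share point in the wrong direction. A significant result from a design like that is weak evidence, whatever the underlying truth is.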

- Eric B Rasmusen on December 13, 2018 at 1:35 pm said:
  
  I hadn’t noticed that their reply was even before any results were in. Strange. It’s sort of like saying, “Well, it’s nice you’re trying to replicate it, but our original study is so bad it probably isn’t worth even trying.”
  
yyw on December 13, 2018 at 1:15 pm said:

A lot of these non replications in psychology seem to be about studies that are based on the assumption that human behavior and emotion are not only highly malleable but can be manipulated with subtle and short-duration stimulus.

- Andrew on December 13, 2018 at 1:40 pm said:
  
  Yyw:
  
  Yes, and this relates to what we’ve called the piranha problem: one can imagine one or two large and persistent effects of this sort, but it’s hard to picture a world with dozens of these large and persistent effects all coexisting.
  
- Anonymous on December 14, 2018 at 10:59 am said:
  
  “A lot of these non replications in psychology seem to be about studies that are based on the assumption that human behavior and emotion are not only highly malleable but can be manipulated with subtle and short-duration stimulus.”
  
  Yes, i am also under the impression that it is a certain specific type of study and/or topic that receives this massive attention and allocated resources. Like i wrote somewhere below on this post (http://statmodeling.stat.columbia.edu/2018/12/13/oh-hate-work-criticized-case-fails-attempted-replications-original-researchers-dont-even-consider-possibility-maybe-original-work-w/#comment-926448), i wonder more and more if it is still scientifically useful to keep performing these “Registered Replication Reports”…
  
  The only thing this all does, in my reasoning, is:
  
  1) paint a possibly unrepresentative picture of the “replicability” and/or “trustworthiness” of psychology,
  
  2) possibly give repeated (media) attention to “less fruitful” research and research topics (which perhaps ironically will now possibly have a higher chance of resulting in follow-up research that in turn could have the same problematic issues, hereby starting the cycle all over again),
  
  3) possibly “train” researchers (and perhaps even the media, and the general public) into believing massive large scale “collaborative” projects are the “best” and/or “only” way to gather reliable knowledge…
  
  Unless these 3 things are the goal of “Registered Replication Reports”, i would like to suggest that they could start thinking about what it is they are exactly doing, what they are explicitly and implicitly communicating, and what their efforts could result in. As far as i understand, the goal of psychological science is not to try and find an increasingly accurate effect size estimate for instance.
  
  They could possibly start by trying to find out if less resources can possibly be used to come to some representative and useful effect size estimates or other statistics. Perhaps they could:
  
  1) Take a look at all the Registered Replication Reports (RRRs) performed thus far, and randomly take 1, 2, 3, 4, 5, etc. individual labs from one of these RRRs and their individual associated confidence intervals, effect sizes, p-values, no. of participants, etc.
  
  2) Compare the pooling of the information of these 1, 2, 3, 4, 5, etc. individual labs to the overall results of that specific RRR.
  
  3) Try and find out what the “optimal” no. of labs/participants could be to not waste possible unnecessary resources.
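
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the per-lab estimates and standard errors are synthetic (25 hypothetical labs, true effect zero), the helper names `pool` and `subset_spread` are made up, and fixed-effect inverse-variance pooling is one possible choice, not necessarily what any actual RRR used.

```python
import random
import statistics


def pool(estimates, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, (1.0 / sum(weights)) ** 0.5


def subset_spread(estimates, std_errors, k, n_draws=2000, seed=0):
    """Average distance between a random k-lab pooled estimate and the all-lab one."""
    rng = random.Random(seed)
    full, _ = pool(estimates, std_errors)
    labs = range(len(estimates))
    gaps = []
    for _ in range(n_draws):
        chosen = rng.sample(labs, k)  # step 1: draw k labs at random
        sub, _ = pool([estimates[i] for i in chosen],
                      [std_errors[i] for i in chosen])
        gaps.append(abs(sub - full))  # step 2: compare with the full result
    return statistics.mean(gaps)


# Synthetic stand-in for one RRR: 25 hypothetical labs, true effect of zero.
rng = random.Random(42)
std_errors = [rng.uniform(0.15, 0.35) for _ in range(25)]
estimates = [rng.gauss(0.0, se) for se in std_errors]

# Step 3: see how quickly the gap shrinks as more labs are pooled.
for k in (1, 3, 5, 10, 25):
    print(k, round(subset_spread(estimates, std_errors, k), 3))
```

With real per-lab results in place of the synthetic lists, the same loop would show how many labs are needed before a subset's pooled estimate sits close to the full result.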
  
Jonathan on December 13, 2018 at 2:17 pm said:

I’ll try to be interesting. Did they do any grouping cross-checks? If randomized groupings generate different signs, then the total effect is more likely another grouping result, one which another grouping would tend to disprove. I see many simple grouping misunderstandings behind a lot of the replication issues. I wonder if they grasp the set theoretical notion that your final sample is actually a series of groups that make a group, and thus the overall group has a certain likelihood of reflecting the characteristics determined by looking at group variations. If they did check interior sign for instability, then I take this back.

- Jonathan on December 13, 2018 at 3:02 pm said:
  
  Sorry, hit submit by accident.
  
  Any data set’s results reveal what it takes to attenuate and remove their significance. While I’m no longer shocked by the near or total absence of an explanatory model, I do wonder why people don’t construct negative models that are deemed to project into the data space. It’s as simple as visualizing an inversion along a z axis through variations or iterations of an x-y plane. That enables a complete injection and projection across a purported or implied ‘origin’, which is actually a normalized sign choice. And you can flip piece by piece over y or x and then over the other.
  
  The reason I mention this is the experiment talked about is a deeply layered abstraction passing itself off as simple. An example is the history by which the Nazis step-by-step proceeded toward mass murder, thus enticing people to go along who might be appalled by a sudden leap to death camps. That implies there are in fact steps in a learning process which dehumanizes people and turns others into oppressors. I could make the same argument about slavery: if you follow the history of slavery, it shifted from ‘I have physical dominion over you’ to versions of ‘I’m morally better than you’. There are steps in awareness along that path. That there are steps we take for granted. That makes the idea of testing exposure to moral yes material intellectually appealing: if we accept that there are steps toward evil – if we accept there are steps at all – then there must (we hope) be steps toward good. Maybe it’s just asking people to recall the 10 Commandments.
  
  Where in the 10 Commandments does it say ‘don’t cheat on a matrix test?’ It isn’t murder. It isn’t sex with a married woman (because that’s what it actually says is outlawed – because that confuses parentage and causes violence). It isn’t coveting. It could be lying but was this within the confines of an honor code with meaning, like part of a college commitment, or was it just an agreement to take this test? Does the lying gain someone advantages? The point of the ‘lying’ commandment is bearing false witness, and that would be a stretch. Aren’t the 10 Commandments more a lesson that little crap like this doesn’t matter, that you should not kill people, not mistreat parents or children, not steal, and not want someone else’s stuff (including wife) rather than going out and getting your own? These questions are meant to ask about the model: if you’re testing whether this object has a relationship, then you need to test the value you’ve placed on that object and how that is allocated across its attributes. It’s more obvious when you think about ‘books you read in high school’: aren’t those going to be loaded with moral lessons because those are the books you read in high school? Why wouldn’t those lessons, which you very possibly enjoyed reading and learning, have more of an impact than thinking about excerpts from an old religious book?
  
  When I look at this kind of experiment it reminds me of ESP: the idea that this object or this object, both extremely complex, have a greater than chance correlation to ‘cheating’ on an apparently arbitrary task is a lot like guessing at cards shown in some other room. It also reminds me of grouping confusions like college rankings: even if they used rational criteria rationally weighted using good data, what is the actual meaning of number 28 versus number 36 (unless there’s no massive data shift between)? Sensitivity forming bands, right? ESP tests ‘sensitivity’ and this kind of honesty thing tests for ‘sensitivity’. ESP separates people so the model actually disappears: there is no contextual sharing other than the existence of an experiment – and that can be added by not having a first reader look at cards but saying one is and other variations like saying there is no other person involved just a deck of cards flipped one by one on a table by a robot in a room that may or may not be located in this reality. It’s remarkable they’ve gotten away with the construction of the no possible model scenario for so long without anyone recognizing they’ve just violated every concept by which we know things can entangle, which is that there is some real level context of shared existence. It’s a violation of the basic construction of entanglement problems. In the honesty question, they’ve so abstracted the data objects that the objects have unstable signs: which do you think makes you most moral? Which causes you the most resentment? I can come up with dozens of questions like that. Rather than be unentangled to the point of no relation, these are hyper-entangled.
  
Corey on December 13, 2018 at 4:15 pm said:

I have a notion that some economists, at least, don’t regard insider trading as particularly unethical from the perspective of getting accurate price information into the market.

Björn on December 14, 2018 at 2:35 am said:

While the original article does mention that they used students etc., it in fact talks a lot about how they are “a lot more similar to the general public” (than convicts). There was in fact rather a lot of generalization about “…when *people* had the ability to cheat…”, lots of talk of “people” (in general), how the morals of “people” work, stuff about incorporating the findings into economic models etc., which makes it a bit inconsistent to suggest that failed replication is due to some differences in setting. I guess it’s a pattern: originally claim a wide applicability and if challenged with data, retreat to defensible ground no matter how small. I conjecture that people (of course in general) feel that it’s less of a loss of face, if they argue that “Well, on that particular day at 16:37 with this particular group of people at this particular venue, we did observe this. It may not apply even to a different group with the same background (or even the same group) at any other date or time at a different venue (or even the same venue – obviously the by now faded wallpaper will interfere with the experiment), but it is a broadly generalizable claim that deserves to be widely published.”

Dan F. on December 14, 2018 at 3:12 am said:

“Really, what’s the point of doing all this work if you’re not going to be open to learning?”

This has a very simple answer that people who work in fancy universities in the US can’t see. It counts as a publication whether or not it’s wrong, and the goal is to publish, because those are the institutionalized incentives. There’s nothing more to it.

Doing good science/statistics hardly enters the thinking process. People raised and trained in this sort of incentive structure often have seen no model behavior at all different from their own. It never occurs to them that there is another way to do things, much less a better way to do things.

That one studies for the sake of understanding is viewed as pie in the sky idealism of the sort only mathematicians and other social parasites permit themselves.

- Ethan Bolker on December 14, 2018 at 8:42 pm said:
  
  I’m a mathematician. Do I take your comment as a compliment or an insult?
  
Anonymous on December 14, 2018 at 5:20 am said:

I am wondering more and more whether these “Registered Replication Reports” are really a good idea to pursue further. If i understood things correctly, they involve:

1) Possibly wasting resources by using 5000-6000 participants on average (please note that that’s a guess based on the several RRR’s i am familiar with)…Why do they always (seem to) use around 5000-6000 people for a RRR ? Why not 10 x more, or 10 x less ?…

2) Possibly giving way too much power and influence to a small group of editors and reviewers (all belonging to a single, specific “club” and/or journal) who determine which research is “worthy” of an RRR…

3) Possibly giving attention to sub-standard research and researchers for a 2nd time, thereby possibly repeating the exact problematic cycle that has caused the research to become “influential” enough to be even considered for a RRR in the 1st place…

I appreciate the insights that RRR’s have given thus far, but i sincerely wonder if they are a scientifically optimal, and ethical, way to conduct research. I would argue they are not, and possibly set a “bad” precedent from a scientific perspective.

Instead of the next “influential” (probably p-hacked) study (of a certain kind) to be replicated using thousands of participants, why not replicate some recent pre-registered studies that used a few hundred participants at most (i.c. a similar number that the “failed” non-pre registered original studies of the RRR’s performed thus far have used).

I think that could lead to interesting and useful information at this point in time. I am not sure what the scientific use is of continuing to do RRR’s chosen by a very select group of people, of the same kind of studies (that are likely to not successfully replicate), using thousands of participants, over and over again….

Fritz Strack on December 14, 2018 at 5:43 am said:

In science, noise means ignorance.

- Kyle C on December 14, 2018 at 10:56 am said:
  
  “Personally, I felt that this [published paper] actually had a good chance to [replicate],” [Professor Strack] said. How good a chance? “I gave it a 30-percent shot.”
  
  Easily Googlable.
  
Terry on December 14, 2018 at 8:27 am said:

I guess that if I’d read it in 2008 when it appeared, I’d’ve believed all its claims too. A quick scan shows no obvious problems with the data or analyses.

Have things changed since 2008? Would this article have been published in 2018?

Seeing the 95% CI hover just above zero makes me almost completely discount this paper now, or to be more specific, makes me put it in the “maybe there is something to this, but I’m going to ignore it for now because it is probably a waste of my time” bucket.

Terry on December 14, 2018 at 8:59 am said:

I’m actually surprised this intervention didn’t work.

Surely moral exhortation must have an effect some of the time. Otherwise, why would we have so much of it? Is there a literature that has shown that some types of moral exhortation work? Lifelong religious schooling must have some effects. So what kinds of moral exhortation work?

The collapse of the priming literature can’t be taken to prove that moral exhortation never works. Has only a part of the priming literature collapsed? If so, what survives?

Is this priming stuff supposed to be interesting because it isn’t overt, but rather sneaky “priming” which is camouflaged as something innocuous? Or is this priming stuff different because the priming is thought to be trivial? The Mazar et al. paper doesn’t say this. The paper takes its priming quite seriously.

So what’s the bottom line?

- Dzhaughn on December 14, 2018 at 12:56 pm said:
  
  “Otherwise, why would we have so much of it?”
  
  Moral exhortation may “work” in a sense that does not concern the exhorted or their behavior. For instance, certain moral exhortations could be one of several markers for membership or leadership in a social group.
  
- Martha (Smith) on December 15, 2018 at 12:28 am said:
  
  “Surely moral exhortation must have an effect some of the time. Otherwise, why would we have so much of it? ”
  
  I speculate that some people are inclined to engage in “moral exhortation”, and are not influenced (or not very much) by whether or not it has any effect. It may just seem to be “the right (or appropriate) thing to do.”
  
Terry on December 14, 2018 at 9:16 am said:

This is a pretty impressive replication.

Indeed, it seems to be a frequentist’s dream: the same experiment performed repeatedly. Can some additional mileage be wrung out of this? I dunno … are the results normally distributed? Something more interesting?

- Anonymous on December 14, 2018 at 11:08 am said:
  
  “Can some additional mileage be wrung out of this?”
  
  There have now been published 5, 6, or 7 (or something along these numbers) “Registered Replication Reports” (RRR) that i think could in their totality be used to investigate if fewer resources can be used to gather the same (gist of the) information coming from a RRR. I am really bad with statistics and computers, but i reasoned the following could possibly be interesting and/or useful:
  
  1) Take a look at all the Registered Replication Reports (RRR’s) performed thus far, and randomly take 1, 2, 3, 4, 5, etc. individual labs from one of these RRR’s and their individual associated confidence intervals, effect sizes, p-values, no. of participants, etc.
  
  2) Compare the pooling of the information of these 1, 2, 3, 4, 5, etc. individual labs to the overall results of that specific RRR.
  
  3) Do this for all the “Registered Replication Reports” performed thus far, and try and find out if, and what, the “optimal” no. of labs/participants could be to not waste possible unnecessary resources.
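
The three steps above can be sketched in code. This is only a rough illustration under stated assumptions, not anything the RRR authors did: the per-lab effects and standard errors are synthetic (drawn at random, with a true effect of zero), the pooling rule is a simple inverse-variance fixed-effect average, and the names `pool_fixed_effect` and `subsample_error` are hypothetical helpers invented for this example.

```python
import random
import statistics


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, (1.0 / total) ** 0.5


def subsample_error(effects, std_errors, k, n_draws=2000, seed=0):
    """Mean absolute gap between a random k-lab pooled estimate and the all-lab one."""
    rng = random.Random(seed)
    full, _ = pool_fixed_effect(effects, std_errors)
    gaps = []
    for _ in range(n_draws):
        # Step 1: randomly pick k labs (without replacement).
        chosen = rng.sample(range(len(effects)), k)
        # Step 2: pool only those labs and compare with the overall result.
        est, _ = pool_fixed_effect(
            [effects[i] for i in chosen],
            [std_errors[i] for i in chosen],
        )
        gaps.append(abs(est - full))
    return statistics.mean(gaps)


# Synthetic stand-in for one RRR: 25 labs, true effect assumed to be zero.
# These numbers are made up for illustration and are NOT real RRR data.
data_rng = random.Random(1)
n_labs = 25
std_errors = [data_rng.uniform(0.10, 0.20) for _ in range(n_labs)]
effects = [data_rng.gauss(0.0, se) for se in std_errors]

# Step 3: see how the gap shrinks as more labs are included.
for k in (1, 2, 5, 10, 25):
    print(k, round(subsample_error(effects, std_errors, k), 3))
```

With real data one would replace the synthetic lists with each lab's reported effect size and standard error, and a random-effects pooling rule might be more appropriate if the labs differ systematically.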
  
  - Terry on December 14, 2018 at 11:12 pm said:
    
    10 replications does sound a bit much.
    
    - Anonymous on December 15, 2018 at 1:37 pm said:
      
      “10 replications does sound a bit much.”
      
      If i am not mistaken, they performed 25 of them?!
      
      From the above blog post: “The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786)”
Terry on December 14, 2018 at 12:58 pm said:

They should add genetic info to these priming studies. Or better yet just drop the priming and collect genetic info and look at how genetics predicts behavior in the experiment. The genetic studies seem to be finding a lot of links to behavior. Maybe it is time to shift focus. Sounds like an easy way for young researchers to get some reliable results. Maybe the old gold mine is played out and it is time to dig a new hole.

- Martha (Smith) on December 15, 2018 at 12:34 am said:
  
  “The genetic studies seem to be finding a lot of links to behavior.”
  
  But are these studies good quality, or do they find these “links” by p-hacking or other questionable research practices?
  
  - Terry on December 15, 2018 at 9:46 am said:
    
    I have no idea, but it sounds like a new direction that should be looked into.
    
Mayo on December 14, 2018 at 6:39 pm said:

This is the first I’ve heard of this study but it’s weird in a number of ways. First, you tell people to try to remember the 10 commandments, and then give them a chance to make money by cheating on the # solved. The 10 commandments don’t talk of cheating, but presumably the subject gets the message: think about morality. It’s not subliminal. Why not make it more blatant and just announce “now let me remind everyone that it is wrong to cheat and take $ you haven’t earned. Just because no one will see if you’ve deceived, you know it is morally wrong” and then do the test. That should prevent everyone from cheating. Test takers will figure there’s a way the testers can tell.

Could even take the same people and give them a short test on matrices, but no shredding. Then split high and low scorers and give them tests both with and without moral warning but with shredding (and anonymously pay them).
