Pushing the guy in front of the trolley

So. I was reading the London Review of Books the other day and came across this passage by the philosopher Kieran Setiya:

Some of the most striking discoveries of experimental philosophers concern the extent of our own personal inconsistencies . . . how we respond to the trolley problem is affected by the details of the version we are presented with. It also depends on what we have been doing just before being presented with the case. After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five. . . .

I’m not up on this literature, but I was suspicious. Watching a TV show for 5 minutes can change your view so strongly?? I was reminded of the claim from a few years ago, that subliminal smiley faces had huge effects on attitudes toward immigration—it turns out the data showed no such thing. And I was bothered, because it seemed that a possibly false fact was being used as part of a larger argument about philosophy. The concept of “experimental philosophy”—that’s interesting, but only if the experiments make sense.

So I thought I’d look into this particular example.

I started by googling *saturday night live trolley problem* which led me to this article in Slate by Daniel Engber, “Does the Trolley Problem Have a Problem?: What if your answer to an absurd hypothetical question had no bearing on how you behaved in real life?”

OK, so Engber’s skeptical too. I searched in the article for Saturday Night Live and found this passage:

Trolley-problem studies also tell us people may be more likely to favor the good of the many over the rights of the few when they’re reading in a foreign language, smelling Parmesan cheese, listening to sound effects of people farting, watching clips from Saturday Night Live, or otherwise subject to a cavalcade of weird and subtle morality-bending factors in the lab.

Which contained a link to this two-page article in Psychological Science by Piercarlo Valdesolo and David DeSteno, “Manipulations of Emotional Context Shape Moral Judgment.”

From that article:

The structure of such dilemmas often requires endorsing a personal moral violation in order to uphold a utilitarian principle. The well-known footbridge dilemma is illustrative. In it, the lives of five people can be saved through sacrificing another. However, the sacrifice involves pushing a rather large man off a footbridge to stop a runaway trolley before it kills the other five. . . . the proposed dual-process model of moral judgment suggests another unexamined route by which choice might be influenced: contextual sensitivity of affect. . . .

We examined this hypothesis using a paradigm in which 79 participants received a positive or neutral affect induction and immediately afterward were presented with the footbridge and trolley dilemmas embedded in a small set of nonmoral distractors.[1] The trolley dilemma is logically equivalent to the footbridge dilemma, but does not require consideration of an emotion-evoking personal violation to reach a utilitarian outcome; consequently, the vast majority of individuals select the utilitarian option for this dilemma.[2]

Here are the two footnotes to the above passage:

[1] Given that repeated consideration of dilemmas describing moral violations would rapidly reduce positive mood, we utilized responses to the matched set of the footbridge and trolley dilemmas as the primary dependent variable.

[2] Precise wording of the dilemmas can be found in Thomson (1986) or obtained from the authors.

I don’t understand footnote 1 at all. From my reading of it, I’d think that a matched set of the dilemmas corresponds to each participant in the experiment getting both questions, and then in the analysis having the responses compared. But from the published article it’s not clear what’s going on, as only 77 people seem to have been asked about the trolley dilemma compared to 79 asked about the footbridge (I don’t know what happened to those two missing responses) and, in any case, the dependent or outcome variables in the analyses are the responses to each question, one at a time. I’m not saying this to pick at the paper; I just don’t quite see how their analysis matches their described design. The problem isn’t just two missing people, it’s also that the numbers don’t align. In the data for the footbridge dilemma, 38 people get the control condition (“a 5-min segment taken from a documentary on a small Spanish village”) and 41 get the treatment (“a 5-min comedy clip taken from ‘Saturday Night Live’”). The entire experiment is said to have 79 participants. But for the trolley dilemma, it says that 40 got the control and 37 got the treatment. Maybe data were garbled in some way? The paper was published in 2006, long before data sharing was any sort of standard, and this little example reminds us why we now think it good practice to share all data and experimental conditions.

Regarding footnote 2: I don’t have a copy of Thomson (1986) at hand, but some googling led me to this description by Michael Waldmann and Alex Wiegmann:

In the philosopher’s Judith Thomson’s (1986) version of the trolley dilemma, a situation is described in which a trolley whose brakes fail is about to run over five workmen who work on the tracks. However, the trolley could be redirected by a bystander on a side track where only one worker would be killed (bystander problem). Is it morally permissible for the bystander to throw the switch or is it better not to act and let fate run its course?

Now for the data. Valdesolo and DeSteno find the following results:

– Flip-the-switch-on-the-trolley problem (no fat guy, no footbridge): 38/40 flip the switch under the control condition, 33/37 flip the switch under the “Saturday Night Live” condition. That’s an estimated treatment effect of -0.06 with standard error 0.06.

– Footbridge problem (trolley, fat guy, footbridge): 3/38 push the man under the control condition, 10/41 push the man under the “Saturday Night Live” condition. That’s an estimated treatment effect of 0.16 with standard error 0.08.
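
The estimates and standard errors quoted above can be checked with a few lines of Python. This is only a sketch, assuming the usual difference-in-proportions formula with an unpooled binomial standard error; the function name `diff_in_proportions` is made up for illustration, and the counts are the ones reported in this post.

```python
from math import sqrt

def diff_in_proportions(y_control, n_control, y_treated, n_treated):
    """Return (estimate, standard error) for the treated-minus-control
    difference in proportions, using the unpooled binomial formula."""
    p_c = y_control / n_control
    p_t = y_treated / n_treated
    estimate = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    return estimate, se

# Counts as reported above (control = documentary, treated = SNL clip).
switch_est, switch_se = diff_in_proportions(38, 40, 33, 37)
bridge_est, bridge_se = diff_in_proportions(3, 38, 10, 41)

print(f"flip the switch: {switch_est:+.2f} (se {switch_se:.2f})")
print(f"footbridge:      {bridge_est:+.2f} (se {bridge_se:.2f})")
```

Rounded to two decimals, this gives the -0.06 and 0.16 estimates, with standard errors of 0.06 and 0.08, quoted above.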

So from this set of experiments alone, I would not say it’s accurate to write that “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five.” For one thing, it’s not clear who the participants are in these experiments, so the description “Americans” seems too general. But, beyond that, we have a treatment with an effect -0.06 +/- 0.06 in one experiment and 0.16 +/- 0.08 in another: the evidence seems equivocal. Or, to put it another way, I wouldn’t expect such a large difference (“three times more likely”) to replicate in a new study or to be valid in the general population. (See for example section 2.1 of this paper for another example. The bias occurs because the study is noisy and there is selection on statistical significance.)
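
To make the selection-on-significance point a bit more concrete, here is a rough sketch of the kind of calculation involved. It assumes the estimate is normally distributed around a true effect with the standard error from the footbridge comparison (0.08). The true effect of 0.05 is a made-up value chosen purely for illustration, not something estimated from any of these studies, and the function name `exaggeration_ratio` is hypothetical.

```python
from statistics import NormalDist

# Ratio behind the "three times more likely" claim, from the reported counts.
observed_ratio = (10 / 41) / (3 / 38)

def exaggeration_ratio(true_effect, se, z_crit=1.96):
    """Return (power, expected |estimate| / true_effect among results that
    reach |estimate| > z_crit * se), assuming estimate ~ Normal(true_effect, se)."""
    std = NormalDist()
    cutoff = z_crit * se
    a = (cutoff - true_effect) / se    # upper significance threshold, in z units
    b = (-cutoff - true_effect) / se   # lower significance threshold, in z units
    p_hi = 1 - std.cdf(a)              # P(significant and positive)
    p_lo = std.cdf(b)                  # P(significant and negative)
    power = p_hi + p_lo
    # Mean of |estimate| over the two significant tails (truncated normal means).
    mean_abs = (true_effect * (p_hi - p_lo) + se * (std.pdf(a) + std.pdf(b))) / power
    return power, mean_abs / true_effect

# ASSUMPTION: a hypothetical true effect of 0.05, with the se of 0.08 from above.
power, exaggeration = exaggeration_ratio(true_effect=0.05, se=0.08)
print(f"observed ratio of push rates: {observed_ratio:.1f}")
print(f"power: {power:.2f}, exaggeration factor: {exaggeration:.1f}")
```

Under these assumed numbers the power comes out at roughly 0.1, and estimates that clear the significance threshold overstate the assumed true effect by a factor between 3 and 4 on average. A different assumed true effect would give different numbers; the point is only that a noisy study filtered on significance tends to report inflated effects.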

At this point I thought it best to dig deeper. Setiya’s article is a review of the book, “Philosophy within Its Proper Bounds,” by Edouard Machery. I looked up the book on Amazon, searched for “trolley,” and found this passage:

From this I learned that there were some follow-up experiments. The two papers cited are “Divergent effects of different positive emotions on moral judgment,” by Nina Strohminger, Richard Lewis, and David Meyer (2011), and “To push or not to push? Affective influences on moral judgment depend on decision frame,” by Bernhard Pastötter, Sabine Gleixner, Theresa Neuhauser, and Karl-Heinz Bäuml (2013).

I followed the links to both papers. Machery describes these as replications, but none of the studies in question are exact replications, as the experimental conditions differ from the original study. Strohminger et al. use audio clips of comedians, inspirational stories, and academic lectures: no Saturday Night Live, no video clips at all. And Pastötter et al. don’t use video or comedy: they use audio clips of happy or sad-sounding music.

I’m not saying that these follow-up studies have no value or that they should not be considered replications of the original experiment, in some sense. I’m bringing them up partly because details matter (after all, if the difference between a serious video and a comedy video could have a huge effect on a survey response, one could also imagine that it makes a difference whether stimuli involve speech or music, or whether they are audio or video) but also because of the flexibility, the “researcher degrees of freedom,” involved in whether to consider something as a replication at all. Recall that when a study does not successfully replicate, a common reaction is to point out differences between the old and new experimental conditions and then declare that the new study was not a real replication. But if the new study’s results are in the same direction as the old’s, then it’s treated as a replication, no questions asked. So the practice of counting replications has a heads-I-win, tails-you-lose character. (For an extreme example, recall Daryl Bem’s paper where he claimed to present dozens of replications of his controversial ESP study. One of those purported replications was entitled “Further testing of the precognitive habituation effect using spider stimuli.” I think we can be pretty confident that if the spider experiment didn’t yield the desired results, Bem could’ve just said it wasn’t a real replication because his own experiment didn’t involve spiders at all.)

Anyway, that’s just terminology. I have no problem with the Strohminger et al. and Pastötter et al. studies, which we can simply call follow-up experiments.

And, just to be clear, I agree that there’s nothing special about an SNL video or for that matter about a video at all. My concern about the replication studies is more of a selection issue: if a new study doesn’t replicate the original claim, then a defender can say it’s not a real replication. I guess we could call that “the no true replication fallacy”! Kinda like those notorious examples where people claimed that a failed replication didn’t count because it was done in a different country, or the stimulus was done for a different length of time, or the outdoor temperature was different.

The real question is, what did they find and how do these findings relate to the larger claim?

And the answer is, it’s complicated.

First, the two new studies only look at the footbridge scenario (where the decision is whether to push the fat man), not the flip-the-switch-on-the-trolley scenario, which is not so productive to study because most people are already willing to flip the switch. So the new studies do not allow comparison of the two scenarios. (Strohminger et al. used 12 high-conflict moral dilemmas; see here.)

Second, the two new studies looked at interactions rather than main effects.

The Strohminger et al. analysis is complicated and I didn’t follow all the details, but I don’t see a direct comparison estimating the effect of listening to comedy versus something else. In any case, though, I think this experiment (55 people in what seems to be a between-person design) would be too small to reliably estimate the effect of interest, considering how large the standard error was in the original N=79 study.
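
As a back-of-the-envelope check on that sample-size concern, one can scale the original standard error by the square root of the ratio of sample sizes. This is a crude sketch: it assumes the same kind of two-group comparison of a binary response with similar response rates, which may not hold here since the follow-up used a different set of dilemmas and stimuli. The variable names are made up for illustration.

```python
from math import sqrt

# Figures from the post: the original footbridge comparison had N = 79 and se 0.08.
original_n, original_se = 79, 0.08
followup_n = 55  # sample size mentioned above for Strohminger et al.

# ASSUMPTION: same design and similar response rates, so se scales as 1 / sqrt(N).
projected_se = original_se * sqrt(original_n / followup_n)
print(f"projected se with N = {followup_n}: {projected_se:.3f}")
```

Under that assumption the standard error would be around 0.10, somewhat larger than the already wide 0.08 in the original study.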

Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?”).

I guess the question is, does the constellation of these results represent a replication of the finding that “situational cues or causal factors influencing people’s affective states—emotions or moods—have consistent effects on people’s general judgments about cases”?

And my answer is: I’m not sure. With this sort of grab bag of different findings (sometimes main effects, sometimes interactions) with different experimental conditions, I don’t really know what to think. I guess that’s the advantage of large preregistered replications: for all their flaws, they give us something to focus on.

Just to be clear: I agree that effects don’t have to be large to be interesting or important. But at the same time it’s not enough to just say that effects exist. I have no doubt that affective states affect survey responses, and these effects will be of different magnitudes and directions for different people and in different situations (hence the study of interactions as well as main effects). There have to be some consistent or systematic patterns for this to be considered a scientific effect, no? So, although I agree that effects don’t need to be large, I also don’t think a statement such as “emotions influence judgment” is enough either.

One thing that does seem clear, is that details matter, and lots of the details get garbled in the retelling. For example, Setiya reports that “Americans are three times more likely” to say they’d push someone, but that factor of 3 is based on a small noisy study on an unknown population, and for which I’ve not seen any exact replication, so to make that claim is a big leap of faith, or of statistical inference. Meanwhile, Engber refers to the flip-the-switch version of the dilemma, for which case the data show no such effect of the TV show. More generally, everyone seems to like talking about Saturday Night Live, I guess because it evokes vivid images, even though the larger study had no TV comedy at all but compared clips of happy or sad-sounding music.

What have we learned from this journey?

Reporting science is challenging, even for skeptics. None of the authors discussed above (Setiya, Engber, or Machery) are trying to sell us on this research, and none of them have a vested interest in making overblown claims. Indeed, I think it would be fair to describe Setiya and Engber as skeptics in this discussion. But even skeptics can get lost in the details. We all have a natural desire to smooth over the details and go for the bigger story. But this is tricky when the bigger story, whatever it is, depends on details that we don’t fully understand. Presumably our understanding in 2018 of affective influences on these survey responses should not depend on exactly how an experiment was done in 2006, but the descriptions of the effects are framed in terms of that 2006 study, and with each lab’s experiment measuring something a bit different, I find it very difficult to put everything together.

This relates to the problem we discussed the other day, of psychology textbooks putting a complacent spin on the research in their field. The desire for a smooth and coherent story gets in the way of the real-world complexity that motivates this research in the first place.

There’s also another point that Engber emphasizes, which is the difference between a response to a hypothetical question, and an action in the external world. Paradoxically, one reason why I can accept that various irrelevant interventions (for example, watching a comedy show or a documentary film) could have a large effect on the response to the trolley question is that this response is not something that most people have thought about before. In contrast, I found similar claims involving political attitudes and voting (for example, the idea that 20% of women change their presidential preference depending on time of the month) to be ridiculous, in part because most people already have settled political views. But then, if the only reason we find the trolley claims plausible is that people aren’t answering them thoughtfully, then we’re really only learning about people’s quick reactions, not their deeper views. Quick reactions are important too; we should just be clear if that’s what we’re studying.

P.S. Edouard Machery and Nina Strohminger offered useful comments that influenced what I wrote above.

33 thoughts on “Pushing the guy in front of the trolley”

    • Thanks for the link to this video. I found myself riveted to watching it, and at the same time, almost physically ill. But I learned almost nothing about the basic question – and a fair amount about researchers. The number of forked paths in experimental design (and if you watch the video, you may appreciate the pun) is staggering. And, the researchers apparently never heard of selection bias – they “pre-screened” participants to avoid people who might be damaged by the experiment. And, they took away cell phones. And the relationship between the participants and the action was technologically mediated. And, the sample size was tiny. And, the experimental conditions varied from one subject to another. And and and and

      Then there is the question of whether we really could learn anything meaningful from the experiment to begin with. Would the answer give us any useful information about how to design self-driving vehicles?

  1. In combat I would think that these sorts of trolley-decisions are made all the time. Has anyone studied that? How about officer training? Are officers taught anything relevant about these tradeoffs?

    • A very good point, Terry.

      I find the trolley problem too silly to think seriously about, a position expressed in this comic:
      https://existentialcomics.com/comic/106

      Decisions about choices like these are made in the heat of the moment, not sitting in a chair, safely in a campus room. Training can certainly help — like Sully having all those hours in a flight simulator helped him decide to land that airliner in the Hudson and be successful doing it.

      The people who throw themselves in front of a shooter, for example
      https://www.cnn.com/2019/05/14/us/joshua-jones-stem-school-shooting-survivor/index.html
      don’t seem to have done complicated reasoning. From the article:

      The 18-year-old was forced to make a decision he hopes no one else will ever have to make, he told reporters on Tuesday. He was shot in his left calf and thigh during the attack.
      In the moment, he said, he didn’t consciously decide to run toward danger.
      “Adrenaline and tunnel vision are crazy things,” Jones said, with his parents sitting by his side. “You get what you’re doing done, and then later you realize what’s happened.”
      And he didn’t think of himself as a hero, he said.
      “You never expect to make that choice at any point in your life.”

      This is reminiscent of William James’s theory of emotion:

      “In 1884, James published a seminal paper titled What is An Emotion in the philosophy journal Mind …. In this paper, he reasoned that human emotion followed a sequence of events beginning with an arousing stimulus (i.e., physiological arousal linked to the sympathetic and parasympathetic nervous system) which then triggered the corresponding emotion. In other words, do we run from a bear because we are afraid or are we afraid because we are running from the bear? While the commonplace assumption is that the bear is the source of our fear, James argued that this commonsense interpretation is wrong. It was James’ contention that bodily changes result from the perception of the “exciting fact” which in term leads to the psychological sensation called emotion.” https://drvitelli.typepad.com/providentia/2007/05/william_james_a.html

      i.e. all students are aroused by the presence of the gunman. Some rush the gunman, some do not. Do we believe that those who rush, or not rush, would differ in their answers to the trolley problem if it had been presented to them beforehand? We’ll have to leave this as a thought experiment, but I don’t think so.

      In terms of training, we might consider the bicycling maneuver of the Emergency Quick Turn.
      http://www.bamacyclist.com/articles/QuickTurn.htm
      “You must steer the handlebars in the opposite direction you want to turn. That means steering the bike toward the danger first! … This maneuver is not an intuitive phenomenon. You have to practice it over and over until it becomes part of your normal riding habits. … When you are faced with this situation it’s too late to try to remember how to do it correctly, you have to react quickly and do it automatically!”

      This is completely correct. Reading about it, or passing a test on it is one thing. Having it be your intuitive reaction is another. So sitting in a seminar room discussing the trolley problem, or dryly learning about the rules of engagement, isn’t going to help much.

      • “In 1884, James published a seminal paper titled What is An Emotion in the philosophy journal Mind …. In this paper, he reasoned that human emotion followed a sequence of events beginning with an arousing stimulus (i.e., physiological arousal linked to the sympathetic and parasympathetic nervous system) which then triggered the corresponding emotion. In other words, do we run from a bear because we are afraid or are we afraid because we are running from the bear? …”

        I have become skeptical of almost any explanation of human emotion and behavior that is not connected rather directly to evolution (or at least strongly consistent with evolution).

      • When you think about it, it is pretty perplexing that the trolley researchers have not studied combat behavior. We have millennia of knowledge about combat behavior. Why don’t we just ask experienced veteran commanders how they make life-and-death decisions? Millions of soldiers have been sent to face cannonballs. Why are we making up hokey questions for undergrads instead?

        • Good point — although one needs to keep in mind that experienced veteran commanders are not representative of the population at large, so their experiences/actions cannot be generalized to other populations.

  2. I find this type of experimental philosophy misguided. For one, most people are not trained to think like philosophers. If you tell them to assume that throwing the switch will save the lives of the five but kill one, they simply won’t play along. They may think to themselves, “I could probably scream to the one guy to get off of the track” or “do I really know that the trolley will switch tracks? Maybe the sudden switch will make the trolley come off the track and kill everyone on board.” They just won’t be disciplined deductive thinkers. It takes training to think that way. We cannot infer from the responses an actual philosophical view on the question. Second, even the philosophers you quote seem to misunderstand the problem. They keep calling the response to switch tracks “utilitarian,” but it has nothing to do with utilitarianism. A deontologist could reason that she has a duty to save lives, and that such duty can trump the duty not to take a life when the number of lives saved is greater than those killed. Most deontologists aren’t pacifists after all. Likewise, a rule utilitarian could reason that killing to save lives is a bad rule that in general will lead to lower levels of utility. (Some utilitarians are pacifists.) Even an act-utilitarian is going to have some conditions under which he does not have to act. The way the story is told, it sounds costless to act, but one can imagine all sorts of costs associated with pulling the lever (like being put on trial for murder) that might weigh against action even for the act-utilitarian. Thus, even if you did this experiment on philosophy professors, I don’t think you would be able to tell their philosophical views from their responses.

    • It’s a little like those bad psychometric scales which are worded to try and cue the research subject into some subtlety or another of the construct the researcher wishes to measure. Or in measuring behavior, sometimes a survey will cue the subject with picky little distinctions between one type of behavior (for example “moderate physical activity”) and another related behavior (like “vigorous physical activity”) when, to the person taking the survey, no such distinction exists.

      The hubris of thinking you can somehow induce a research participant into thinking like the researcher is universal across fields, not specific to experimental philosophy.

    • A very good point in the background of your comment that I would like to bring to the fore: Some unknown number of people may respond to the trolley problem based on what they think “the law” says (i.e., could they be charged with murder or sued for wrongful death for doing certain things). And there’s nothing wrong with that! But it’s not exactly “ethics.” As in so many of these puzzles, it’s unclear what question people are really answering.

      • > Some unknown number of people may respond to the trolley problem based on what they think “the law” says […] But it’s not exactly “ethics.”

        I think Kant would disagree.

      • Exactly, they could have all sorts of reasons for deciding to pull the lever or not. If you have ever taught freshmen philosophy, they don’t get “Assume the following . . .” They keep challenging the assumptions. They change the ethical scenario in their head and then comment on that. People do it even when they are in classes designed to teach them deductive reasoning. How can we assume a bunch of study participants are going to accept the scenario as given in a study?

      • Kyle said,
        “As in so many of these puzzles, it’s unclear what question people are really answering.”

        +1

        (And this applies also to many other social science experiments, where questions can be interpreted differently by different people.)

    • When I was teaching I taught a course on decision theory to general students in the Honors College program (two different universities). On the first class day I handed out a “quiz” with a bunch of multiple choice answers to a number of questions. One of the questions was the trolley problem.

      Or, more correctly, two of the questions were the trolley problem, one being the “pull the switch” version and the other the “push the fat man off the bridge” version. Except that the left side of the class got one form of the question (and several others designed to illustrate different issues), and the right side got the other form. Most of the questions were identical on both sides of the class except for the 3 or 4 written especially to illustrate the issues that happen when supposedly identical dilemmas are expressed in different words.

      So after the students were finished with the “quiz” we went through them one by one. I would ask them to raise their hand if they answered A, then if they answered B (for example; usually there were only two choices). When we get to the trolley question, half the class raises their hands for choice A, then half raises it for choice B, and since the hands are going up almost exclusively on one or the other side of the seminar table, the students see the difference and wonder what’s going on. Then I have a student on the left side read the question that she has, and then a student on the right, and they see that although it appears that the questions are asking the same moral question (in this case), the different situations have produced different outcomes. And then we discuss the reasons for this, amongst them (in this question) being that different parts of the brain are being activated in the two different versions of the trolley problem.

      Other examples include expressing the same outcome in terms of gains instead of loss (e.g., so many soldiers are saved out of 100, or in the other form so many soldiers are lost out of 100)…which brings in some ideas of behavioral economics, and so on. I found this little quiz to be a big help in getting the class used to active discussions, which for this seminar-style course was exactly what I wanted.

  3. “Pastötter et al. had no comedy at all and found no main effect; rather, as reported by Machery, they found an effect whose sign depended on framing (whether the question was asked as, “Do you think it is appropriate to be active and push the man?” or “Do you think it is appropriate to be passive and not push the man?””

    This seems like a really strange way to ask, if this is all they asked. I think it is ‘appropriate’ to push the man, and also ‘appropriate’ not to push the man. I would answer ‘yes’ to both questions. Obviously I think one of them is better than the other, but that’s not what the question asks.

  4. Unless they actually had to flip a switch and killed something in the experiment, I am not sure how meaningful these findings, even if true, are.

    On the other hand, how about a dubious finding of huge implication: WSJ reported that reducing school bus emissions improves student academic performance and is a lot more cost effective than investing in teachers too!
    https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwi0qpaZp7LiAhUFDq0KHVLSCl0QFjAAegQIBBAB&url=https%3A%2F%2Fwww.wsj.com%2Farticles%2Fthe-surprising-academic-impact-of-reducing-school-bus-emissions-11558471990&usg=AOvVaw1-EQajrNHb0fX1CY7Cvq-n

  5. Is there not a vast overlap in the spaces of “ethical questions where responses from people are noisy (conditioned on small changes in context, language, or scenario)” and of “ethical questions that are interesting to philosophers?”

    • +1

      Or, from another perspective: “ethical questions that are interesting to philosophers” are often not well-defined. (I think this is what so often frustrated me about them. My mind kept saying, “It depends …”)

  6. “After five minutes of watching Saturday Night Live, Americans are three times more likely to agree with the Tibetan monks that it is permissible to push someone in front of a speeding train carriage in order to save five”

    After five minutes of watching contemporary Saturday Night Live, I’d be tempted to throw myself in front of a speeding train carriage in order to save myself from having to watch five more minutes.

  7. Hi Andrew,

    Yes, this was way before adoption of best practices in reporting, but let me see if I can clear up at least our part. Footnote 1 means (as well as I can remember) that participants always completed the footbridge and trolley dilemmas prior to completing other distractors (which were then presented in a random order). We did it this way, as our primary interest (if only we had preregistration back then) was in examining differences to these two dilemmas, and the other distractors weren’t important, but we wanted them for cover. So of the 79 people, they each got the footbridge and trolley dilemmas in a random order and then moved on to others in a random order. Again, footnote 1 was meant to imply that considering these “downer” dilemmas would quickly begin to dissipate positive mood, hence our primary dv’s came first.

    There were 79 people in total (38 and 41 in the two conditions) as noted in the table, so I’m not sure where the 77 issue comes from.

    At the time, this study was meant to provide an experimental test of Josh Greene’s correlational imaging findings. That is, Greene had argued that “emotion” centers in the brain (it was a long time ago, in imaging terms) were more active in the footbridge dilemma and thus emotions were more responsible for changing decisions here. We wanted to examine this hypothesis by actually manipulating emotional states.

    David DeSteno

  8. Oh, sorry. I see in footnote 3 there were only 77 responses. My memory (which could be flawed, and at the time Psychological Science was only printing minimal methods info for short reports) was that we had a generous “time-out” set for each decision. So these two “no responses” likely occurred because participants didn’t respond in the time window allowed. But I can’t be sure.

    • The way I heard it formulated, the man being pushed in front of the trolley is very fat, fat enough to stop the trolley; but you are not fat enough to stop the trolley so committing suicide doesn’t resolve the problem (6 die instead of just 1).

      • Bill:

        Ahhh, philosophical problems for thin people! I hadn’t thought about the interaction between the question and the physical characteristics of the person being asked. I’m somehow reminded of Nicholas Wade’s aside, “Imagine you, as an English speaker of European descent …” That was a man who knew his audience!
