Pizzagate and Kahneman, two great flavors etc.

[cat picture]

1. The pizzagate story (of Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years”) keeps developing.

Last week someone forwarded me an email from the deputy dean of the Cornell business school regarding concerns about some of Wansink’s work. This person asked me to post the letter (which he assured me “was written with the full expectation that it would end up being shared”) but I wasn’t so interested in this institutional angle so I passed it along to Retraction Watch, along with links to Wansink’s contrite note and a new post by Jordan Anaya listing some newly-discovered errors in yet another paper by Wansink.

Since then, Retraction Watch has run an interview with Wansink, in which the world-renowned eating behavior expert continued with a mixture of contrition and evasion, along with insights into his workflow, for example this:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this? Second, who eats 12 pieces of pizza? I guess they must be really small pieces!

Wansink also pulls one out of the Bargh/Baumeister/Cuddy playbook:

Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results.

Ummm . . . I’ll believe it when I see the evidence. And not before.

In our struggle to understand Wansink’s mode of operation, I think we should start from the position that he’s not trying to cheat; rather, he just doesn’t know what he’s doing. Think of it this way: it’s possible that he doesn’t write the papers that get published, he doesn’t produce the tables with all the errors, he doesn’t analyze the data, maybe he doesn’t even collect the data. I have no idea who was out there passing out survey forms in the pizza restaurant—maybe some research assistants? He doesn’t design the survey forms—that’s how it is that he just realized that they asked that bizarre 0-to-12-pieces-of-pizza question. Also he’s completely out of the loop on statistics. When it comes to stats, this guy makes Satoshi Kanazawa look like Uri Simonsohn. That explains why his response to questions about p-hacking or harking was, “Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it.”

What Wansink has been doing for several years is organizing studies, making sure they get published, and doing massive publicity. For years and years and years, he’s been receiving almost nothing but positive feedback. (Yes, five years ago someone informed his lab of serious, embarrassing flaws in one of his papers, but apparently that inquiry was handled by one of his postdocs. So maybe the postdoc never informed Wansink of the problem, or maybe Wansink just thought this was a one-off in his lab, somebody else’s problem, and ignored it.)

When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes more clear. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.

Here’s an example of how this framing can help our understanding:

Someone emailed this question to me regarding that original “failed study” that got the whole ball rolling:

I’m still sort of surprised that they weren’t able to p-hack the original hypothesis, which was presumably some correlate with the price paid (either perceived quality, or amount eaten, or time spent eating, or # trips to the bathroom, or …).

My response:

I suspect the answer is that Wansink was not “p-hacking” or trying to game the system. My guess is that he’s legitimately using these studies to inform his thinking–that is, he forms many of his hypotheses and conclusions based on his data. So when he was expecting to see X, but he didn’t see X, he learned something! (Or thought he learned something; given the noise level in his experiments, it might be that his original hypothesis happened to be true, irony of ironies.) Sure, if he’d seen X at p=0.06, I expect he would’ve been able to find a way to get statistical significance, but when X didn’t show up at all, he saw it as a failed study. So, from Wansink’s point of view, the later work by the student really did have value in that they learned something new from their data.

I really don’t like the “p-hacking” frame because it “gamifies” the process in a way that I don’t think is always appropriate. I prefer the “forking paths” analogy: Wansink and his students went down one path that led nowhere, then they tried other paths.
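
To see why I prefer that framing, here's a minimal simulation sketch (an invented setup, nothing to do with Wansink's actual data): analyze a pile of noise-only outcome measures and write up whichever comparison happens to look interesting. No single analysis is "hacked," but the path taken depends on the data.

# A minimal sketch (invented setup, not Wansink's actual data): simulate a
# null experiment with ten outcome measures and report whichever one "works."
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 2000, 30, 10
hits = 0

for _ in range(n_sims):
    treat = rng.normal(size=(n_per_group, n_outcomes))    # no true effect anywhere
    control = rng.normal(size=(n_per_group, n_outcomes))
    pvals = [stats.ttest_ind(treat[:, j], control[:, j]).pvalue
             for j in range(n_outcomes)]
    if min(pvals) < 0.05:   # write up whichever outcome happened to look interesting
        hits += 1

print(f"Share of pure-noise studies with at least one p < 0.05: {hits / n_sims:.2f}")
# roughly 0.4, versus the nominal 0.05 for a single pre-specified analysis

With ten noise-only outcomes, roughly 40 percent of such studies turn up at least one nominally significant comparison, which is the forking-paths problem in miniature.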

2. People keep pointing me to a recent statement by Daniel Kahneman in a comment on a blog by Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, who wrote that the “priming research” of Bargh and others that was featured in Kahneman’s book “is a train wreck” and should not be considered “as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.” Here’s Kahneman:

I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited.

What the blog gets absolutely right is that I placed too much faith in underpowered studies. As pointed out in the blog, and earlier by Andrew Gelman, there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples. We also cited Overall (1969) for showing “that the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.” Our article was written in 1969 and published in 1971, but I failed to internalize its message.

My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.

I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.

I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.
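
Kahneman's point that unanimity of underpowered studies is itself evidence of a problem is easy to make concrete with a back-of-the-envelope calculation (a sketch with made-up numbers, not an analysis of the actual priming literature):

# Back-of-the-envelope sketch; the effect size and sample size below are
# assumptions for illustration, not estimates from the priming literature.
from scipy import stats

d, n = 0.4, 20                      # assumed true effect, per-group sample size
se = (2 / n) ** 0.5                 # standard error of the standardized difference
crit = stats.norm.ppf(0.975) * se   # two-sided 5% significance cutoff
power = (1 - stats.norm.cdf(crit, loc=d, scale=se)
         + stats.norm.cdf(-crit, loc=d, scale=se))
print(f"Power of a single study: {power:.2f}")                     # about 0.24
print(f"Chance that 20 such studies are all significant: {power**20:.0e}")  # ~6e-13

At roughly one-in-four power per study, a literature in which every published result is significant is telling you about the file drawer, not about the effect.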

Following up on Kahneman’s remarks, neuroscientist Jeff Bowers added:

There is another reason to be sceptical of many of the social priming studies. You [Kahneman] wrote:

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions.

However, there is an important constraint on subliminal priming that needs to be taken into account. That is, these effects are very short-lived, on the order of seconds. So any claim that a masked prime affects behavior for an extended period of time seems at odds with these more basic findings. Perhaps social priming is more powerful than basic cognitive findings, but it does raise questions. Here is a link to an old paper showing that masked *repetition* priming is short-lived. Presumably semantic effects will be even more transient.

And psychologist Hal Pashler followed up:

One might ask if this is something about repetition priming, but associative semantic priming is also fleeting. In our JEP:G paper failing to replicate money priming we noted:

For example, Becker, Moscovitch, Behrmann, and Joordens (1997) found that lexical decision priming effects disappeared if the prime and target were separated by more than 15 seconds, and similar findings were reported by Meyer, Schvaneveldt, and Ruddy (1972). In brief, classic priming effects are small and transient even if the prime and measure are strongly associated (e.g., NURSE-DOCTOR), whereas money priming effects are [purportedly] large and relatively long-lasting even when the prime and measure are seemingly unrelated (e.g., a sentence related to money and the desire to be alone).

Kahneman’s statement is stunning because it seems so difficult for people to admit their mistakes, and in this case he’s not just saying he got the specifics wrong; he’s pointing to a systematic error in his ways of thinking.

You don’t have to be Thomas S. Kuhn to know that you can learn more from failure than success, and that a key way forward is to push push push to understand anomalies. Not to sweep them under the rug but to face them head-on.

3. Now return to Wansink. He’s in a tough situation. His career is based on publicity, and now he has bad publicity. And there’s no easy solution for him, as once he starts to recognize problems with his research methods, the whole edifice collapses. Similarly for Baumeister, Bargh, Cuddy, etc. The cost of admitting error is so high that they’ll go to great lengths to avoid facing the problems in their research.

It’s easier for Kahneman to admit his errors because, yes, this does suggest that some of the ideas behind “heuristics and biases” or “behavioral economics” have been overextended (yes, I’m looking at you, claims of voting and political attitudes being swayed by shark attacks, college football, and subliminal smiley faces), but his core work with Tversky is not threatened. Similarly, I can make no-excuses corrections of my paper that was wrong because of our data coding error, and my other paper with the false theorem.

P.S. Hey! I just realized that the above examples illustrate two of Clarke’s three laws.

44 thoughts on “Pizzagate and Kahneman, two great flavors etc.”

  1. Kahneman’s comment has a kind of beauty to it in its forthrightness and willingness to acknowledge wrongdoing. The kind of thing you see with very good people in your personal life sometimes, but basically never (in my experience) in academia. That, plus your reference to Kuhn, makes this one of my favourite posts of the year. And I thought I was here mainly for the take-downs of rubbish research and practical statistical advice for my own research!

    • JB said,
      “The kind of thing you see with very good people in your personal life sometimes, but basically never (in my experience) in academia.”

      My experience in academia (especially in math) has been different. A couple of examples:

      1. When I was a graduate student, a professor stated an “If and only if” theorem, proceeded to give the proof, got through the “if” part, but got stuck in the “only if” part. In fairly short order, I came up with a counterexample of the “only if” part, so raised my hand (despite my extreme shyness at that time) and told him the counterexample. It was toward the end of class, so he quit early. He came up to me later at tea, and thanked me for giving him the counterexample, saying that if I hadn’t brought it up, he would have made a fool of himself spending the rest of the class time trying to prove it.

      2. A few years later, I was reading a book by another professor I had in graduate school. I realized that there was an error in it. So I wrote to the author pointing out the error. In short order, I got a reply from him starting, “Oh, is my face red!”

  2. This quote really stands out to me:

    When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes more clear. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.

    When I consider this perspective, I say to myself: I can’t expect the media to change what they do (chase readership), and students are more likely to model themselves on senior faculty, not shape them. But I’m an editor, and an (increasingly!) senior faculty member, with increasing but still very limited ability to shape research norms and institutional actions at my school.

    What can people like me do to encourage better research?

    It’s easy to say “editors should accept good papers and reject bad ones”, but doing it is not so easy. Should we be devoting more resources to peer review–but how? Is it enough to be quick and definitive with corrections and retractions after problems come to light with published papers? That is painful but frankly seems easier than improving peer review itself. Or is it more a matter of publishing replications?

    It’s harder to even say what senior faculty should do to improve research at their schools. I think there is good reason to let research communities police themselves as much as possible, rather than having that done by researchers from other communities (cf. the physicist problem), much less outsourcing the task to administrators. But communities don’t always do this effectively, so what is an outside-the-field but inside-the-school colleague to do?

    • Here are some suggestions of things people like you can do that might help, or might give you better ideas:

      1. Support Retraction Watch financially (See the link from the right hand column of their home page.)

      2. Look over the Retraction Watch site periodically for articles that might be worth sharing with colleagues or students or discussing in a journal club. (Or maybe start a Retraction Watch club that meets once a month to discuss an article from Retraction Watch, with rotating members leading the discussion.)

      3. The Retraction Watch article http://www.cdnsciencepub.com/blog/how-to-write-a-scientific-peer-review-a-guide-for-the-new-reviewer.aspx might be suitable to refer reviewers to — or might give ideas for a similar document that would be more appropriate for your field.

        • RetractionWatch is great, but I think the true long-term solution is what Prof Gelman advocates, to normalize post-publication review. The problem with retractions–and why they are just the tip of the iceberg–is that there is a lot of stigma attached to them, so researchers do not want to retract papers. This is especially true if there is no fraud or plagiarism involved, but it’s just a dumb mistake (data entry, coding, or interpretation) that anyone might make. Instead, by making post-publication review perfectly normal, and taking away the stigma, we can improve science by correcting mistakes without embarrassing researchers. The point is that no study is perfect, and we want to build and improve, not mock and put down.

        Of course, fraud and plagiarism are a different animal, and those researchers *should* be embarrassed and stigmatized.

        • Jack:

          Your point has merit, but I think that my suggestions are still good starts for someone wanting to help improve the situation now. In particular, Suggestion 2 and Suggestion 3 can be done in ways that do not promote stigma, and can help promote the normalization of post-publication review. Also, a suggestion I neglected to mention:

          4. Participate in PubPeer, and encourage students and colleagues to do so. (But make every effort to make criticism constructive and respectful rather than aggressive.)

  3. Was anyone else a little unsatisfied with the Retraction Watch interview? They seemed to let him off the hook pretty easily.

    Take the 150 errors in the pizza papers for example. Wansink explains that the granularity errors can be explained by people skipping survey questions. This is true; that is always a possible explanation for granularity errors, and we mention that in our article (a minimal sketch of what a granularity check looks like appears at the end of this comment).

    BUT GRANULARITY ERRORS WERE ONLY THE TIP OF THE ICEBERG

    In addition to granularity inconsistencies, there are many discrepancies which simply have no plausible explanations. Take Tables 2 and 3 in “Low prices and high regret: how pricing influences regret at all-you-can-eat buffets.” The tables contain the exact same data, or at least are supposed to contain the exact same data. Yet many numbers inexplicably differ; how does someone explain that?

    How do you explain degrees of freedom that are larger than the total sample size?

    How do you explain incorrect cm to inches, kg to pounds, or BMI calculations?

    How do you explain some of the grossly incorrect test statistics?

    Needless to say, if/when they issue their explanation for all 150 errors I’m going to grab some popcorn.
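
    For readers wondering what a granularity check actually does, here is a minimal sketch with made-up numbers (not values from the pizza papers): integer responses from n people can only produce means that are multiples of 1/n, so some reported means are arithmetically impossible for the stated sample size.

    # A minimal GRIM-style consistency check, with made-up numbers (not values
    # from the pizza papers): integer answers from n respondents can only
    # produce means that are multiples of 1/n.
    def grim_consistent(reported_mean, n, decimals=2):
        # Try the integer sums closest to reported_mean * n; if none of them,
        # divided by n and rounded as reported, matches, the mean is impossible.
        target = round(reported_mean, decimals)
        k = round(reported_mean * n)
        return any(round(s / n, decimals) == target for s in (k - 1, k, k + 1))

    print(grim_consistent(2.38, 45))   # True:  107/45 = 2.3778, which rounds to 2.38
    print(grim_consistent(2.39, 45))   # False: no sum of 45 integer answers gives 2.39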

    • Jordan:

      I think the Retraction Watch interview was fine. It’s true that they didn’t pursue all of Wansink’s responses to their logical conclusions, but that’s not what they were trying to do. The interview provided information that is helpful in our understanding of the situation: in particular, it makes it clear that Wansink does not have a handle on his own data and that, even after his contrite statement, he continues to hold to the view that his papers are basically correct, both in methodology and in conclusions.

      I think of Retraction Watch’s “job” as to provide news and to provide a place for discussion. When it comes to the technical back-and-forth, it makes sense for them to link to posts such as yours.

  4. 1. The extent to which Wansink is in trouble is the extent to which the university cares, which I would bet isn’t all that much because his work brings in money and it’s for industry not “science” in the AAAS sense. The general public doesn’t give a hoot. In fact, this would be good publicity in the public eye, because: of all the eating and food related reports, arguments, etc. out there, how many do you think are actually good “science” – meaning investigatory method is more than defensible and is somewhat rigorous and the analysis is at the same level? Outside of biology papers about how this or that metabolizes, maybe none? So in the public eye, if anyone in the public actually paid attention, which is highly doubtful, this would be the best case: that his work has been examined in some measure. In that universe, in that space, there is no rigor. I doubt his industry clients give a hoot because they work with consultants all the time and, speaking as someone who has been in the belly of that beast, in that field the concept of genuine rigor is largely unknown.

    2. Vulcan. To respond regarding DK: the repeated false readings of the existence of Vulcan can be compared to a series of low-powered or noisy studies in which people found signal that wasn’t there. This did, in fact, indicate a truth, but that truth was that the idea of Vulcan was wrong. The idea is simple: by prior experience, this leads to that, so the same must be true here and all we really need to do is peer harder into the noise; but sometimes that noise means this no longer leads to that. Hard to blame people when it turns out you need general relativity to solve that problem. But the issue is actually embedded in the scientific method: we chase the devil into the thin air because the alternative tends toward belief in the supernatural, in witches or in illogical associative leaps that end at “the Jews” or the “Tri-Lateral Commission”. That is, to pick your favorite, everyone knows that if someone looks menacing, then you may move away because sometimes it’s really obvious that a power pose actually conveys a threat or some other information. We could assume power poses all the time, like a bunch of posturing apes – and we really do, don’t we? – or we could investigate it using the scientific method … except investigating this kind of thing moves quickly into the thin air where you think you detect a signal this time but not that time. I sometimes think about the descriptions of barbarians meeting the Romans. The barbarians would typically put on extensive displays of bravado and then be mowed down by the disciplined legionaries, a direct failure of power pose in one sense but also the evolution of power pose away from hooting and hollering toward a different power pose in which efficiency of formation conveys to the lesser organized a sense of real impending “we’re gonna die”. This isn’t a digression: while you could do really nice analysis of how we convey power at various levels from individual dress and speech to armies, etc., the scientific method fails when you apply it to specific situations. Believe it or not, Wansink kind of gets at the explanation for this: you face so many decisions in a day, so many choices that involve food, that you can’t be consistent. This also expresses in the weird studies about willpower, including the idiotic ones that treat it as a quantity which erodes over time like some sort of isotope with a 12-hour half-life. Lack of consistency is built in, so DK is right that repeated investigations prove something; they prove a lack of consistency. That expresses obviously in low-power, noisy situations because each occasion is a separate trial which embodies many levels of confounding factors which don’t express in the same order with the same weighting each time.

    • …of all the eating and food related reports, arguments, etc. out there, how many do you think are actually good “science” – meaning investigatory method is more than defensible and is somewhat rigorous and the analysis is at the same level? Outside of biology papers about how this or that metabolizes, maybe none?

      This is why I worry about the SMBC “pet physicist” linked to above. Only studies of metabolism can be “good science”? Yikes! Of course, after universities boot the social scientists, we’ll find that the chemists will feel the same way about the biologists, so scratch them too, and so on. I don’t see this as a very fruitful path.

    • Jonathan:

      1. You write, “The extent to which Wansink is in trouble is the extent to which the university cares, which I would bet isn’t all that much because his work brings in money and it’s for industry not ‘science’ in the AAAS sense. The general public doesn’t give a hoot.”

      To which I reply: but it is “science” in the PPNAS sense!

      More seriously: (a) I think there are people at the university who care (consider that letter sent by the dean); (b) It’s not clear that industry will want to keep paying Wansink if his research is discredited. Someone at these companies has to sign the checks, right? (c) The general public might not care much about “science in the AAAS sense” but public attitudes are modulated through the news media. If enough bad news comes out, at some point even NPR and TED aren’t going to be promoting this work, book publishers are going to be wary, etc.

      To put it another way: I agree with you that quantitative science is only a small part of Wansink’s contributions, and that not much is changed if you take the quant studies away while keeping all the qualitative ideas, all the industry connections, the Ivy League professorship, and all the fame and publicity. But . . . I suspect much of all those other things flows from the perception that the quantitative research is strong. Take away the hundreds of peer-reviewed publications, and there’s not much motivation for everyone else to play along.

      2. Your Vulcan discussion is interesting.

      • Related to how much the university cares about all this, note that Cornell’s three (!) business schools are in the middle of a merger of sorts into a single College of Business.

        Wansink is in one of the lower-status business schools (i.e. not Johnson Graduate School of Management), so I can imagine that colleagues in Johnson may take this as further evidence that the merger is diluting their brand and their standards for faculty hires/promotions and graduate students.

        • Anon:

          Wow, 3 business schools! That’s pretty funny. It’s like one of those hotels that has a fancy restaurant, a casual dining spot, and a bar. If Cornell merges the three b-schools, won’t they lose some of their product differentiation?

    • > embedded in the scientific method: we chase the devil into the thin air because the alternative tends toward belief in the supernatural
      Well I guess if the scientific method is given up on, other methods of pseudo-inquiry will take over the discourse.

      In the end the only justification we have for the scientific method is hope – we hope it will get us closer to the _truth_.
      (Peirce’s metaphor for this was a military commander who, learning he was surrounded by a formidable enemy, pulls out his revolver and shoots himself in the head.)

    • Nick:

      Wow. The incentives to misrepresent the results of a study are really strong. Think of all the research that never made it into JAMA Pediatrics because the authors weren’t willing to cheat. Perhaps we could get Susan Fiske interested in this. Entire careers could’ve been ruined by the eagerness of top journals such as JPSP, Psych Sci, PPNAS, and JAMA Pediatrics to publish papers that fall somewhere on the spectrum from fraud to complete crap.

    • Nick, I read your blog post. Is this approaching the area of research fraud now? Could they be using preliterate children from a nursery and presenting them as 8-11 yr olds? How can one account for the differences in the figures? The data must be made available; surely the prestigious journal can chivvy Wansink along, or the funding agency that supported this work.

      A more general question: If you manipulate the data to the extent that you can get radically different result than you would get otherwise, is that research misconduct calling for serious disciplinary action? Or is that a questionable research practice, meaning wink wink nothing to see here, let’s move on?

  5. Given that Kahneman’s book is all about “Thinking Fast and Slow,” I love this bit from his response on the R-Index blog: “I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through.” System 1 strikes again.

  6. I see and accept that Kahneman is admitting to his errors in a forthright way. It puzzles me, though, that he continues to say that “belief in well-supported scientific conclusions is not optional.” This is profoundly wrong. I’m surprised he hasn’t questioned himself on this. Hasn’t he read “No Trump”?

    I thought a bit about why I consider his assertion flat-out (and deep-down) wrong. Here’s what I have come up with so far.

    I think he’s aiming at something potentially reasonable but doesn’t put it quite right. Here’s a tentative reframing of the idea:

    Belief is always optional. That’s part of what makes it belief. You are presented with something, and you choose whether or not to believe it. You can also choose whether or not to be rational about things.

    However, scientific discussion is both logical and empirical; it requires the willingness and ability to advance knowledge and thinking about a topic. Thus, if there is overwhelming evidence–and solid argumentation–in favor of a given hypothesis or theory, any alternative theory must address this. It is not enough to say, “What do we know?” or “There’s conflicting evidence out there.” We must contend with the evidence and logic at hand.

    That’s markedly different from the statement that “belief in well-supported scientific conclusions is not optional”–but I think it may get at his underlying intent. The wording makes a difference.

    • Actually, belief in the existence of any meaningful notion of “optional” is in a centuries-long decline.

      Even reactionaries like me who accept “free will” are not necessarily comfortable with the notion that “belief” is a matter of choice. It is quite complicated, and depends on the meaning of the word. Beliefs are mutable, but still not really chosen. Try this: you do not really have the ability to choose to believe, say, that you will be able to stand on water if you step out of a boat. You can profess it, you can even act as if you believed it. But that is not quite the same as believing it.

      • You are right. It is complicated, and it does depend on what you mean by “belief.” If, drawing on Dan Sperber and others, I distinguish between intuitive and reflective belief, I have not made things simpler. Even reflective belief is not entirely willed, though it involves a complex rational process with choices along the way.

        So then the question becomes: What does Kahneman mean by “belief in well-supported scientific conclusions”? I doubt he means “instinctive trust in the veracity” of said conclusions. I suspect, rather, that he is formulating the statement in an ambiguous and (unintentionally) misleading way. He seems to use “belief in” to mean “rational acceptance of.”

        It seems to me that he means, “You can decide whether to accept the conclusions of the research, but when the evidence and argumentation overwhelmingly favor a proposition, there is only one viable choice: to accept it as true.” If that is so, he is affirming and negating choice at the same time.

        So I would reword my original rewording to say, “We may at any point choose whether or not to *accept* the conclusions of research–that is, whether to treat them as true or to subject them to further questioning. However, in the context of scientific discussion, we must be prepared to address the existing body of evidence and argumentation.”

        An example: Apparently many people still “believe” in power poses. But now that scholars have pointed out serious flaws in the original power pose study, and now that several replications have failed, a participant in scientific discussion of the topic must address these developments instead of evading them–*if* he or she wishes to be taken seriously and advance the discussion.

        In any case, it seems that Kahneman has misformulated his point.

  7. Hi Andrew

    I really appreciate your work in this area.

    I was wondering about your comment at the end where you had a correction in your own work due to a data coding error. The correction was that it should be assumed that all of Section 3 is wrong until proven otherwise. Yet Section 3 is the whole empirical section of the paper. If all of the results are unreliable, I was curious about the reasoning behind posting this statement as a correction rather than a retraction, and why you didn’t present a re-analysis of the data after fixing the coding error. As I’m interested in the topic, I was left wondering what the study did find that can be relied on.

    • Paul:

      1. The paper has four sections. I thought sections 1, 2, and 4 were ok (with some of section 4 changing after discarding section 3). I don’t really care if something’s called a correction or a retraction. I could’ve just as well said that we “retract” section 3.

      2. The student who found the error was starting a re-analysis, also looking at data from other years, and then he graduated and moved on to other things. So I thought it would be best to just issue the correction (or, if you prefer, the retraction) of the empirical section of the paper. I’d be interested in returning to the project at some point but it would take work to do that and it hasn’t been my highest priority.

  8. ‘Second, who eats 12 pieces of pizza? I guess they must be really small pieces!’
    They may be rare on Columbia’s campus, but millions of people in America regularly eat *at least* that much at a single meal…

    • Jack, here is the exact quote from Dr. Wansink in the RW interview:
      “Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.”

      So they just happened to accidentally design a study in which they measured the same thing in two different and likely mutually contradictory ways. As Dr. Wansink goes on to say, “It was just a silly way to ask the question.” But hey, I bet researchers at Los Alamos and CERN make this kind of silly mistake all the time.

      After all, asking people how many slices of pizza they just ate is a pretty tough measurement problem. I would certainly have no idea how to get that right. I’m glad we have Ivy League schools working on this vital question, because obviously you couldn’t get a third-grader to solve it.

      And, better still, “the coauthors themselves have since looked at these” (whatever “these” means). After all, who better to clear up inconsistencies than the original coauthors, when independent people might make mistakes because they don’t understand all of the subtleties of the complex methods that were used. OK? Glad we agree! Thanks for the terrific discussion, Jack! Now let’s move on!

      • Nick:

        You write, “I’m glad we have Ivy League schools working on this vital question, because obviously you couldn’t get a third-grader to solve it.”

        Are you kidding? The IRB would throw a fit if you were to hire a third grader to do your survey designs. Child labor and all that.

        • Andrew, check out the article that is linked to from the February 16 Retraction Watch post. (You can either go to the question from Alison McCook, then to Jordan Anaya’s blog post, and then to the article, or you can go to the abstract here https://www.ncbi.nlm.nih.gov/pubmed/22846502.)

          Author #4 on this study is listed as Matthew Z. Klinger, whose affiliation is “Half Hollow Hills High School East, 50 Vanderbilt Parkway, Dix Hills, NY 11746, USA”.

          The abstract states, “The scalability of this is underscored by the success of Study 2, which was implemented and executed for negligible cost by a high school student volunteer”. The Methods section of Study 2 states, “To investigate the ease of implementation and potential scalability of this method, a high school student was recruited to conduct the study. He received school credit for his work.” The Discussion section describes him as a “sophomore” (which I understand corresponds to the 10th grade, ages 15-16), who “executed the study at a negligible cost” (although they declare that the Study was “funded by the United States Department of Agriculture’s Food Nutrition Service and Economic Research Center”, so it’s not clear what they needed money for). Of course, the study was conducted “After obtaining approval from Cornell University’s Institutional Review Board”.

          So apparently Cornell’s IRB is just fine with having research on elementary school kids being conducted by minors, over a period of two months and with “40,778 total child-day observations”. No adult supervision of Study 2 is reported, nor is it made clear how Matthew Z. Klinger arranged to visit two school lunchrooms every day(*) for two months to conduct these 40,778 observations while also completing his own high school program (and having his own lunch). I can imagine many PIs who would be glad to have an RA who worked that hard.

          Incidentally, there is an earlier draft of this article at
          http://smarterlunchrooms.org/SoftChalk_Lessons/NC_Module03/87241607-Attractive-Names-Manuscript.pdf. Some of the differences between the draft and the published article are quite interesting, not least the fact that Matthew Z. Klinger was not mentioned as an author in the draft. I guess they figured that recording 40,778 data points probably merited authorship.

          (*) The draft manuscript also makes it clear that the synchronisation of the menus across the schools was because “These two schools are serviced by the same food preparation facility and had nearly identical menus. Thus, the vegetable side dishes offered in one school on a particular day will also be offered in the other school on the same day”. That means that all of the observations in both schools were performed on the same days.

        • Nick:

          Not to go all “free range kids” on you, but I don’t think it’s so horrible for a young high school student to do this sort of research project. I agree that it would be uncool not to have included the student as a coauthor on the article, after all the work he did.

        • Nick: The high school student received some kind of credit from his school for working on this study. Perhaps it substituted for a required term paper, a required science project, or some such. We don’t know how much supervision he did or did not receive, and there is not much description of what he actually did. He may have done enough to justify fourth authorship, who knows? I think educational experiences like this are good for high school students.

          I don’t think that he went to two school lunchrooms every day for two months to record observations. The cash registers were set up to separate selection of a cold vegetable from selection of a hot vegetable from selection of no vegetable. The observations were probably collected directly from the cash registers’ daily accountings.

      • Nick: The sarcasm/snark directed at Jack was not necessary. Also, there *is* such a format for rating scales (a line anchored at each end with a number but no other numbers). This is called a “visual analog scale” and might be used to assess degree of pain felt by a hospital patient, for example. Sometimes people use more than one rating scale format in a study. It does seem rather silly to use such a scale for counting the number of pizza slices eaten, though.

        • I apologise to Jack if he thought that the exasperated tone of my comment was aimed at him – it wasn’t. I provided the quote in the first paragraph because I guessed that Jack probably hadn’t read the RW post, which makes it clear that the number was indeed 12. Everything else was intended as satire on the general level of communication that has been coming out of Cornell on this. I could, however, have made a better job of that.

  9. Hello Prof. Gelman,
    I am a big fan of your blog and of this topic. I believe that in undergraduate and graduate-level (Masters) courses we are taught how to run various experiments using various software, and there is less focus on what not to do (or what to be cautious about) when doing statistical analyses. Apart from your blog, do you have any recommendations for books or other literature that could improve my analyses? As an aspiring statistician, I want to avoid the kinds of mistakes you point out. One of the books that I bought recently is “How to Lie with Statistics.” Very respectfully and humbly, it would help me if you could point to any resource that could make my analyses airtight.

    Thanks in advance for your help.

  10. Andrew wrote,
    “This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this?”

    I know someone who sometimes says, “I realized …” to mean that she finally got something that other people had been trying to tell her earlier. It’s like what other people say doesn’t register — she has to somehow “see” it for herself.

  11. Very interested in your opinion on Ole Peters’ work on ergodicity economics and the psychology and economics literature on risk (also talked about in Thinking, Fast and Slow). From my understanding, Ole’s work at least warrants a revised interpretation of the results for a lot of that literature.
