John Bohannon’s chocolate-and-weight-loss hoax study actually understates the problems with standard p-value scientific practice

Several people pointed me to this awesome story by John Bohannon:

“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. . . .

My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

How did the study go?

5 men and 11 women showed up, aged 19 to 67. . . . After a round of questionnaires and blood tests to ensure that no one had eating disorders, diabetes, or other illnesses that might endanger them, Frank randomly assigned the subjects to one of three diet groups. One group followed a low-carbohydrate diet. Another followed the same low-carb diet plus a daily 1.5 oz. bar of dark chocolate. And the rest, a control group, were instructed to make no changes to their current diet. They weighed themselves each morning for 21 days, and the study finished with a final round of questionnaires and blood tests.

A sample size of 16 might seem pretty low to you, but remember this, from a couple of years ago in Psychological Science:

[Three screenshots, captured 2015-05-29; images not reproduced in this text version]

So, yeah, these small-N studies are a thing. Bohannon writes, “And almost no one takes studies with fewer than 30 subjects seriously anymore. Editors of reputable journals reject them out of hand before sending them to peer reviewers.” Tell that to Psychological Science!

Bohannon continues:

Onneken then turned to his friend Alex Droste-Haars, a financial analyst, to crunch the numbers. One beer-fueled weekend later and… jackpot! Both of the treatment groups lost about 5 pounds over the course of the study, while the control group’s average body weight fluctuated up and down around zero. But the people on the low-carb diet plus chocolate? They lost weight 10 percent faster. Not only was that difference statistically significant, but the chocolate group had better cholesterol readings and higher scores on the well-being survey.

To me, the conclusion is obvious: Beer has a positive effect on scientific progress! They just need to run an experiment with a no-beer control group, and . . .

Ok, you get the point. But a crappy study is not enough. All sorts of crappy work is done all the time but doesn’t make it into the news. So Bohannon did more:

I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. . . .

The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.

Take a look at the press release I cooked up. It has everything. In reporter lingo: a sexy lede, a clear nut graf, some punchy quotes, and a kicker. And there’s no need to even read the scientific paper because the key details are already boiled down. I took special care to keep it accurate. Rather than tricking journalists, the goal was to lure them with a completely typical press release about a research paper.

It’s even worse than Bohannon says!

I think Bohannon’s stunt is just great and is a wonderful jab at the Ted-talkin, tabloid-runnin statistical significance culture that is associated so much with science today.

My only statistical comment is that Bohannon actually understates the way in which statistical significance can be found via the garden of forking paths.

Bohannon’s understatement comes in a few ways:

1. He writes:

If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. . . .

P(winning) = 1 – (1-p)^n [or, as Ed Wegman would say, 1 – (1-p)*n — ed.]

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05.

That’s all fine, but actually it’s much worse than that, because researchers can, and do, also look at subgroups and interactions. 18 measurements correspond to a lot more than 18 possible tests! I say this because I can already see a researcher saying, “No, we only looked at one outcome variable so this couldn’t happen to us.” But that would be mistaken. As Daryl Bem demonstrated oh-so-eloquently, many, many possible comparisons can come from a single outcome.
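To put rough numbers on this, here is a minimal sketch of the formula quoted above, 1 – (1-p)^n. The function name and the larger test counts are my own illustrative assumptions (54 stands in for a hypothetical 18 outcomes times 3 subgroup choices), and the formula treats the tests as independent, which measurements on the same 15 people are not, so read the output as a ballpark only.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests

# 18 is the number of measurements reported in the story;
# 54 and 100 are hypothetical counts once subgroups and interactions are in play.
for n_tests in (1, 18, 54, 100):
    print(n_tests, round(chance_of_false_positive(n_tests), 3))
```

With 18 tests this reproduces the roughly 60% figure in the quote; the larger counts show how quickly the chance of some "significant" result approaches certainty.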

2. Bohannon then writes:

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works”.

Sure, but it’s not just that. As Eric Loken and I discussed in our recent article, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Even if a researcher only performs a single comparison on his or her data and thus did not do any “fishing” or “fiddling” at all, the garden of forking paths is still a problem, because the particular data analysis that was chosen is typically informed by the data. That is, a researcher will, after looking at the data, choose data-exclusion rules and a data analysis. A unique analysis is done for these data, but the analysis depends on those data. Mathematically this of course is very similar to performing a lot of tests and selecting the ones with good p-values, but it can feel very different.

I always worry, when people write about p-hacking, that they mislead by giving the impression that if a researcher performs only one analysis on his or her data, all is ok.
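A small simulation can make the "only one analysis" point concrete. This is a sketch under assumptions of my own, not anything from Bohannon's study: pure-noise data, a known standard deviation of 1 (so a simple z-test applies), 10 people per cell, and a made-up but plausible-sounding rule in which the researcher reports the pooled comparison when men and women move in the same direction, and otherwise reports the subgroup with the larger effect. Only one p-value is ever reported per dataset.

```python
import math
import random

def z_test(treated, control, sd=1.0):
    """Return (two-sided p-value, mean difference) for a z-test with known sd."""
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    se = sd * math.sqrt(1 / len(treated) + 1 / len(control))
    return math.erfc(abs(diff / se) / math.sqrt(2)), diff

def simulate(n_sims=20000, n_per_cell=10, seed=1):
    rng = random.Random(seed)
    prereg_hits = forking_hits = 0
    for _ in range(n_sims):
        # Pure noise: the treatment has no effect on anyone.
        cells = {(sex, arm): [rng.gauss(0, 1) for _ in range(n_per_cell)]
                 for sex in ("m", "w") for arm in ("treated", "control")}
        p_men, d_men = z_test(cells["m", "treated"], cells["m", "control"])
        p_women, d_women = z_test(cells["w", "treated"], cells["w", "control"])
        p_all, _ = z_test(cells["m", "treated"] + cells["w", "treated"],
                          cells["m", "control"] + cells["w", "control"])
        # Analysis fixed in advance: always the pooled comparison.
        prereg_hits += p_all < 0.05
        # Hypothetical data-dependent rule: one analysis, chosen after looking.
        if d_men * d_women > 0:
            chosen = p_all
        else:
            chosen = p_men if abs(d_men) > abs(d_women) else p_women
        forking_hits += chosen < 0.05
    return prereg_hits / n_sims, forking_hits / n_sims

prereg_rate, forking_rate = simulate()
print("pre-specified analysis:", prereg_rate)
print("single data-dependent analysis:", forking_rate)
```

The pre-specified rate should sit near the nominal 5%, while the data-dependent rate comes out noticeably higher, even though nobody in this simulation ever runs more than one reported test.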

3. Bohannon notes in passing that he excluded one person from his study, and elsewhere he notes that researchers “drop ‘outlier’ data points” in their quest for scientific discovery. But I think he could’ve emphasized this a bit more: researcher degrees of freedom are not just about running lots of tests on your data, they’re also about the flexibility in rules for what data to exclude and how to code your responses. (Marc Hauser is an extreme case here but even with simple survey responses there are coding issues in the very very common setting that a numerical outcome is dichotomized.)
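The dichotomization issue can be sketched in a few lines. Everything here is an illustrative assumption: a 1-to-7 survey scale with uniformly random answers, 50 respondents per group, no true difference, and a two-proportion z-test (normal approximation) applied after splitting the scale into "high" and "low" at one of four cutpoints.

```python
import math
import random

def two_proportion_p(hits_a, hits_b, n):
    """Two-sided p-value (normal approximation) for two proportions, n per group."""
    pooled = (hits_a + hits_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # everyone coded the same way; no evidence of a difference
    z = (hits_a - hits_b) / n / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_coding_choices(n_sims=5000, n=50, cutpoints=(3, 4, 5, 6), seed=3):
    rng = random.Random(seed)
    scale = range(1, 8)  # hypothetical 1-7 survey item
    fixed_hits = flexible_hits = 0
    for _ in range(n_sims):
        group_a = rng.choices(scale, k=n)  # both groups drawn from the same distribution
        group_b = rng.choices(scale, k=n)
        p_values = [two_proportion_p(sum(x >= c for x in group_a),
                                     sum(x >= c for x in group_b), n)
                    for c in cutpoints]
        fixed_hits += p_values[1] < 0.05       # cutpoint 4, decided in advance
        flexible_hits += min(p_values) < 0.05  # cutpoint picked after seeing the data
    return fixed_hits / n_sims, flexible_hits / n_sims

fixed_rate, flexible_rate = simulate_coding_choices()
print("cutpoint fixed in advance:", fixed_rate)
print("cutpoint chosen afterwards:", flexible_rate)
```

The four codings are highly correlated, so the inflation is smaller than four independent tests would give, but a free choice of cutpoint still pushes the false-positive rate well above the rate for a single pre-chosen coding.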

4. Finally, Bohannon is, I think, a bit too optimistic when he writes:

Luckily, scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.

I agree that p-values are generally a bad idea. But I think the real problem is with null hypothesis significance testing more generally, the idea that the goal of science is to find “true positives.”

In the real world, effects of interest are generally not true or false, it’s not so simple. Chocolate does have effects, and of course chocolate in our diet is paired with sugar and can also be a substitute for other desserts, etc etc etc. So, yes, I do think chocolate will have effects on weight. The effects will be positive for some people and negative for others, they’ll vary in their magnitude and they’ll vary situationally. If you try to nail this down as a “true” or “false” claim, you’re already going down the wrong road, and I don’t see it as a solution to replace p-values by confidence intervals or Bayes factors or whatever. I think we just have to get off this particular bus entirely. We need to embrace variation and accept uncertainty.

Again, just to be clear, I think Bohannon’s story is great, and I’m not trying to be picky here. Rather, I want to support what he did by putting it in a larger statistical perspective.

85 thoughts on “John Bohannon’s chocolate-and-weight-loss hoax study actually understates the problems with standard p-value scientific practice”

  1. No comment on the ethics of involving human subjects (with blood draws!) in a study intended to produce absolutely no new knowledge? It’s not like you can’t create a perfectly terrible p-hacked piece with existing data.

    • Yes, it would never get past an IRB that was awake. Well the whole thing never would, because it is all about deceiving people and spreading false information. So wrong on so many levels, but especially given that apparently an actual doctor was involved who should have known about the standards for conducting research. So I predict that rather than the expected effect the likely outcome is that he gets sanctioned.

      • Even if this somehow wouldn’t count as conducting research — or perhaps especially if this doesn’t count as research — a blood draw without medical indication is a serious ethical no-no.

        • “a blood draw without medical indication is a serious ethical no-no”

          That’s not true. There are plenty of ethical studies in which blood is drawn only for the purpose of gathering research data, with no medical indication whatsoever. It is quite routine, and IRBs approve such studies all the time. As long as the participant understands the purpose of the blood draw, the associated risks, and consents to it, there is no ethical problem.

        • The IRB can allow it if the benefits of the study outweigh the harms/risks of the blood draw and if participants know in advance that they will be having it.

        • Yes, which is why I prefaced the part you quoted with “if this somehow wouldn’t count as research.” Making the blood draw medical (mal) practice.

        • As a different article pointed out, the IRB guidelines for what is or isn’t medical research (and presumably what is or isn’t ethical in general) are different in Germany than the US.

        • I think this discussion is missing a key point.

          The study is very much designed to produce new scientific knowledge and succeeded spectacularly at this. It’s just that the knowledge it aims for is not what one might gather from naively reading the paper that the journal published.

          This ethical critique is almost like saying that Swift’s “Modest Proposal” is unethical because it would be wrong to eat Irish babies. Such a critique would totally miss the whole point.

          I do think that an ethics discussion on this work could be illuminating but it shouldn’t proceed from the basis that there was no attempt to generate scientific knowledge.

        • What was the new knowledge it produced? That p-hacking can produce misleading results? That predatory journals publish shoddy work without peer review? We knew all this already.

        • Agreed. This seems somewhat analogous to a study that would use blood draws to test the performance of some blood analysis assay — one wouldn’t learn anything about blood or disease, but one would generate knowledge about methods. Here, the method is the whole system of public propagation of scientific (mis)-information.

        • Bea:

          Science isn’t just about new knowledge, it’s also about communication. In this exercise, Bohannon made a useful contribution to science communication.

        • I was going to make this point in response to the root comment, but then I decided that this is not best categorized as social science research but rather as an act of… call it ‘gonzo advocacy’.

        • As far as I know Swift didn’t actually claim to do scientific research nor did he actually eat any babies or cause anyone to.
          Further, even if he were a research scientist engaging in an experiment involving eating babies, he wrote long before the current thinking about human research participants was developed. In his day it would have been considered fine to experiment on orphans or prisoners without their consent.

          If he had said to the participants “I would like you to participate in an experiment to see if journalists will write articles about an article that reports data with a ridiculous sample size and that engages in something called “p hacking” and as part of that I would like you to undergo an unnecessary medical blood draw (which is painful and has low (but not no) risk of negative impacts). As part of this study you may be assigned to a group that is asked to modify its eating behavior. We expect that this change will have no impact on your weight or health status” then he would have been okay.

    • This project simply doesn’t fit the paradigm of “clinical trials” for which IRBs were invented (to redress a history of ethical shortcomings, some quite egregious). It started out life as a documentary, with the “clinical trial” being a play within a play, so to speak. Given that it doesn’t fit the standard framework, the usual presumptions about the need for an IRB need to be checked carefully to see whether they really apply in this unusual case. Is there an ethical problem? People are pointing to blood draws as the key. But there are any number of activities for which participation is contingent on a medical exam that might include a blood draw. Do we want to mandate that all those activities also pass an IRB? (An example would be participating in a “Survivor”-style reality TV show. In fact, the documents for Survivor dictate that the medical exam be performed by a physician specified by the producers, so it’s not even in the context of an existing doctor-patient relationship.)

      I think it’s incumbent on those upset about this exercise because of the bureaucratic lack of IRB involvement to explain what actual ethical lapse occurred. Drawing someone’s blood and weighing them for 3 weeks in exchange for 150 euros seems like a straightforward transaction of the sort that happens all the time. (I only get a cookie and juice when I give blood.) What’s the problem?

      • As I understand it, the first blood draw was probably directly beneficial to the subjects. If any irregularities were found, I am sure they would have been told to take care of the problem.

      • It is not that a problem happened but the kinds of problems that *could* have happened that we need an IRB for.

        IOW, I may not fault an IRB for approving this work. But so long as they did approach an IRB.

      • I think people are concerned about a few things. Now, this was not a scientific study so the rules of ethical research do not apply in any legal sense. There is no institute for OHRP to suspend from getting federal money etc. But on the other hand, ethics is in principle not only about following a set of rules, it’s about the underlying ideas (in the US those are mainly outlined in the Belmont Report, but basically in the end they are: treat your participants fairly and honestly; do not lie to or deceive them, except under unusual circumstances in which you will offer debriefing afterwards; and think broadly about the potential negative impact of your work).

        The fact that you might get blood drawn (or go on a diet or get a political flyer during an election or have some stranger come up to you and talk about gay marriage) in the context of ordinary life doesn’t really have anything to do with that happening in the course of research. IRBs only are involved when the purpose of a study is to create knowledge rather than to do things for ordinary life purposes. (In fact I am really surprised that no one has asked about the consent procedures in the LaCour study. Did he debrief afterwards? I haven’t seen anything about that. I’m pretty sure it would be possible to write an appropriate informed consent form for a study designed to change your attitude on a possibly deeply held belief, possibly without revealing what that belief is until after the study is completed, but I don’t think it would be simple.)

        Bohannon himself, from what I can tell, might not have a lot of experience with human participants. And who knows about the doctor involved, whether he has any experience either. This con (as he describes it) would not have gotten past an IRB. He could have done the same exact con with fake data and still had the same paper submitted and published and the same journalists pick it up in the same places. What is bothering people is that he says he actually went through the motions of carrying out fake research but not to the extent of following the standard ethical guidelines for the research.

        • I perhaps missed the answer to the question, so excuse me if I did. But can you just succinctly state the answer to the posed question, namely:

          “I think it’s incumbent on those upset about this exercise because of the bureaucratic lack of IRB involvement to explain what actual ethical lapse occurred.”

          And do you believe that an actual ethical lapse occurred, or that an ethical lapse _potentially_ occurred, and we need more data?

        • Here’s an analogy: “Why am I getting a ticket for running that red light at midnight? No harm came to anyone. I didn’t crash.”

          Not only does one get penalized for actual harm but also for potential harm that may have resulted by not following laid down protocol.

          PS. It seems that Bohannon may have acted in a non-institutional setting. If so, maybe the policy did not apply. Not sure about the law.

        • I understand that concept, but that’s not what (I think) was asked.

          Should I take from your reply that no ethical lapse occurred, and you’re just making a general point about IRBs being important?

          That’s a fine opinion to hold, but we should separate ethical misconducts in research from procedural misconduct in getting the research started. The lack of an IRB is ethical misconduct insofar as it eschews a check against unethical research; that wouldn’t make the actual studies done without an IRB unethical, it would make the context in which the studies were done unethical. A technical difference, but a very important one, in my opinion.

  2. ” Even if a researcher only performs a single comparison on his or her data and thus did not do any “fishing” or “fiddling” at all, the garden of forking paths is still a problem, because the particular data analysis that was chosen, is typically informed by the data. That is, a researcher will, after looking at the data, choose data-exclusion rules and a data analysis.”

    Newton not only discovered the inverse square law of gravity after seeing the data, but he actually derived/backed-out the law from the data. It was as extreme a data dependent analysis as you could get. What’s worse is that according to the physicist Harold Jeffreys writing in the 1930’s, in the several hundred year history of Newtonian gravity up to that point, there was no time at which it would have passed a classical hypothesis test. No doubt all this is why Newton’s name is synonymous with junk science today.

    But hey, it sounds like you statisticians have this “science” thing all figured out. Just compute one p-value and make sure you decide to compute it at an acceptable moment in time. Do that and you’ll no doubt be churning out science far better than that rapscallion Newton.

    • Anon:

      This is a bit exhausting but I’ll give it one more try. No, I am not saying that Newton and Jeffreys were wrong, nor am I saying that statisticians have this “science” thing all figured out. That is you attributing this statement to me. As I have consistently written, I don’t think p-values are generally a useful way of summarizing inference. If you look at my published work, you’ll see very few p-values, because I think they have lots of problems.

      As I have consistently written, if someone is going to compute a p-value, then this p-value should accurately represent the random variable T(y), as a function of possible data y. It’s like the Monty Hall problem: you don’t have to play, but if you are going to play, it makes sense to use the laws of probability.

      • Listen Andrew, I understood perfectly that you don’t care for p-values the first time, but you’re making a point of principle about p-values and you’re flat out wrong about it and you’re spreading a great deal of counterproductive nonsense in the process.

        You have absolutely no evidence for this absurd Garden of Forked paths theory of yours except that it “makes sense” to you, but that’s just the point. What “makes sense” to you when playing the game is actually “nonsense” implying you shouldn’t play the game at all.

        • Anon:

          Sorry, no, it’s not a theory of mine, it’s just simple mathematics. The p-value is Pr(T(y.rep)>T(y)), which means it requires a definition of T(y.rep). This is not well understood. The concept of a p-value is confusing to people, and they tend to think that T(y.rep) is defined by whatever they did to the data y. But, no, the p-value is a statement about what would’ve happened, had the data been different.
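The definition Pr(T(y.rep) > T(y)) can be turned into a Monte Carlo sketch that forces you to spell out both pieces: how y.rep is generated, and what T is. Everything specific below is an assumption for illustration: made-up observed data for 16 men and 16 women, a null model of independent standard normal outcomes, and a T that is the whole data-dependent procedure (look at men, women, and everyone, and keep the largest |z|) rather than just the test the analyst ended up at.

```python
import math
import random

def z_abs(values):
    """|z| for the mean of values, assuming known sd = 1 and null mean 0."""
    return abs(sum(values)) / math.sqrt(len(values))

def chosen_statistic(men, women):
    """T(y): the full procedure, including the choice made after seeing the data."""
    return max(z_abs(men), z_abs(women), z_abs(men + women))

def monte_carlo_p(men, women, n_rep=20000, seed=2):
    """Estimate Pr(T(y.rep) >= T(y)) with y.rep drawn from the null model."""
    rng = random.Random(seed)
    t_obs = chosen_statistic(men, women)
    exceed = 0
    for _ in range(n_rep):
        men_rep = [rng.gauss(0, 1) for _ in men]
        women_rep = [rng.gauss(0, 1) for _ in women]
        exceed += chosen_statistic(men_rep, women_rep) >= t_obs
    return exceed / n_rep

# Made-up observed data: men average 0.5 (z = 2.0), women average 0.
men_obs = [1.5, -0.5] * 8
women_obs = [1.0, -1.0] * 8

# Naive p-value: pretend the men-only test was the plan all along.
naive_p = math.erfc(z_abs(men_obs) / math.sqrt(2))
honest_p = monte_carlo_p(men_obs, women_obs)
print("naive p:", round(naive_p, 4))
print("p for the procedure actually followed:", round(honest_p, 4))
```

The naive number treats T(y.rep) as "whatever was done to y"; the second applies the same choice rule to every replication, and it comes out larger because the replications also get to pick their best-looking comparison.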

        • So, to mathematicize your Garden of Forked Paths criticism, consider:

          F(f_i(T_i(y.rep) > T_i(y))|i) F(i|y)

          Where T_i is a choice of test statistic, and f_i is a probability distribution interpreted as a frequency, and _i indexes the choice of statistical test method and F is the observed frequency with which p < 0.05 across all actual applications of "hypothesis testing".

          Now, F(i|y) plays the role of the "choice of test statistic after seeing the data" and F(…|i) plays the role of checking to see how often we win the p [text lost in the original comment] T_i(y))|i) >> 0.05

          so that overall, p < 0.05 occurs far more often than it would if F(i) wasn't conditional on y?

          something like that I think?

        • In any case, I think if I could get my post “fixed” it would sort of make a useful point. The problem with “frequentist statistics” in which events in the world are given probabilities in the sense of a long run frequency, is that any such claim *about the world* is an empirically un-justified one. The frequency with which p < 0.05 under the null hypothesis occurs is very high because people do in fact choose their “null hypotheses” in such a way as to make them unlikely to generate their actual data.

        • you’re flat out wrong about it

          You have absolutely no evidence for this absurd Garden of Forked paths theory of yours

          No. See the Berk and Brown paper I mentioned earlier, which discusses the same issue, which can come up with any unplanned analysis. Andrew’s work makes it easily-understood but the issue is well-known among statisticians. It is easily shown to be real with simple simulations.

        • Irrespective of whether the garden is a real problem, the antidote seems clear: Mandatory Pre-registration.

          If the garden is a problem, pre-registration fixes it. If it isn’t, not much harm is done. (Ok, we do hamper the paper-churning ability of researchers a bit.)

        • Rahul:

          As I’ve said before, I’ve never done a preregistered study in my life and I think I’ve made many useful research contributions. These contributions are often in the form of research papers. I think it’s a bit inappropriate to call this “paper churning.” Preregistration gives value in some settings and that’s fine. I think researchers should be open about their research design, and I certainly think researchers should be allowed the option of preregistration. But if non-preregistered papers were no longer published, we’d be losing a lot.

    • Do you have a cite for the proposition:
      “What’s worse is that according to the physicist Harold Jeffreys writing in the 1930’s, in the several hundred year history of Newtonian gravity up to that point, there was no time at which it would have passed a classical hypothesis test”

      I find the assertion hard to believe. Admittedly, it’s not clear to me what the null hypothesis would be—that force is inversely proportional to distance to the 1.8th power? But, highly accurate predictions had been made with that theory. For example, Newton explained variations in the orbits of Jupiter and Saturn. If Jeffreys’s point was that Bayesian reasoning provided a better tool for dealing with the truth or falsity of Newton’s theory of gravity, sure. But, there were a lot of observations that fit Newton’s theory well.


      • Jeffreys discusses this in Theory of Probability, pg 391. In discussing how Einstein’s laws improved on Newton’s laws, yet may still be imperfect, what he actually says is:

        “There has not been a single date in the history of the law of gravitation when a modern significance test would not have rejected all laws and left us with no law.”

        Incidentally, the orbits of Jupiter and Saturn do get a mention in Jeffreys’s discussion.

        • I don’t see the relevance of this whole Newtonian mechanics thread. As far as I know, Newton didn’t use statistical tests at all.

          And surely saying that a significance test would reject all laws and leave us with no law is _agreeing_ with Andrew’s point more than it disagrees: Andrew is definitely NOT arguing that theories should be chosen based on statistical significance tests!

          There’s nothing wrong with looking at the data before and during your data analysis! Indeed, I would say there’s something wrong with _not_ doing it. What’s wrong is p-hacking and related issues.

        • I’m not sure why Anonymous always brings up Newton, except basically that Newton’s laws and so forth were discovered totally in the absence of any statistical methodology based on p values, yet they are extremely “hard” science…

          I think you can take a couple different things home from that, choice is dependent on you:

          1) Good science can be done without p value type statistics
          2) Maybe statistics in terms of p values actually hinders things
          3) Maybe even if you pre-register and etc etc to try to make the p-value stuff “work better” you still wind up with p value concepts hindering things

          I know Anonymous advocates just using Bayesian methods and dropping the whole p value schtick entirely. I honestly think that Andrew more or less believes that as well, but I don’t think Andrew advocates for that position strongly enough for Anonymous. In other words, I think all this stuff is about Anonymous trying to get Andrew to agree that we should all be full-on Bayesians and drop any pretense of NHST being maybe ok in some situations… but maybe that’s reading too much into it?

        • You don’t get to do science (know-ling) correctly, you hope to do it less wrongly – no one gets past that.

          OK, accept the uncertainties you can’t change, reduce the ones you can, yada, yada,…

        • But was Bayesian or any other type of statistics useful for Newton? The whole Newton thing seems to be a red herring.

        • Phil: “There’s nothing wrong with looking at the data before and during your data analysis! Indeed, I would say there’s something wrong with _not_ doing it. What’s wrong is p-hacking and related issues.”

          What kind of data analysis are we talking about here, exploratory or confirmatory?

          I am actually concerned about the common recommendation in statistics textbooks to, as a first step, plot the data.

          If it is simply an exploratory data analysis, without a report of p-values or measures of model adequacy, then this advice sounds fine.

          But it seems that if the data analysis is of the confirmatory nature, reporting p-values and measures of model adequacy, then the plotting of data just opens the door to the “garden of forking paths.” And in the ideal where data, model, methods, have been preregistered there seems to be little value in plotting the data first, except perhaps to save you time by short circuiting your analysis if you find sufficient evidence in your plot that the model you proposed is inadequate.

          Perhaps it would be better for textbooks to recommend holding off on the plotting step until the model validation phase of the data analysis.

          Too extreme?

        • The biggest problem is that people never clearly distinguish between whether their analysis is of an exploratory or confirmatory nature.

          For some reason there’s a lot of resistance to a clear categorization of studies into these classes.

        • If all of their knowledge about statistics comes from the textbooks, they could hardly be blamed for not realizing that there is a distinction.

          I think though, if they begin reporting p-values and whatnot and don’t say otherwise, then we can conclude that they are representing their analysis as confirmatory.

        • @JD

          What about the studies that do not use p-values. e.g. Bayesian work. Which of those studies are / are not confirmatory?

        • @Rahul

          That falls into the category of “whatnot”: numbers being reported that are only valid under the circumstances of “one look” at the data.

        • I don’t know what “confirmatory” data analysis means. I’m pretty sure I’ve never done anything that would be described by that term. I’m sure there are times when a “confirmatory” analysis is the right thing to do, but I think if most people think that what they’re doing is “confirming” something, that’s a problem right there. I am definitely not a believer in the version of science that is taught in fourth grade, in which one generates a hypothesis, collects data, and tests the hypothesis to “confirm” that it is true.

          I’ve been involved in quite a few analyses of real-world data and I definitely do not recommend selecting statistical models in advance, applying them, generating and writing up the results, and only then looking at the data. In every case I’ve seen you need to look at the data to create the models in the first place and/or to deal with things like missing data, mis-coded data, etc. Here’s an example…really the closest thing to an analysis where we _could_ have laid out the entire analysis before seeing the data. Even here, though, it would have been a bad idea:
          A couple of years ago I was involved in a project to quantify the accuracy with which building energy consumption over a period of months or a year could be predicted using statistical models that are based on historical energy data from the building, along with outdoor air temperature data. We got data on a random sample of 500 commercial buildings, and tested five different statistical models ranging from the very simplest (the next year will be just like the last year) to somewhat more sophisticated ones that adjust for temperature and use different adjustments for periods when the building is or isn’t being heated or air-conditioned. The basics are quite straightforward so in principle perhaps this is a case where we could have made all of the decisions without looking at any of the data. But we didn’t: the first thing I did was plot the energy consumption (actually electric load) versus time for every building. What I found was that something like 5% of the buildings had been unoccupied for part of the two-year period covered by the data. I’d have to go into more detail than is warranted here to explain fully, but: none of the use cases for which our project would be relevant would ever include a building that was unoccupied like that. So we threw those buildings out of the dataset. We did not throw out buildings simply because their energy use patterns changed a lot or were unpredictable, because such buildings could be included in the kinds of use cases we were interested in. Unoccupied buildings, no. Of course, in the report we wrote about the project, we said we had excluded these buildings, and why.

          Anyway, I think that picturing the point of a data analysis as being to “confirm” something is usually a mistake. Not always, but usually.

        • @Phil:

          Take your example about building energy consumption. Since your goal was to quantify the accuracy I suppose you concluded something to that effect.

          A confirmatory study, in my mind, would be you or someone else taking another random sample of commercial buildings and examining if your models and the predicted accuracy hold on the new dataset.

          Maybe you’d call that replication or out of sample validation? To me that’s a very integral part of doing good science. If there was any utility to having you develop the initial model, surely there is utility in independently checking if your model is right.

        • JD, I’ve never done data analysis that attempts to confirm a hypothesis. I think few analyses done in the real world fit that description, and I think that often when people think that’s what they’re doing, they really should be doing something else. For example, if someone thinks “I’m trying to confirm that my company’s new drug helps prolong the life of patients with pancreatic cancer,” they’re making a mistake; they should be thinking “I’m trying to quantify the effect of my company’s new drug on patients with pancreatic cancer.”

          Rahul, if I start with a sample of 500 randomly selected buildings and quantify the performance of the models, and then someone gives me data from another 500 buildings, I would certainly check for consistency between the two samples; you’re right about that. In principle it would be nice if you could take people at their word when they say they used the same process to generate the second sample as the first, but in practice there are often complications and you end up with differences in the sampling procedures. But if I believed that the second set was in fact an additional sample, I would not use it to “confirm” the results from my first study; instead I would jointly analyze all 1000 buildings to reduce the uncertainties in the quantities I’m looking at.

          I/we can come up with cases in which a “confirmatory” analysis really is what you want, but most analyses aren’t like that, or shouldn’t be.
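The point about jointly analyzing all 1000 buildings rather than using the second 500 to “confirm” the first can be illustrated with a rough sketch. Everything here is hypothetical: the per-building prediction errors are simulated Gaussian draws (mean 0, standard deviation 10), not the actual building data, and `standard_error` and `compare_pooling` are names made up for the illustration. It assumes both samples really do come from the same process.

```python
import random
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5


def compare_pooling(n_per_sample=500, seed=1):
    """Return (SE from the first sample alone, SE from both samples pooled)."""
    rng = random.Random(seed)
    # Hypothetical prediction errors, one per building; same process both times.
    first = [rng.gauss(0.0, 10.0) for _ in range(n_per_sample)]
    second = [rng.gauss(0.0, 10.0) for _ in range(n_per_sample)]
    return standard_error(first), standard_error(first + second)


if __name__ == "__main__":
    se_first, se_pooled = compare_pooling()
    print(f"SE with 500 buildings:  {se_first:.3f}")
    print(f"SE with 1000 buildings: {se_pooled:.3f}")
    print(f"ratio: {se_pooled / se_first:.2f}  (theory: about 0.71)")
```

Doubling the sample shrinks the standard error by a factor of about 1/sqrt(2). That gain only holds if the second sample was generated the same way as the first, which is why checking consistency between the two comes before pooling.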

        • Phil,

          I suspect we’re just talking past each other here. I agree that estimating effect sizes is more interesting than reaching binary Yes or No conclusions.

          But would you agree with me that if you did not have a hypothesized model in mind, plotted data and from that decided to fit a quadratic, and then reported the magnitude of this effect, that is less compelling scientifically than if you had hypothesized a quadratic effect before looking at the data, fit that model and then reported the magnitude of this effect? I think this captures some of the essence of exploratory vs. confirmatory.

        • @Phil

          My impression was that collecting data to confirm or repudiate the predictions / hypotheses / models proposed by other scientists was an integral part of how science progresses. e.g. Eddington’s observations to confirm Einstein’s theories. Or the Higgs Boson & the LHC.

          I’m totally lost when you say “point of a data analysis as being to “confirm” something is usually a mistake”

        • @Rahul: repudiate, yes. It’s unusual that one study, even at the LHC, can do more than exclude some possible explanations for previous results. If you go into the study with the mindset of ‘confirming’ a particular theory, you’re apt to fail to apply appropriate skepticism to your own reasoning.

        • @Corey

          Fair enough. Maybe we should call them “repudiatory” studies? In any case, isn’t the distinction between a repudiatory study & an exploratory study still relevant?

          Hypothesis generation versus hypothesis repudiation seem two different exercises.

  3. As Ben Goldacre has observed, it is possible to prove almost anything when dealing with aspirin, chocolate or circumcision. Bohannon’s spoof strikes me as being truly inspired. His “study design is a recipe for false positives”:

    “the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.”

    Read it carefully and you will see that he did everything that a typical randomized controlled trial is supposed to do, and that such a trial can still be deceptive and defective. Especially so because the media is desperate for headlines and ink. As Andrew has indicated, Bohannon’s strategy focuses on exploiting

    “journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.”

    “Almost no one asked how many subjects we tested, and no one reported that number. Not a single reporter seems to have contacted an outside researcher. None are quoted.”

    For a similar satirical study of over 20 years ago which had nothing to do with aspirin, chocolate or circumcision but received media attention anyway (WSJ, TV, newspapers), try

    • This is part of what is unclear.
      Was the purpose of what he did to show that you can manipulate your analyses to get a statistically significant result?
      To show that you can pay to be published in one of those fake journals?
      To test the hypothesis that journalists will pick up fake stories and not do minimal fact checking?

      Each one of those is interesting but not original. That said, it’s a well put together package and I’m sure the documentary will get lots of views.
      I’ll be interested to see how many magazines at least publish a correction.

      • Even though those three facts may not be novel, obviously the problems still persist. Ergo, it is worth drawing wider public attention to all those problems until public awareness gets high.
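The “recipe for false positives” quoted above can be made concrete with a small calculation. This is only a sketch under stated assumptions: it treats the outcomes as independent null tests at the 0.05 level, and the count of 18 outcomes is an assumed figure that is not given in this thread. Real outcomes such as weight and blood measures are correlated, and real analyses involve choices beyond counting tests, so the number is illustrative rather than exact. The function names are hypothetical.

```python
import random


def familywise_error_rate(n_outcomes, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


def simulate_null_studies(n_outcomes, alpha=0.05, n_studies=20_000, seed=0):
    """Fraction of simulated null studies with at least one 'significant' outcome.

    Under a true null, a valid p-value is uniform on [0, 1], so each
    outcome's p-value is simulated as a single uniform draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_outcomes)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    k = 18  # assumed number of measured outcomes; not stated in this thread
    print(f"closed form: {familywise_error_rate(k):.3f}")
    print(f"simulation:  {simulate_null_studies(k):.3f}")
```

With 18 independent outcomes and no real effects at all, roughly 60 percent of such studies would still have at least one result to report as statistically significant.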

  4. It’s more than slightly ironic that the article doesn’t mention following up by contacting the media outlets and telling them of the greater experiment, which might actually have been a useful action in as much as the opportunity existed to help people understand how science does and doesn’t work.

    I’d argue, as another commenter did above, that the focus on the p-value is a red herring. The scientific process depends on replication. When Bem published his papers on ESP, others tried and failed to replicate the experiments. Even if Bem’s experiment was bad, science as a whole worked exactly as it should have. The chocolate study, now having “discovered” something, can be used as a basis to try and replicate an experiment with preregistered conditions. It’ll probably fail to replicate, or what would be deliciously (pun intended) ironic, it might actually work. :)

    I struggle to understand what we’re hung up on. Is it really p-values? Or “stupid” people who don’t understand the scientific process and believe everything they read? Or journalists who as a subset of “stupid” people publicize it? Or the fact that too few researchers want to spend time doing the necessary replication work, which is probably neither intellectually stimulating nor likely to get you much funding?

  5. Via

    “Late on Thursday, the editor of International Archives of Medicine posted the following statement on the journal’s Facebook page:

    Disclaimer: Weeks ago a manuscript that was being reviewed in the journal “Chocolate with High Cocoa Content as a Weight-Loss Accelerator” appeared as published by mistake. Indeed that manuscript was finally rejected, although it went online for some hours.”

    We are sorry for the inconvenience. We are taking measures to avoid this kind of mistakes happens again.

    • Were the International Archives of Medicine a real journal I would have to ask myself which was more disturbing: that the editor lied or the journal had a Facebook page.

    • Fernando:

      I read that opinion piece by Rachel Ehrenberg and I disagree with it. That said, I have sympathy for the author. As a college teacher, I was annoyed when someone did a bogus experiment using me as a guinea pig and wasting my time. If I were a reporter who was sent fake promotional material by Bohannon, I might well be irritated in the very same way. So, although I like what Bohannon did, I can also see where Ehrenberg is coming from.

      • She does make an interesting statistical point though. If we call journalists lazy, gullible etc. it makes sense to put a number on it. What fraction of journalists are like this? Do people actually believe this kind of hyped science reporting, or do they just treat it as background noise (my question, not hers)? How many people? This seems to be precisely up your alley: there is no black and white, just statistics.

        • I found the piece by Rachel Ehrenberg rather annoying. She asks “Did Bohannon consider how many readers will conclude from their prank that all of journalism and all of science is not to be trusted?”

          Sadly, most of science journalism is not to be trusted in practice. This wasn’t an outlier case where science journalists chose an iffy study to promote. It seems to be the norm rather than the exception. And it’s scary to see how many media outlets took the bait. And they weren’t very obscure ones either.

          Ergo, even if Bohannon’s study resulted in an even higher distrust level of popular journalism it wouldn’t be very bad really.

          Rachel Ehrenberg’s fundamental motivation seems a defense of journalism not the correctness of science communication.

        • One of the still ongoing impacts of the Tuskegee study is that African Americans are disproportionately reluctant to participate in clinical trials, science research of any kind and are distrustful of science and doctors. All sounds good in the way you are saying until you realize that means that they are not getting the treatments that others get AND the data are worse because they are missing the distrustful part of the population.

        • The big difference is that clinical trials have *changed*. We can be very very sure we no longer do those things.

          The day we can have similar confidence about science journalism, yes it will be a loss to the distrustful public. But that point seems very far.

          Till then a public distrustful of science journalism is acting in its best interests.

  6. Pretty funny stuff! I had just this experience only this week. In a seminar the audience was sagely advised by a lecturer in the Kennedy School of Govt at Harvard, on loan to Sydney University, that some aspect of behavioral economics was ‘statistically significant’….Ohh, I thought, does that mean that it was actually useful, or has science been displaced by arithmetic? A NSW Govt. PhD added fuel to the fire by telling us the same thing…his work in this area was also statistically significant…I could just see the p-values floating before my eyes, and I wondered at the quality of a PhD these days.

  7. It is fun when a post provokes this much interest.

    By the way Alex Reinhart published a little book “Statistics Done Wrong” which attempts to catalog a number of inept ways of doing statistics.

    • He writes in “conclusion 1”: “I wouldn’t find any of these studies alone very convincing. But together, they compensate for each other’s flaws and build a pretty robust structure.”

      I wish he would explain exactly how he came to this conclusion, especially since the findings on his list are all positive (possibly cherry-picked?). I haven’t read that literature so I have no idea.

    • Troy:

      I followed the link. Much of what he says seems reasonable, but his representation of statistics reveals some misperceptions. For example, he writes, “the whole point of p-hacking is choosing at random form a bunch of different outcomes.” But, no, that’s not right, nobody’s choosing at random; rather, they’re constructing their comparisons based on some combination of their data and their theoretical ideas.

      I agree with his statement, “there is no one-size-fits-all solution to make statistics impossible to hack.” But, as Loken and I discussed in our “forking paths” article, I think that the framework of “hacking” is misleading, in that the big problem is not so much people trying to “hack” their results but rather the larger framework in which people expect to make scientific discoveries in this way.

  8. They created the “Institute of Diet and Health” and its IRB approved the study. Totally by the books.

    I hate disclaimers, but I am not being sarcastic. This was a perfectly normal IRB-approved study.

  9. Dr. Gelman,

    I’m just getting into Bayesian methods and I am using your textbook (and the one by Kruschke) to learn. I’d say I have an intermediate-level understanding of how to do Bayesian Inference using either HDIs + ROPEs or Model Comparisons + Bayes’ Factors. For this chocolate study, I’m not sure how on earth I’d go about setting up a prior or even a likelihood model for that matter.

    I was thinking, that since this is such a high-profile case that has already caught your attention, would you consider showing the world how to analyze this chocolate data using Bayesian methods to show how the analysis results differ?
