Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on our Garden of Forking Paths paper, and I comment on their comments

Jessica Tracy and Alec Beall, authors of that paper that claimed that women at peak fertility were more likely to wear red or pink shirts (see further discussion here and here), and then a later paper that claimed that this happens in some weather but not others, just informed me that they have posted a note in disagreement with a paper by Eric Loken and myself.

Our paper is unpublished, but I do have the megaphone of this blog, and Tracy and Beall do not, so I think it’s only fair to link to their note right away. I’ll quote from their note (but if you’re interested, please follow the link and read the whole thing) and then give some background and my own reaction.

Tracy and Beall write:

Although Gelman and Loken are using our work as an example of a broader problem that pervades the field–a problem we generally agree about–we are concerned that readers will take their speculations about our methods and analyses as factual claims about our scientific integrity. Furthermore, we are concerned that their paper will misrepresent aspects of our research, because Gelman previously wrote a blog post on our research, published in Slate, which contained a number of mischaracterizations [see the three links in the first paragraph above; you won’t be surprised to hear that I don’t think I mischaracterized Tracy and Beall’s work, but clearly there has been some failure of communication.—AG] . . . we are posting here new information that we have also directly provided to Gelman and Loken . . .

Following the publication of our paper . . . we conducted a new study seeking to replicate our findings. This study produced a null result, but led us to formulate new hypotheses about a potential moderator of our previously documented effect (see here for a detailed description of this failure to replicate and our subsequent hypotheses). We found preliminary support for these new hypotheses in re-analyses of our previously published data, and so moved on to conduct a new study (N = 209) to directly test our new theory. This study proved fruitful; a predicted interaction emerged in direct support of our hypotheses. All of these results can be found in “The impact of weather on women’s tendency to wear red or pink when at high risk for conception” . . . Of note, this paper and the Psych Science paper together report ALL data we have collected on this issue . . .

Regarding the robustness of our main effect, we have now run new analyses testing for this effect across all these collected samples—the two samples we originally reported in our Psych Science paper, and the two new samples that comprise the two new studies reported in the PLoS ONE paper. Together these comprise a sample of N = 779. Although we expected the main effect to be considerably weaker across these samples than it was in our initial studies, due to major variance in the moderator variable that we have now found to influence this effect, we nonetheless found consistent support for that main effect. . . .

They follow up with many details of their statistical analysis, and again I encourage readers to go to their note. I have linked to it and quoted from it here to give them the same level of exposure that I have when posting on this blog.

Now for some discussion, which I thought it best to post right away. As the saying goes, I apologize for the length of this post; I did not have the time to make it shorter.

You can have a multiple comparisons problem, even if you only performed a single analysis of your data

This all started a couple weeks ago when Tracy and Beall informed me that they’d come across a preprint of my recent (unpublished) paper with Eric Loken, The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. This paper of ours is no secret; it’s openly posted on my website and I’ve referred to it a few times on the blog.

In section 3 of our paper, Eric and I discuss problems of multiple comparisons that, as we see it, destroy the interpretation of the p-values in Beall and Tracy’s paper. We did not feel that this discussion was controversial—after all, we had already discussed these issues in various places. Rather, we used their example in our paper in part because Beall and Tracy had assured us that they did not do any selection of different hypotheses to test. As we wrote:

Even though Beall and Tracy did an analysis that was consistent with their general research hypothesis—and we take them at their word that they were not conducting a fishing expedition—many degrees of freedom remain in their specific decisions. . . . all the [analysis contingent on data] could well have occurred without it looking like “p-hacking” or “fishing.” It’s not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant. Rather, they start with a somewhat-formed idea in their mind of what comparison to perform, and they refine that idea in light of the data. . . . the data analysis would not feel like “fishing” because it would all seem so reasonable. Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the way. . . . In this garden of forking paths, whatever route you take seems predetermined, but that’s because the choices are done implicitly. The researchers are not trying multiple tests to see which has the best p-value; rather, they are using their scientific common sense to formulate their hypotheses in reasonable way, given the data they have. The mistake is in thinking that, if the particular path that was chosen yields statistical significance, that this is strong evidence in favor of the hypothesis.

The above quote from our paper is notable because, in their recent note, Tracy and Beall write:

Gelman and Loken’s central concern is that our analyses could have been done differently – including or excluding different subsets of women, or using a different window of high conception risk. They imply that we likely analyzed our results in all kinds of different ways before selecting the one analysis that confirmed our hypothesis.

No. We did not imply this. As we wrote:

We take them at their word that they were not conducting a fishing expedition . . . In each of these cases [of multiple comparisons], the data analysis would not feel like “fishing” because it would all seem so reasonable. Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses.

I guess we should work on making this clearer in the revision. In particular, in the sentence, “It’s not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant,” we could change “hundreds of” to “many.”

In any case, Eric and I do not imply that Tracy and Beall “likely analyzed [their] results in all kinds of different ways before selecting the one analysis that confirmed our hypothesis.” Rather, the whole point of our paper was the opposite: that even if, given their data, they only did a single analysis, they could’ve done other analyses had their data looked different. (And, indeed, in their second study, they got different data and they did different analyses.) Such behavior is not necessarily a bad thing—as a practicing statistician, my analyses are almost always contingent on the actual data that appear—but it invalidates p-values.

To say this one more time, let me quote from the title of our paper:

Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Our whole point was that Tracy and Beall’s p-values should not be taken as strong evidence of their research hypotheses, even if, conditional on their data, they only did a single analysis, and indeed we are taking their word that they did no analyses that were unreported in their paper. I’m not sure how much clearer we can make this point, given that it’s in the title of our paper, but given the confusion here, we will try.
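To make the mechanism concrete, here is a minimal simulation sketch. It is not taken from the paper, and every name and number in it is a hypothetical choice for illustration. It assumes two subgroups (say, partnered and unpartnered women), unit-variance outcomes, and no true effect anywhere. In each simulated dataset the researcher runs exactly one test, but on whichever subgroup happens to show the larger apparent difference.

```python
import random
from statistics import NormalDist, mean


def z_stat(group_a, group_b):
    """Two-sample z statistic for equal-sized groups with unit-variance outcomes."""
    n = len(group_a)
    return (mean(group_a) - mean(group_b)) / (2.0 / n) ** 0.5


def simulate(n_sims=2000, n_per_group=25, alpha=0.05, seed=1):
    """Return (fixed_rate, forking_rate): false-positive rates under the null.

    fixed:   the subgroup to test is chosen before seeing the data.
    forking: one test per dataset, but the subgroup is chosen after
             looking at which one shows the larger difference.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        z_values = []
        for _subgroup in range(2):  # two hypothetical subgroups, no true effect
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            z_values.append(z_stat(a, b))
        if abs(z_values[0]) > crit:
            fixed_hits += 1
        if max(abs(z) for z in z_values) > crit:
            forking_hits += 1
    return fixed_hits / n_sims, forking_hits / n_sims


fixed_rate, forking_rate = simulate()
print(f"pre-specified subgroup: {fixed_rate:.3f}")
print(f"data-dependent subgroup: {forking_rate:.3f}")
```

With only one fork the data-dependent rate should land well above the nominal 5 percent, even though no dataset is ever tested more than once. Real analyses have many more forks than this toy version.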

The data

In their email, Tracy and Beall asked us to put them in contact with the editor of the journal who is handling our paper, and they told us that if we were to amend our paper to take account of their new analyses, they would share their raw data with us if we would like. We replied that we would bear their points in mind when making revisions of our article, and in particular we would try to make clear that the multiplicities we discuss represent potential analyses that could have been done had the data been different, which indeed is a separate question from their discussion of alternative analyses of the existing data. We also assured them that we would inform the editor of their concerns. But we told them that we did not feel that it makes sense in the editorial process to give any special role to the authors of papers that we discuss. It is common practice for papers in statistics and research methods to cite and discuss, often critically, the work of others, and in general the authors of work being discussed do not necessarily get any privileged role in the editorial process. We also said that we think the idea of public discussion of these examples is a good one, perhaps involving Petersen et al., Durante et al., Bem, and other authors of controversial studies that have been identified as having multiple-comparisons problems.

Finally, we recommended that, rather than send their data to us, Tracy and Beall post their data on the web for all to see. I don’t think they have any obligation to do so, nor do I think there is even an expectation that this be done—my impression is that it is uncommon for psychology researchers to post the (anonymized) raw data from their published papers, and I certainly don’t think that Tracy and Beall have any additional requirement to do so, just because their paper is controversial. If they would like to post their data, I encourage them to do so; if not, it’s their call.

In any case, I think the question of whether we should invite Tracy, Beall, Bem, and others into our editorial process, or the question of how widely these researchers should share their data, is separate from the statistical and scientific question, What do their studies tell us? I’m bringing up these “process” issues because they arose in our email interactions with Tracy and Beall, but they are a separate matter from the questions of scientific inference which I would like to focus on here.

When is peak fertility?

Beall and Tracy define peak fertility as days 6-14 of the menstrual cycle. When I first saw this it seemed odd to me because, when we were trying to get pregnant a few years ago, my wife’s gynecologist had told us to focus on days 10-17, or so I recalled.

So I decided to look things up.

As you probably know, peak fertility varies. There are no sure things when it comes to the fertility calendar. However, we can get some basic guidelines from the usual public health authorities. For example, here’s what the U.S. Department of Health and Human Services has to say about Natural Family Planning. Under the Calendar Method, they recommend you compute the first day when you’re likely to be fertile as “Subtract 18 days from the total days of your shortest cycle,” and for the last day they say, “Subtract 11 days from the total days of your longest cycle.” Beall and Tracy assume a 28-day cycle (this is somewhat ambiguous because they also number the days as going from 0 to 28), so that will take you from days 10-17. Or, if you want to say that the shortest cycle is 27 days and the longest is 29, that will give you the range 9-18.

Or we could try Their ovulation calculator says the most fertile days are 12-17. I’m pretty sure that other sources will give other dates. But I don’t think anyone out there is including days 6 and 7 in the peak times for fertility.

So where did Beall and Tracy’s fertility days come from? They write that “specified window is based on prior published work” and they give tons of references, but if you try to track these down, there’s not much there. My best guess is that Beall and Tracy followed a 2000 paper by Penton-Voak and Perrett, which points to a 1996 paper by Regan, which points to the 14th day as the best estimate of ovulation. Regan claims that “the greatest amount of sexual desire was experienced” on Day 8. So my best guess (but it’s just a guess) is that Penton-Voak and Perrett misread Regan, and then Beall and Tracy just followed Penton-Voak and Perrett.

But it doesn’t really matter. We’re doing science here, not literary scholarship, and if we’re studying the effects of fertility, we should use actual fertility, or the closest we can measure. I’m going with the Department of Health and Human Services until you can point me to something better.

Beall and Tracy write, “other researchers have used a slightly different window,” but 10-17 is a lot different than 6-14! Nearly half of Beall and Tracy’s interval is outside the HHS interval, and on the other end they’re missing three of HHS’s peak-fertile days.
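The window arithmetic in this section is easy to check by hand, but a short sketch makes the comparison explicit. It applies the Calendar Method rule quoted above (subtract 18 from the shortest cycle, 11 from the longest) and then compares the resulting days with the 6-14 window. The function and variable names are made up for illustration.

```python
def calendar_method_window(shortest_cycle, longest_cycle):
    """First and last likely-fertile days under the Calendar Method rule quoted above."""
    first_day = shortest_cycle - 18
    last_day = longest_cycle - 11
    return first_day, last_day


def day_set(first_day, last_day):
    """All cycle days in an inclusive window."""
    return set(range(first_day, last_day + 1))


hhs_28 = calendar_method_window(28, 28)     # constant 28-day cycle
hhs_27_29 = calendar_method_window(27, 29)  # cycles ranging from 27 to 29 days

paper_days = day_set(6, 14)   # window used by Beall and Tracy
hhs_days = day_set(*hhs_28)   # window from the 28-day calculation

outside_hhs = sorted(paper_days - hhs_days)  # days in 6-14 but not in the HHS window
missed = sorted(hhs_days - paper_days)       # HHS days that 6-14 leaves out

print("28-day cycle window:", hhs_28)
print("27-29 day cycle window:", hhs_27_29)
print("days in 6-14 outside the HHS window:", outside_hhs)
print("HHS days missing from 6-14:", missed)
```

Four of the nine days in the 6-14 window fall outside the 10-17 window, and three of the eight days in 10-17 are missing from 6-14, which matches the "nearly half" and "three days" statements above.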

Why does this matter?

Why do I harp on the days of peak fertility? In some sense it’s not a huge deal; indeed, had Beall and Tracy used the more standard 10-17 day window, maybe the comparison reported in their paper would not have come up statistically significant, but I think it’s possible they would’ve seen something else interesting in their data (for example, a comparison between partnered and unpartnered women, or between young and old women) that was statistically significant, and then we’d still be having this discussion in a different form.

But I think the days-of-peak-fertility story is important because it indicates something about Tracy and Beall’s approach to research—and, I fear, a lot of other research in this field. The problem, as I see it, is that their work has lots of science-based theorizing (arguments involving evolution, fertility, etc.) and lots of p-values (statistical significance is what gets your work published), but not a strong connection between the two.

Consider the following statement of Tracy and Beall:

It doesn’t particularly matter which window researchers use, as long as they make an a-priori decision about which to use and then run analyses for that window only.

This represents what one might call a “legalistic” interpretation of the p-value. It doesn’t matter what we are measuring, as long as we get statistical significance. Now, as I’ve pointed out in various places, I don’t think this legalistic reasoning holds here, because there’s no evidence that the authors made an a priori decision to analyze the data the way they did. In none of their papers is there preregistration of the decisions for data inclusion, data coding, or data analysis. As Eric and I emphasized in our paper, even if a research team only does a single analysis of their data, this does not at all imply they would’ve done that same analysis had the data been different. Beall and Tracy may have made an a priori decision to use this window, but they would’ve been free to change that decision had their data been different. That sort of thing is why people preregister.

But here I want to take a different tack, and just note the absurdity (to me) of Tracy and Beall saying that it doesn’t matter if they get the dates right! Just think about this for a moment. Here are the authors of two published papers on fertility, and they didn’t even bother to talk with a gynecologist or even look things up on the internet. [Correction: it is possible they talked with an expert or looked things up on the internet, or both. I have no idea who Tracy and Beall talked with or what they checked. It might be that the expert told them that the days of peak fertility were days 10-17 or that they looked at the U.S. Department of Health and Human Services website, but they decided to go with days 6-14 anyway. Or it’s possible they talked with a gynecological expert who told them that days 6-14 were peak fertility or that they happened to encounter a website that gave those dates.] I’m not an expert on the field but I just happen to be a middle-aged guy with a middle-aged wife so I noticed something funny.

Don’t Beall and Tracy care [just to be clear: I have no doubt that they care about the degree of truth of their research hypotheses; the thing I’m asking is if they care that they might have gotten the dates of peak fertility wrong]? I agree with the commenter who wrote:

The problem is simple, the researchers are disproving always false null hypotheses and taking this disproof as near proof that their theory is correct.

This isn’t how science should go. Look at it this way: suppose that the Beall and Tracy paper had no multiple comparisons issues, suppose they’d pre-registered their analysis, suppose even that they’d managed to replicate their results without having to add a new interaction mid-stream. Even so, under that best of all possible worlds (a world which did not exist), they wouldn’t have a finding about peak fertility, they’d have a finding about days 6-14. That should be interesting, no? Upon learning they got the dates of peak fertility wrong, Tracy and Beall’s response should not be: Hey, it doesn’t matter. It should be: Hey, this is a big deal! Our experiment did not tell us what we thought it told us. But they didn’t react this way. Why? I don’t think Tracy and Beall are bad people or that they are unconscientious scientists. I have every reason to believe they are doing their best. But I think they’re missing the whole point of statistical measurement, if they think “it doesn’t particularly matter” what they are measuring, as long as they decided to measure it ahead of time. It does matter, if you’re ever planning to link your statistical findings to your scientific hypotheses.

And this all affects Beall and Tracy’s statement. In particular, they report in their recent note that repeating their analysis using the days 7-14 window (recall that they originally used days 6-14) yields an odds ratio of 1.76 and a p-value of 0.046. Their original paper reported odds ratios of 3.85 and 8.67, so I guess a lot must have been happening on day 6 of the menstrual cycle. [Correction: Tracy just emailed me and pointed out, in addition to shifting their window by one day, their new analysis includes additional data, so that’s why the estimate shifted so much.]

In any case, the point is that if you want to study peak fertility, you should study peak fertility, which, according to the most authoritative source I can find, goes from days 10-17. Shifting from days 6-14 to days 7-14 isn’t enough. It is striking that their estimate changes so much from a 1-day shift [correction: not so striking given that new data were pooled in], but perhaps this is not such a surprise given the small sample size and high variation.

Again, let me emphasize that this particular analysis is not central to our point; Eric and I in our article list many dimensions of multiple comparisons, and I’d like again to point all of you to the excellent 50 Shades of Gray paper by Nosek, Spies, and Motyl, where they demonstrate the garden of forking paths in action. I’m going into detail about this peak fertility thing because I want to emphasize my statistician’s perspective that, if you care about the science, you should care about the measurement.

I don’t criticize Beall and Tracy for getting the dates of peak fertility wrong—it’s natural to trust the literature in your subfield, and we all make mistakes. (Indeed, I had to retract the entire empirical section of a paper after I learned that I’d miscoded some survey responses.) But I am bothered that, all these months after I pointed out the dates-of-peak-fertility thing, they haven’t even checked. They should check: After all, maybe I’m wrong! I’ve been wrong before. But if I’m not wrong, and the U.S. Department of Health and Human Services really does say the dates of peak fertility are days 10-17, then it’s time for Beall and Tracy to take a deep think. After all, the phrase “peak fertility” is in the title of their paper.

It sort of bothers me to keep writing this, not out of fear of redundancy—we’re already long past that point—but because indeed I might be missing something obvious and come out looking like a fool. But fair’s fair, if I really am being foolish, then you might as well get the opportunity to find out. If my only interest were to make an airtight case against the Tracy and Beall claims, I might not even bother with the peak fertility thing, as we have enough other strong points. But I do think it’s worth mentioning, even at some risk of embarrassment, because to me this issue is symptomatic of a big problem in this sort of research, which is that people often aren’t measuring what they say they’re measuring, and they don’t even seem so bothered when people point this out.

Positive recommendations

What should Beall and Tracy do? I don’t know that they’ll follow my recommendations, as they probably feel that I’m picking on them. I guess I’d feel that way if the situation were reversed, and all I can say to them is Eric and I are using their work as one of several case studies in an article that’s all about how statistics can be misused, even by researchers who are trying their best and are not “cheating.” To put it in Clintonesque terms, I feel your pain:

We’ll think of the faith of our advisors that was instilled in us here in psychology, the idea that if you work hard and play by the rules, you’ll be rewarded with a good life for yourself and a better chance for your research hypotheses.

I think Beall and Tracy do work hard and play by the rules, and unfortunately there’s the expectation that if you start with a scientific hypothesis and do a randomized experiment, there should be a high probability of learning an enduring truth. And if the subject area is exciting, there should consequently be a high probability of publication in a top journal, along with the career rewards that come with this. I’m not morally outraged by this: it seems fair enough that if you do good work, you get recognition. I certainly don’t complain if, after publishing some influential papers, I get grant funding and a raise in my salary, and so when I say that researchers expect some level of career success and recognition, I don’t mean this in a negative sense at all.

I do think, though, that this attitude is mistaken from a statistical perspective. If you study small effects and use noisy measurements, anything you happen to see is likely to be noise, as is explained in this now-classic article by Katherine Button et al. On statistical grounds, you can, and should, expect lots of strikeouts for every home run—call it the Dave Kingman model of science—or maybe no home runs at all. But the training implies otherwise, and people are just expecting the success rate you might see if Rod Carew were to get up to bat in your kid’s Little League game. (Sorry for the dated references; remember, I said I’m middle-aged.)
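A rough sketch of the arithmetic behind this expectation, under assumptions that are purely illustrative: the estimate is normally distributed around a positive true effect, and the true effect is half the size of the standard error. The numbers are hypothetical and are not estimates from any of the studies discussed here.

```python
from statistics import NormalDist


def significance_rates(true_effect, std_error, alpha=0.05):
    """Return (power, wrong_sign_share) for a two-sided z test.

    Assumes the estimate is normal with mean true_effect (taken as
    non-negative) and standard deviation std_error.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = true_effect / std_error
    right_sign = 1 - z.cdf(crit - shift)  # significant, same sign as the true effect
    wrong_sign = z.cdf(-crit - shift)     # significant, opposite sign
    power = right_sign + wrong_sign
    return power, wrong_sign / power


# Hypothetical setting: true effect is half a standard error
power, wrong_sign_share = significance_rates(true_effect=0.5, std_error=1.0)
print(f"chance of a statistically significant result: {power:.3f}")
print(f"share of significant results with the wrong sign: {wrong_sign_share:.3f}")
```

In this made-up setting fewer than one study in ten reaches statistical significance, and a noticeable fraction of those that do point in the wrong direction.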

To put it another way, the answer to the question, “I mean, what exact buttons do I have to hit?” is that there is no such button. But I suspect that Tracy and Beall have been trained (implicitly, presumably not explicitly) to expect an unrealistically high rate of success. They really seem to believe that they can discover enduring truths through short questionnaires of Mechanical Turk participants, and once they think they’ve had such success, they understandably are not happy when anyone tries to take this away from them.

Tracy and Beall conclude the main body of their recent note with:

Indeed, we hope other researchers will join us in seeking a more precise estimate of the main effect and of the weather moderator, and in discovering other variables that no doubt also moderate this effect.

But I’m sorry to say that all the evidence I’ve seen suggests that they are chasing noise. Their results and their papers look just like what I’d expect them to look like, if they were studying effects so small as to be undetectable with the tools they are using.

It makes me sad when people chase noise, so I concluded my email message to Tracy and Beall with some advice:

We wish you the best of luck with your research and we encourage you to consider within-subject designs with repeated measurements, which should give you more of a chance to discover stable patterns in your experimental data.

And I meant it. Actually, I think Eric means it even more than I do, as he’s been the one who keeps banging on the importance of within-person studies for estimating within-person effects. Yes, if you’re careful you can estimate within-person effects with between-person studies, and yes, I know about the concerns with poisoning the well, but in a case like this, where you’re interested in how individual behavior is changing over time, and there is so much individual-level variation (including, in this case, factors such as what clothing a woman has in her closet), I really think that a within-subject design is the only way to go. Such a study takes more effort—but, again, if we accept that important breakthroughs can require lots of work, that’s fine. A Ph.D. thesis can take 2 years or more, right?
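To see why individual-level variation favors a within-subject design, here is a minimal sketch under a simple additive model that is my assumption for illustration, not anything from the papers discussed: each measurement is a stable person-level component plus independent measurement noise, and each person in the within-subject design is measured once in each condition. The standard deviations used are hypothetical.

```python
from math import sqrt


def between_person_se(sd_person, sd_noise, n_per_group):
    """SE of a difference in means between two separate groups of people.

    Each observation carries both person-level and noise variance.
    """
    return sqrt(2 * (sd_person ** 2 + sd_noise ** 2) / n_per_group)


def within_person_se(sd_noise, n_people):
    """SE of the average within-person difference.

    The stable person-level component cancels when each person is
    compared with herself, so only the noise variance remains.
    """
    return sqrt(2 * sd_noise ** 2 / n_people)


# Hypothetical values: person-to-person variation three times the noise
sd_person, sd_noise, n = 3.0, 1.0, 100

se_between = between_person_se(sd_person, sd_noise, n_per_group=n)  # 200 people in total
se_within = within_person_se(sd_noise, n_people=n)                  # 100 people, measured twice
print(f"between-person SE: {se_between:.3f}")
print(f"within-person SE:  {se_within:.3f}")
print(f"ratio: {se_between / se_within:.2f}")
```

Under these assumptions the within-person comparison is several times more precise while using half as many participants. The binary outcome in the actual studies (wearing red or pink, or not) would need a different model, so treat this only as an illustration of the direction of the effect.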

Preregistered replication could help too but that’s not really the most important thing. I think any researchers in this area really need to focus first on studying large effects (where possible) and getting good measurements. There’s no point in trying to get ironclad proof that you’ve discovered something, until you first set things up so that you’re in the position to make real discoveries.


I don’t particularly care about fertility and choices of clothing, and I do feel that these researchers are shooting in the dark. I just spent three hours writing this (instead of working on Stan; sorry, everyone!) because (a) I do feel this example illustrates some important and general statistical points, and (b) Tracy and Beall are saying that Loken and I were wrong, so of course I want to clarify our points. Beall, Tracy, Loken, and anyone else can further clarify or dispute in the comments.

Or of course Eric and I could’ve lain low and just waited till our paper appeared, but as I wrote in the very first paragraph above, that didn’t seem fair. I wanted to link to Tracy and Beall’s note right away, so that they can get as large an audience for their statements as we have for ours. I can’t promise this treatment for everyone whose work I criticize, but when people go to the trouble of writing something and alerting me to it, I like to give them the chance to air their views. I’ve done this several times before and I’m sure will be doing it many times in this blog in the future.

Meanwhile, if you’re interested in scientific misunderstandings but you don’t want to read more about fertility and clothing choice, I recommend you go to Dan Kahan’s cultural cognition blog.

P.S. I have more to say about the role of statistics in scientific practice and the role of statistics in science criticism (the so-called “replication police”), but that really deserves its own post so I’ll stop here.

P.P.S. Just one more time: I have no desire to hurt Tracy and Beall in any way. In all sincerity, I applaud their openness, both in contacting me and in posting their views. For the reasons described in (exhausting and repetitive) detail above, I think they’re basically wrong, but I also think I can see where they are coming from. They followed standard practice and achieved the great success of publication in a top journal. When their work was criticized, they chose to defend rather than reflect, but, again, that’s a perfectly normal choice. And at this point it’s hard for them to go back. Especially given the traditional norms of statistical work in their subfield, it seems most natural for them to continue to think they did things right. Conditional on all that, Beall and Tracy seem to me to be behaving in a cordial, professional, and scholarly manner. (Not that I have any special status to judge this; I’m just stating my impression here.) Even their decision not to share their data seems reasonable enough from their perspective: Researchers don’t usually post raw data, and in this case they could well feel that Loken and I just want their data to do a hatchet job on them. That’s not the case—there were just a couple of things that Eric wanted to look into—but I can see how Tracy and Beall could think that, conditional on their continuing to believe in the correctness of the conclusions they drew from their published analyses. I don’t think it’s too late for them to sit, think, and see the problems—maybe a couple of preregistered replications on their part could help out, although I don’t think that should be truly necessary—but in any case I appreciate the cordial spirit that these discussions have taken.

P.P.P.S. In response to an email Tracy just sent me, I’ve added several clarifications in [brackets] above. Let me also emphasize that I had no intention of being offensive or defamatory of either Beall or Tracy in any way, nor did I intend in any way to question their integrity. I think they are following common research practices that are, in my view, mistaken, but this is not a comment on their integrity in any way.

In particular, I regret implying that Beall and Tracy “don’t care.” Their take on peak fertility is different from mine, but I can see how it makes sense for them to follow the literature in their subfield, even if from my perspective these are not the most authoritative sources on the topic. “Don’t care” isn’t a fair summary of “disagree on what is the most authoritative source on peak fertility.”

155 thoughts on “Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on our Garden of Forking Paths paper, and I comment on their comments”

  1. This seems a bit unfair. The authors can’t possibly defend themselves against the Freudian claim that they unconsciously chose what fork to take on the road to significance. And as someone who has done data mining I think it is unrealistic to expect that ANYONE would know the correct series of forks that would lead them to significance. With respect to the comment that the exact date of fertility may not matter, I agree so long as the strength of the effect is pulled from some distribution where the mode is at the peak fertility, for example. I guess a Bayesian approach may be called for here. Other than that, I feel compelled to reply for my first time on this site because I feel that you [AG] spend too much time with armchair psychoanalysis of these researchers. It makes me wonder what Freudian skeletons I would find in your closet ;-)

    • I’m confused–why do you think multiple comparisons have anything to do with psychoanalysis? The “garden of forking paths” has nothing to do with what’s going on inside a researcher’s head. If your statistical decisions as to what analysis to run are data-dependent (meaning that, had the data come out differently, you’d have run different analyses), there’s a garden of forking paths issue.

      • I guess the psychoanalysis charge comes out of quotes like: “They [Tracy and Beall] really seem to believe that they can discover enduring truths through short questionnaires of Mechanical Turk participants, and once they think they’ve had such success, they understandably are not happy with anyone [who] tries to take this away from them.” More generally, this quote highlights to me that much of this discussion focuses on understanding researchers’ motives, implicit or otherwise. For example, AG claims: “the data analysis would not feel like “fishing” because it would all seem so reasonable. Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the way.” And “In this garden of forking paths, whatever route you take seems predetermined, but that’s because the choices are done implicitly.”

        Ultimately, I see all these arguments by AG as assertions but I don’t see any factual data that confirms that these sorts of decisions actually happened – and affected the outcome of the study. The word Freudian came to mind because I began to wonder how you might test the hypothesis that these forking paths actually had an impact on the outcome of the study.

        Finally, AG states that “The mistake is in thinking that, if the particular path that was chosen yields statistical significance, that this is strong evidence in favor of the hypothesis.” I would go further and say that all this discussion is a red herring. Given the resources and funding available to most scientists these days, the scope of any single study can only be seen as one data point in favor or against the overall hypothesis that was tested. It still requires replication and the failure here in my view is not in the statistical choices but in a publishing system that actively discourages replication.

        • Anon,

          Is “Freudian” an insult these days? You seem to be using it as a synonym for untestable. I don’t know enough to have an informed opinion either way, just find it interesting is all.

        • I’m not sure it’s unfair to speculate on what happens in a researcher’s mind (whether conscious or not), given that the correct computation of the p-value depends on it.

    • “I think it is unrealistic to expect that ANYONE would know the correct series of forks that would lead them to significance”

      If you wander around in the data woods long enough, you will stumble across things that are statistically significant. That’s what Andrew is saying, and it’s pretty basic stuff that’s widely overlooked. If you’re going to have any credibility, you need to do one of three things:

      1. State your hypothesis clearly, design an experiment that excludes other possibilities, and then implement that experiment.

      2. Somehow compensate for the fact that you’ve wandered around and taken many forks in the road. There are a variety of ways to do this, ranging from correction factors to MCMC analysis.

      3. Admit that you’ve wandered in the woods until you found something, do not try to dress your wandering up in hypothesis-testing clothing, and be as explicit as you can in terms of the paths you took to get to your finding.

      Andrew’s objection, as I understand it, is that the authors of the Fertile-Pink study have not done any of these. Rather, they have chosen the common strategy of wandering in the woods, eventually finding what they wanted to find, then naively assuming that they could’ve taken a path directly to their finding and thus it’s the same as if they’d planned to go down that path all along. The same with your quoted statement, above.

      • I wish this website let you delete postings. I posted before having any tea, and basically repeated the obvious. Let me try again…

        “I think it is unrealistic to expect that ANYONE would know the correct series of forks that would lead them to significance”

        Well, you could simply try a single fork — the first fork you come to — to see if it makes a difference. Then another fork. Then another. You know, sort of the Thomas Edison quote to the effect that he hadn’t failed 1,000 times to make a lightbulb, but rather had identified 1,000 ways to not make a light bulb. Unfortunately, this doesn’t lead to instant fame, and it requires focus… it’s so 19th century.

        Or you could leave out the hypothesis testing and “significance” part, since you don’t actually have *a* hypothesis to test. That’s the price of admission for using hypothesis testing: having a pre-defined and clear hypothesis with an experiment designed to eliminate all other possibilities. It’s perfectly okay to do an experiment that uses a sample of convenience, and in which you test every possible color and weather combination — far exceeding the number of participants, even — and finally find something that works. Publish your “intriguing” and “suggestive” findings. You may become famous, and be thought very clever.

        There are some intermediate solutions between these two extremes that have some level of rigor (adjustment factors, MCMC experiments), and you can use data that someone else has gathered for, perhaps, other purposes. Each of these options has a cost and ultimately restricts the conclusions you can draw.

        The particular studies that Andrew is criticizing happen to be structured in a way that makes it particularly easy for us to imagine and illustrate some of the many forks that were either never considered or were brushed under the rug. If they’d settled for my second option (“intriguing”, “suggestive”) and skipped the whole “statistical significance” part, Andrew would simply be a pedant. They didn’t, and he isn’t.

      • I disagree that Andrew is saying this – he repeatedly has said that he isn’t accusing them of wandering around in the data woods. And so I am perplexed as to what he thinks the authors have done incorrectly other than somehow subconsciously guiding their analyses along a path that will yield significance. Which is why I continue to insist that these are Freudian speculations. It’s possible that the peak fertility dates used by the authors are the smoking gun Andrew is using, but then again he specifically says he takes them at their word that this was an honest choice based on their lit review.

        • Anon:

          As Eric and I discuss in our linked paper, there are several parts in Beall and Tracy’s analyses that involve degrees of freedom that were not pre-specified. Given that many of these decisions clearly came after the data appeared, and given that other similar papers (by Beall and Tracy and by others) had similar forks in the road and chose different paths, it seems reasonable to consider these choices to be contingent on data. Again, in a given paper a researcher may follow only one path and that can seem pre-determined, but the choices depend on what data the researcher saw. These choices could be conscious or subconscious, I don’t really care. The point is that the p-value is a statement about what would’ve happened had the data been different, and I have no reason to believe that the authors would have done the same things had the data been different. I accept that Beall and Tracy’s data-exclusion rules and data-analysis rules are consistent with their substantive hypotheses, but many other rules would also be consistent. This relates to something we’ve discussed on the blog from time to time, the distinction between a scientific hypothesis (which is almost always somewhat vague, especially in the human sciences) and a statistical hypothesis (which is almost always highly precise).

        • Andrew,

          I think Anon’s confusion (and mine) is about how a single analysis can be data contingent in a meaningful way. You can’t just look at a spreadsheet of numbers and know that a certain analysis is more likely to yield a significant p-value than another. It would seem that you have to analyze the data somehow before the analysis you choose can be data contingent. Maybe the initial analysis takes the form of graphical data exploration, for example.

          As I write this, I realize that the word ‘analysis’ is being used in two ways. Sometimes it refers to the start-to-finish handling of the data, and sometimes it refers to the final regression model (or whatever) that you fit. I think you’re saying that the pre model fitting stage of a start-to-finish analysis (even one that contains only one model fitting) can be used to choose a model that seems likely to produce a low p-value.

          This is clearly true, but it seems hard to choose the most promising model settings intentionally and even harder unintentionally! So I still don’t see how your critique applies without some sort of intention on the part of the researchers. Maybe their p-hacking occurs in the pre-model stage of the analysis and they only run one model, but it’s still p-hacking.

        • Zach:

          I don’t think you need intention to cheat or even intention to select—you just need intention to do good science. I’ll try to describe this generically to not be involved in the details of the Beall and Tracy study (see the linked paper with Loken for examples in that context). Suppose you are interested in the hypothesis that variable x is strongly associated with outcome y. Then, as your data are coming in, you see someone with a high value of x but a low value of y. That doesn’t look right! Then it turns out that the person looks odd in other ways, for example he gave what seems like a joke response to another question. So you throw him out of the data on the grounds that he was not giving serious responses. This could be fine, but of course it can bias things because you might not throw out a respondent who fit your story. But it doesn’t feel like fishing or anything like that. It’s just a data-dependent selection rule.

          Similar things arise in statistical analysis. Eric and I talked a lot in our paper about the choice of whether to focus on main effects or interactions. This is something we see a lot. A research team studies a phenomenon, I suspect they’re looking for a main effect. The main effect isn’t there, or it’s not statistically significant—but of course we all know that “statistically significant” doesn’t mean “zero.” Or maybe the main effect is statistically significant. In any case, the researchers then notice a statistically significant interaction—a difference between partnered and unpartnered women, or a difference between people of low and high SES, or a difference between two different weather conditions. The researchers realize that this interaction fits their theory, so that’s what they report.

          What happened there? They weren’t fishing around, they weren’t cheating, they just saw some patterns in the data. It feels like “discovery,” indeed in some sense it is discovery, but the math of statistical significance is misleading them, and they don’t realize that what they saw could well have arisen from chance alone. Then someone like me comes along and says they had a multiple comparisons problem, and they reply: No, we don’t! We only did that one analysis! And I say, No, your analysis is contingent on data . . . hence this paper!
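
A small simulation can make this scenario concrete. The R sketch below is illustrative only: the sample size, the moderator, and the decision rule are assumptions invented for the example, not anything taken from Beall and Tracy's study. The outcome is pure noise, and the hypothetical analyst reports the main effect if it is significant and otherwise reports whichever subgroup comparison is.

```r
# Illustrative sketch: every name and setting below is an assumption.
set.seed(123)

one_study <- function(n = 200) {
  x <- rbinom(n, 1, 0.5)  # hypothetical predictor (e.g. inside vs. outside a window)
  m <- rbinom(n, 1, 0.5)  # hypothetical moderator (e.g. partnered vs. not)
  y <- rnorm(n)           # outcome is pure noise: no effect of x anywhere

  p_main <- t.test(y[x == 1], y[x == 0])$p.value
  p_m0   <- t.test(y[x == 1 & m == 0], y[x == 0 & m == 0])$p.value
  p_m1   <- t.test(y[x == 1 & m == 1], y[x == 0 & m == 1])$p.value

  c(prespecified = p_main < 0.05,
    contingent   = min(p_main, p_m0, p_m1) < 0.05)
}

# Share of null studies that end with a "significant" finding under each rule
rates <- rowMeans(replicate(2000, one_study()))
print(round(rates, 3))
```

With these settings the pre-specified test should reject about 5% of the time, while the data-contingent rule should reject noticeably more often, even though each simulated analyst ends up reporting only one comparison.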

        • Perhaps it’s semantics, but I call what you describe above as “fishing” and that is the problem I have with your critique – you provide no real evidence that T&B actually did any of this. The one example of them adding an interaction term to the analysis in a followup study doesn’t really count because the authors acknowledged it as post hoc and thus exploratory. Rather than making strong inference based on it they used that result to design a followup study. All that seems perfectly legitimate and not what you seem to describe above.

        • The basic issue you have here seems to be very poorly articulated. Your fundamental problem appears to concern how strong the authors take their evidence to be. This whole “multiple comparisons without actually making statistical comparisons” thing is an esoteric approach to the basic premise that one or two studies never amount to strong evidence of anything. The same would hold true even if there were no “forking paths” issues, because there are always numerous potential alternative explanations for why the results might not replicate (e.g., convenience sampling, questionnaire wording).

          In my view, “forking paths” is nothing more than a minor note of the realities of dealing with actual data that don’t come from a simulation. Yes, there is a potential threat to your statistical validity, but that threat is not new. How do you decide if a case is an outlier that should be omitted from your analysis? How do you analyze highly skewed data if you weren’t anticipating that degree of skew in the first place? How do you deal with data that strongly appear to be attributable to error (e.g., equipment malfunction, operator error, respondents who write jokes on your survey form)?

          “Forking paths” is nothing more than “decisions you make when faced with real world data that may or may not ultimately change the outcome of the results.” You can make the exact same accusation of almost every study of real data that has ever existed. It’s not new, it’s not unique to these studies, and it isn’t even a particularly interesting reason why one or two studies do not add up to strong evidence of a phenomenon. It’s a weak, esoteric argument easily supplanted by more compelling methodological issues.

        • Brandon, the point is that the theory of p-values is affected by the usual scientific praxis. Andrew does not say, as far as I understand, that we should *not* make decisions contingent on the data. Rather that we should be more honest (also to ourselves) and open about the process and use methods adequate for it.

        • Brandon:

          I do not claim that our paper is earth-shaking. You could think of it as a minor addendum to the (justly) influential paper by Simmons, Nelson, and Simonsohn on p-hacking. Beyond adding a bunch of examples, our only contribution beyond that earlier paper is to emphasize that major multiple comparisons problems can arise from analyses contingent on data, even if researchers only perform a single analysis (or only a few analyses) on the particular data they saw.

          The commenter Anonymous above writes that we “provide no real evidence that T&B actually did any of this” but again this indicates a confusion on the commenter’s part because p-values are not about what a researcher did, they’re about what a researcher would have done had the data been different.

          From my perspective, the terms “fishing” and “p-hacking” are counterproductive because (a) they imply active behavior on the researcher’s part, which as we discuss is not necessary for there to be major multiple-comparisons issues, and (b) they motivate a sense of conflict and defensiveness. It’s not about me providing evidence that someone did something wrong; it’s about routine statistical practice being contingent on data, something that anyone who’s ever analyzed data should be aware of. Again, I recommend Nosek et al.’s “50 shades of gray” paper for a clear example.

          Anyway, from a statistical perspective our point is not deep; as you put it, it’s a “minor note,” a clarification of a bit of confusion that I think has arisen from terms such as “p-hacking” and “fishing.” We’re building on the important work of Simmons et al. We wrote the paper to clear up a misunderstanding, although obviously we should try to do a better job because this confusion still exists in the comments, with people thinking we’re accusing people of fishing, etc etc.

        • Responding to Andrew’s comment:

          “The commenter Anonymous above writes that we “provide no real evidence that T&B actually did any of this” but again this indicates a confusion on the commenter’s part because p-values are not about what a researcher did, they’re about what a researcher would have done had the data been different.”

          True but it still follows from your claim that these researchers make choices contingent on their data which remains speculation on your part. Showing that the authors made choices about which subjects to include is not the same thing as showing that the authors made choices contingent on the data (i.e. selection criteria could have been made prior to the experiment or without any access to the rest of the data).

    • Anonymous:

      There’s nothing Freudian in my post at all.

      But, speaking more generally, sure, I could’ve written my whole post without speculating about motivations. Indeed, had my goal merely been to win an argument regarding Tracy and Beall’s claims that the results in their sample would generalize to the larger population, it would’ve been cleaner for me to just stick with the facts and not speculate at all.

      But my purpose here is not to win an argument with Tracy and Beall, as their claims are so clearly mistaken. At this point I’m more interested in the question of how this sort of work gets done, and how is it that researchers, when pointed to such an obvious flaw as getting the dates of peak fertility wrong, don’t go back and redo their analysis with that in mind? For that, I think the speculation is helpful.

      In any case, the speculation is clearly labeled as such, so feel free to ignore it if you’d like.

      • The issue I have is that you seem to imply one thing but say the other. You emphasize that you aren’t accusing them of “fishing” but the whole point of their choice of fertility estimate seems to hint that they chose these dates because those yielded significance. I also don’t see why you have such a hard time accepting that the exact dates may not matter. Let’s say the probability of wearing pink/red is exactly proportional to fertility. The following is consistent with what you say the experts claim:

        fert <- dnorm(c(0:28), 13.,2)

        A comparison of the values for the dates they used suggests that the choice isn't as critical as you suggest.

        [1] 0.08618806
        [1] 0.01180566

        Maybe I'm missing something here but is there any researcher that could defend themselves against these charges without explicitly laying out their data analyses first? I know you recommend that approach but as I said I would prefer to see more incentive for researchers to replicate results. Have you seen this:

        • Anon:

          Eric and I definitely need to clarify things in our paper because the point apparently still isn’t clear! We are neither hinting nor saying that Beall and Tracy chose those dates because those yielded significance. I suppose maybe we should simply add a sentence to the paper, “We are neither hinting nor saying that Beall and Tracy chose those dates because those yielded significance,” in order to make the point unambiguously clear. What we’re saying is that, had the data looked different, for example with a clear difference on days 10-17, Beall and Tracy might well have used a different window. I don’t know what they would’ve done, but it’s possible. And, using a different window would’ve been a reasonable choice—after all, that would be consistent with the information provided by the U.S. Department of Health and Human Services.

          Regarding your numerical point, let me just say that if the signal were so strong, Beall and Tracy would be in great shape! In your example, the direct comparison of mean(fert[c(10:17)])/mean(fert[-c(10:17)]) is 49, and the noisy comparison of mean(fert[c(6:14)])/mean(fert[-c(6:14)]) is 3.3. (For some reason, when I run your code in R, I don’t get the same numbers that you report, but the orders of magnitude are the same.) So, sure, if women are 49 times more likely to wear red shirts during that fertility window, then you’ll still get a ratio of over 3 when you use that wrong window. But there’s no way the ratio could be anything nearly that large. Any true effect will be small, and I think it will be essentially impossible to detect using the sort of between-person comparison they are doing. Beyond that, there’s no point in degrading your signal even more by measuring the wrong days.
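
For readers who want to check these two ratios, here is a short R sketch. It assumes the curve is exactly the one defined in Anon's comment, dnorm(c(0:28), 13, 2); the helper name window_ratio is made up for the example.

```r
# Anon's toy curve: relative "fertility" on days 0..28, peaked at day 13 with sd 2.
fert <- dnorm(0:28, mean = 13, sd = 2)

# Hypothetical helper: mean of the curve inside the window over the mean outside it.
window_ratio <- function(curve, idx) mean(curve[idx]) / mean(curve[-idx])

r_centered <- window_ratio(fert, 10:17)  # about 49.9 (quoted as 49 in the comment)
r_shifted  <- window_ratio(fert, 6:14)   # about 3.33 (quoted as 3.3)
print(c(r_centered, r_shifted))
```

One detail to keep in mind when reading these numbers: R vectors are indexed from 1, so fert[10:17] picks out days 9 through 16 of the 0..28 grid rather than days 10 through 17.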

          Regarding your last paragraph: As I noted in some comment or another here, it would be fine for Tracy, Beall, or others to try an exact replication here but I suspect it would be a waste of time until more effort is taken to do more accurate measurements. Again, I’m contesting the implicit model that a researcher is likely to discover enduring truths through this sort of quick survey. It’s possible but I think much much less likely than people seem to believe.

        • Well the specific example was designed to be the extreme case. Take a more realistic scenario with a flatter curve (I increased the SD). This gives a more realistic odds ratio using the correct fertility dates but my point remains that the choice of fertility dates may not be so critical:

          > fert <- [...]
          > mean(fert[c(10:17)])/mean(fert[-c(10:17)])
          [1] 3.436049
          > mean(fert[c(6:14)])/mean(fert[-c(6:14)])
          [1] 2.638908

          Obviously these data are deterministic but at least it focuses the discussion on how the choice might empirically affect the results of significance tests.

          Finally I think you draw a distinction without a difference when you state that “We are neither hinting nor saying that Beall and Tracy chose those dates because those yielded significance … What we’re saying is that, had the data looked different, for example with a clear difference on days 10-17, Beall and Tracy might well have used a different window.”

        • Anon:

          1. I do not think your example with an sd of 5 gives realistic odds ratios. Recall that these are intended to be differences across the entire population (including women who never wear red clothing, women whose clothing choices are determined externally, etc). It is hard for me to imagine a world in which the aggregate ratio is 3 or even 2. Such numbers look realistic only in the context of Beall and Tracy’s published numbers, but those numbers suffer from what we call the statistical significance filter: with a noisy study, any estimate that happens to be statistically significant will necessarily be huge.

          2. There is no doubt that using a bad window will attenuate the effect; that is, it will reduce the power of the study, yield a lower effect size, and increase type M and type S errors. One of my points is that when you’re already using crude measurements, it’s a mistake to throw away information in this way.

          3. Beall and Tracy already have their data. Say they made a mistake and used a window of dates that does not correspond to peak fertility. Fine, we all make mistakes. But if peak fertility is what they care about (and they have been clear throughout that their hypotheses were theoretically motivated), then I’d think it would make sense to shift to days 10-17. I don’t see the point of continuing to measure the wrong thing, just because you did before. Not if the data are all there already.
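
The statistical significance filter mentioned in point 1 can be illustrated with a short simulation. This is a sketch under made-up numbers: a true effect of 0.1 measured with a standard error of 0.5 (both values are assumptions chosen only to represent a noisy study), with estimates drawn from a normal distribution.

```r
# Illustrative sketch: true_effect and se are invented values for a noisy study.
set.seed(42)
true_effect <- 0.1
se          <- 0.5

est <- rnorm(1e5, mean = true_effect, sd = se)  # estimates from many replications
sig <- abs(est) > 1.96 * se                     # those reaching p < 0.05 (two-sided)

power  <- mean(sig)                             # how often the study "succeeds"
type_m <- mean(abs(est[sig])) / true_effect     # exaggeration ratio (type M)
type_s <- mean(est[sig] < 0)                    # share with the wrong sign (type S)

print(round(c(power = power, type_m = type_m, type_s = type_s), 3))
```

Because an estimate must exceed 1.96 standard errors to be significant, every significant estimate in this setup is at least 9.8 times the assumed true effect in absolute value, which is the sense in which a significant result from a noisy study will necessarily be huge.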

        • 1. Well if not 5 for the SD then how about 10? I don’t know what the real OR is but note that as the OR of the true effect goes down (i.e. bigger SD around the peak of wearing red) their choice of fertility window becomes increasingly unimportant (at least for the simplistic model I presented).

          2. Of course I agree with you in principle; it’s just that I am not convinced the criticism is particularly damning.

          3. If they switch the window wouldn’t this be taken as evidence of forking? That said, I agree that the authors should use the best available estimate for fertility.
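
The claim in point 1, that the choice of window matters less as the curve flattens, can be checked directly in the same toy model. The R sketch below assumes a normal curve centered on day 13 for every SD; that centering and the SD grid are assumptions, so the ratios need not match the figures quoted earlier in the thread.

```r
# Illustrative sketch: the peak day (13) and the SD grid are assumptions.
window_ratio <- function(sd, idx, peak = 13) {
  curve <- dnorm(0:28, mean = peak, sd = sd)
  mean(curve[idx]) / mean(curve[-idx])
}

sds <- c(2, 5, 10)
res <- data.frame(
  sd        = sds,
  idx_10_17 = sapply(sds, window_ratio, idx = 10:17),  # window around the peak
  idx_6_14  = sapply(sds, window_ratio, idx = 6:14)    # earlier, shifted window
)
res$relative_loss <- res$idx_10_17 / res$idx_6_14      # how much the shift costs
print(res)
```

Both ratios shrink toward 1 as the SD grows, and so does the gap between them. That is consistent with Anon's point, and also with Andrew's, since a flatter curve means there is less signal to detect with either window.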

        • “What we’re saying is that, had the data looked different, for example with a clear difference on days 10-17, Beall and Tracy might well have used a different window.”

          How is that not p-hacking? It seems like you’re saying if the data had indicated that a low p-value could be achieved if the window were set to 10-17, then Beall and Tracy might have set the window to 10-17.

          If post-data analysis decisions don’t generally lead to lower p-values, why worry about them? (I know post-data analysis decisions still mathematically invalidate p-values, but you’re not mentioning overly conservative p-values very much as a problem, so it seems like you think post-data analysis mainly leads to lower p-values.) What mechanism other than researcher desire for low p-values makes post-data analysis decisions disproportionately lead to lower p-values?

        • Andrew, the selection point is clear. It doesn’t matter how many disclaimers you add, you are still going to come across as picking on these researchers for no particularly good reason. If you had chosen fictitious authors and a fictitious study, you’d probably be getting a different response. Frankly, a fictitious study would have been more appropriate since you keep postulating fictitious scenarios in which the authors behave badly (although you pretend that because they may be ignorant of their bad behavior, it’s less of an attack on them).

          “I am not hinting nor saying that Gelman fabricated large amounts of polling data, but had his data looked different, he might have fabricated large amounts of data to get these results. I don’t know what he would have done, but it’s possible.” Disclaimers don’t make everything better.

        • Brandon:

          I never suggested that Beall and Tracy fabricated any data. Nor do I say that Daryl Bem fabricated any data. Nor do we say in our article that Petersen et al. fabricated any data, etc.

          What I did say was, had their data looked different, I think all these researchers could’ve used different analyses. Eric and I talk in detail about this in our article. You might prefer to work with made-up examples and fake data. That’s fine. I prefer to talk about real examples. That’s what works for me. It’s hard for me to discuss these things entirely in the abstract.

          What I’m saying is that, for all these researchers, their data processing was contingent on the data. And for all the cases, Eric and I offer examples of idiosyncratic data exclusion, coding, or analysis rules that would be hard to imagine were pre-chosen.

          Doing data analysis contingent on data is not a negative thing! As I’ve also written in this thread, I think that in every applied problem I’ve ever worked on—certainly every serious applied problem I’ve ever worked on—my data analysis has been contingent on data. But I do think it makes a difference if you’re leaning on p-values as your evidence.

          So, yes, I’m negative on the work of Beall, Tracy, Bem, etc.: Despite their publication of statistically significant p-values, I don’t believe they’ve offered good evidence that the patterns they see in their samples generalize to the population. I don’t believe that, in the childbearing-age female population at large, women are 3 times more likely to wear red or pink shirts during certain days in their cycle, etc.

          But I have no reason to think they fabricated data, and I’m not accusing them of scientific misconduct. What I think is that they (and Bem, etc.) have misunderstood a subtle principle of statistics. P-values are well known to be tricky to teach and to understand, and it’s no surprise that even the most conscientious researchers can be tripped up.

          Finally, it’s not a “disclaimer” for me to say that I have no reason to think that Beall and Tracy were fishing for statistical significance or “behaving badly.” As I’ve explained, they’re working in a standard statistical paradigm that I believe is mistaken. I don’t think they’re “behaving badly” any more than I think that Francis Galton was “behaving badly” when he published a graph showing that there were 9-foot-tall men living in England. Galton made a mistake, that’s all, and I’m saying that Beall and Tracy have made a mistake, a mistake that’s unfortunately very common to make given the unfortunate implications of terms such as “fishing” and “p-hacking.”

          Finally, I think it’s rude of you to write that I’m “pretending.” I’m writing what I believe, I’m doing it for free, and I’m doing this as part of the “service” portion of my job. I’m working hard, even going in here and responding to comments, because I think it’s worth it to try to clarify tricky statistical points. Everything I write on this blog is sincere (except of course for the cases where I’m making jokes, which I’m not doing here at all on this thread). I’m not “pretending” anything, and if you want to believe that, you might as well go somewhere else and not bother reading what I write.

          P.S. I know this comment is long and it’s a bit ridiculous for me to spend my evening responding to blog comments, but I do think this point is important! We’re not accusing Beall and Tracy of scientific misconduct! We’re saying that they’re doing standard scientific conduct, but that this standard conduct creates major problems with p-values.

        • I find it interesting and telling that many people post as Anonymous here (to be fair, I have not scrolled through comments to other posts on this blog to see if this tends to be the norm). But particularly in light of the recent “bullying” accusations elsewhere, as well as the sentiment that people are afraid to speak out (e.g., to criticize anything about the replication movement or individual studies) for fear that their research will be gone after next (see, e.g., the discussions in the ISCON group on Facebook), it is interesting that those posting anonymously tend to be those speaking to the perceived unjust nature of the critique of this paper. I have not posted before and in the interest of full disclosure I should say that I am a colleague of Jess Tracy’s, although I have never discussed her research with her or even seen her give a talk (do not judge: we have a large department!). I wanted to write a signed comment to state my agreement with the posts by the Anonymous’es. This sort of thing is in the eye of the beholder, but perhaps the tone of the posts and the paper should be re-thought if enough people find it problematic.

          I think it is incredibly easy to find examples of papers where the researchers *do* do different analyses in light of the data not fully supporting their hypotheses, or where multiple comparisons abound. And I do see the point that a multiple comparisons problem is there even if only one analysis (by this I presume we mean a single statistical test) was performed, if some other data peeking occurred. But because the original authors went to the trouble of replying to all of the “accusations,” it is difficult to continue to argue this without accusing the authors of cheating or lying. They seem to have addressed most of the avenues along which data-dependent analysis would have been possible. They claim to have had full commitment to the window and have used it in previous research (I am not qualified to comment on which window is correct so I’ll leave this one), and to the choice of the two colors together (red/pink). They also analyzed the data with and without excluding certain participants, with similar results (see the original paper and their responses for more detail). When reading through Andrew’s examples of “they could have done different analyses had the data been different”, there are only a few more that remain: a pattern found in one group but not the other could have been interpreted as supporting the hypothesis too, and a result for grey instead of red/pink could have been interpreted as supporting the hypothesis too. But we have data showing that it seems unlikely that these researchers would have done either, given that their follow up article illustrates they understand the relevant issues (e.g., the necessity to collect another sample to test an interaction that was found but not predicted in the original sample; the necessity to replicate a finding, and to publish the result even when it is null). It seems natural to suppose that researchers vary in their understanding of the issues surrounding multiple comparisons or data peeking, and that some engage in these practices with abandon while others proceed with caution.

          Other issues have been brought forward to criticize the paper (e.g., small samples, design choice, measurement, etc). But the Gelman & Loken paper is about (potential) multiple comparisons, or choosing analyses having peeked at the data and then making strong confirmatory claims! If this is the gist of the paper, I feel that one ought to have some evidence that the authors had a tendency towards this other than “this is always done and so these authors must tend to do this as well” and “this is a sensationalist and highly unlikely to be true a priori claim published in Psych Science”, or else one ought to use hypothetical examples to illustrate points.

          My comment has nothing to do with the believability of the results themselves based on the data available. I would probably side with the skeptics on that as well. My claim is only that a specific accusation is being made as to the authors’ regular practices for which there is no evidence, and this accusation, in my mind, does have the potential to do damage and thus should be made more responsibly.

        • Replying to Victoria: There is the issue of Tracy and Beall including a large sample of data that they originally planned not to include (look at footnote 1 in the original paper). The fact that the result of the analysis is qualitatively the same when these data are or are not included is irrelevant – it also doesn’t matter why the authors chose to include these data – it represents unequivocally a fork in the analysis and automatically changes the inference that can be made from the reported p-value.

          FYI I did write posts defending the authors because some of the other claims are pretty hypothetical. To me Gelman’s paper would be much stronger if he focused on this point alone.

        • Victoria: Well said.

          Question for Andrew & Anonymous: Regarding the sample, what if Beall and Tracy didn’t expect that so many women would meet the exclusion criteria (22% and 38%)? If they only looked at the data with respect to the exclusion criteria, and only to assess sample size (and thus power), would that still invalidate the p-value for the odds ratio?

          The rationale for the exclusion criteria was to “minimize the inclusion of women for whom effects might be attributable to menstrual or premenstrual symptoms.” So, yes, the sampled population is different, and that alters the inference one would make from this study (e.g., women suffering menstrual symptoms are *less* likely to wear red shirts), but would the p-value itself be OK?

          Finally, Andrew, you say:

          >No shaming going on here, and no singling out. Indeed, if you click through to our paper, you’ll see a discussion of several papers that have similar multiple comparisons problems.

          Gelman and Loken critique 4 studies. Three of them are evolutionary psychology (Durante et al., Peterson et al., and Beall and Tracy), and one is on ESP (Bem), a study that has already attracted a lot of criticism. The probability that you would get 3 ev psych papers with independent draws from the many articles published in the last year that (mis)use NHT is pretty low (p < 0.01, I'd say).

          If you're trying to argue that something is wrong in ev psych research, statistically speaking or otherwise, you should make that explicit.

          My two cents.

        • To Ed,

          As I said the reason doesn’t matter – the bottom line is the criteria for inclusion changed after the data were gathered. I guess in the scenario you describe where the only access to the data is sample size and the fraction of women that didn’t fit their criteria, and if the authors somehow registered that they will now change their inclusion criterion and will stick with this decision regardless of the outcome of the analysis, then perhaps it would be OK.

          I disagree that the fact that 3 articles are EV psych papers shows Gelman is “picking” on these authors or the field. As he said earlier he picked papers that had received some coverage in the press and so it’s not to be taken as a random sample. It can be seen as evidence that EV psych research attracts a lot of attention from the public at large and this is a good thing, I guess.

          That said, I really do think that Gelman would be well served to focus on the example where there is clear evidence of forking (the inclusion criteria) and avoid the other speculative comments he made (red/pink etc). I also think it would probably be more appropriate for him to focus the critique on the paper that Tracy and Beall cite, which claims that best practice is to report the results both with and without the data included. Clearly this doesn’t solve the problem, for the reasons Gelman outlines.

          FWIW, I do think it would be possible to simulate the effect of this kind of data forking for inference on a case by case basis to develop an adjusted P-value.
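
          A minimal sketch of what such a simulation might look like, under loud assumptions: the outcome is a unit-variance normal score rather than the study's binary shirt color, the sample sizes are invented, and the analyst is assumed to report whichever of the two analyses (planned sample only, or planned plus originally excluded sample) gives the smaller p-value. None of this is claimed to describe what any author actually did; the names `simulate_null` and `adjusted_p` are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_stat(x, y):
    """z statistic for a difference in means (unit variance assumed known)."""
    diff = sum(x) / len(x) - sum(y) / len(y)
    return diff / math.sqrt(1 / len(x) + 1 / len(y))


def simulate_null(n_sims=4000, n_main=50, n_extra=30, seed=1):
    """Return (planned p, forked p) pairs when the true effect is zero."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        # The first n_main values are the planned sample; the rest stand in
        # for data that were originally meant to be excluded.
        g1 = [rng.gauss(0, 1) for _ in range(n_main + n_extra)]
        g2 = [rng.gauss(0, 1) for _ in range(n_main + n_extra)]
        p_main = two_sided_p(z_stat(g1[:n_main], g2[:n_main]))
        p_all = two_sided_p(z_stat(g1, g2))
        # Forked analysis: keep whichever p-value is smaller.
        results.append((p_main, min(p_main, p_all)))
    return results


def adjusted_p(observed_p, null_results):
    """Share of null simulations whose forked p-value is at least as small."""
    hits = sum(1 for _, p_fork in null_results if p_fork <= observed_p)
    return hits / len(null_results)


sims = simulate_null()
planned_rate = sum(p <= 0.05 for p, _ in sims) / len(sims)
forked_rate = sum(p <= 0.05 for _, p in sims) / len(sims)
print(f"planned analysis false-positive rate: {planned_rate:.3f}")
print(f"forked analysis false-positive rate:  {forked_rate:.3f}")
print(f"adjusted p for a reported p of 0.05:  {adjusted_p(0.05, sims):.3f}")
```

          The adjusted value is only as good as the assumed model of the fork. A different rule for choosing between the two analyses would need a different simulation, which is why this would have to be done case by case.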

        • Ed:

          The issues that Eric and I discussed are not special to evolutionary psychology. Much has been written about the propagation of mistaken research claims in medicine (consider the paper by Ioannidis) and neuroscience (consider the papers by Vul and Pashler, and by Button et al.), and in other papers I’ve discussed similar problems with the interpretation of statistical results in other fields such as education policy and public health.

          Why are 3 of our 4 examples in evolutionary psychology? I see two reasons. First, this isn’t quite an independent sample of size 4. The examples have many similar features and I think the central point of our paper is strengthened by the inclusion of several different papers with similar structures that make different decisions on which interactions to consider in their comparisons. Second, all of these are examples which were pointed out to me. Evolutionary psychology has some conceptual links to political science (in that it can be viewed as an alternative way of explaining the world) and maybe that’s one reason people are more likely to point me to articles on that topic, rather than to papers with similar problems that appear in medical research or neuroimaging.

        • Brandon, studies with implausibly large effects, low-power, and bold claims stand out. The issue is not research malpractice. The issue is over-confidence in the claims. It is also not whether the effect is “real” or not. Even if God came down tomorrow and said “Yes, women are more likely to wear red while ovulating” it would not mean Tracy and Beall were “right” and Gelman and Loken were “wrong”. The issue is whether the current data and analysis actually provide much information to affect prior beliefs. I’m willing to believe the effect is “real” and may be on the order of an odds ratio slightly greater than 1. Who knows? A low power study that produces a huge effect actually does very little to change prior beliefs in this case. And the appeal to statistical significance doesn’t help, especially if the p-values underestimate the probability of extreme results.
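
          The point about low power can be illustrated with a small simulation. This is a sketch with made-up numbers: a true effect of 0.1 and a standard error of 0.2, in arbitrary units, which are assumptions for illustration and not estimates from any of the papers discussed. It asks how large the estimates are among the runs that happen to reach statistical significance.

```python
import random


def exaggeration(true_effect, std_error, n_sims=20000, seed=2):
    """Simulate estimates around a true effect and summarize the significant ones."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        # Conventional two-sided test at the 5% level.
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)
    power = len(significant) / n_sims
    # Average size of a significant estimate relative to the true effect.
    ratio = sum(abs(e) for e in significant) / len(significant) / true_effect
    # Share of significant estimates that point in the wrong direction.
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return power, ratio, wrong_sign


power, ratio, wrong_sign = exaggeration(true_effect=0.1, std_error=0.2)
print(f"power: {power:.2f}")
print(f"average exaggeration among significant results: {ratio:.1f}x")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

          Under these assumed numbers, any estimate that clears the significance threshold has to be several times larger than the true effect, so a "huge" significant estimate from such a design says little about the true size.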

        • Eric:

          If not God, what would it take for you to accede that “Tracy and Beall were ‘right’ and Gelman and Loken were ‘wrong’”?

          What I’m wondering is this: Is the Gelman-Loken thesis falsifiable?

        • Rahul,

          It seems you take the Gelman-Loken thesis to be about whether women are more likely to wear pink while ovulating, an empirical claim and thus falsifiable.

          Based on my reading, their thesis is actually about the evidential support a (data, analysis) pair provides for claims like these. That’s about methodology, statistical theory, the scientific method, about what evidence is.

          If God came down and said “Wrong, Gelman-Loken, this data and analysis actually does provide much more evidential support for the empirical claim than you say!” I’d be baffled and would hope God elaborates on why this is the case.

        • @aonon

          No. I take the Gelman-Loken thesis to be about whether a certain paper is or isn’t guilty of the garden of forking paths pathology.

          In particular, they are saying that the Tracy-Beall paper is guilty of this pathology. Which is why I’d love to know what precise tests to apply to make the multiple comparisons accusation.

          And once so accused, what could Tracy-Beall offer, that’d be an acceptable defense.

          I get the impression that the forking paths themselves are ubiquitous in studies, but we choose to apply the critique selectively to (a) papers whose conclusions are wacky or we disagree with and / or (b) papers that use p-values.

          (b) is especially interesting: the Gelman-Loken critique superficially seems critical of the forking but seems fundamentally against p-values themselves.

  2. Why do you object to the bit about “put them in contact with the editor of the journal who is handling our paper.”

    Like other things, it may not be the requirement nor the norm but it cannot hurt, can it? I found that part incongruously defensive for someone like Andrew.

  3. You do not believe them when they say that no matter how the data came out they would not have done the peak fertility analysis differently. Then you express disbelief that they won’t take your advice to do the peak fertility analysis differently.

    • Sanjay:

      I don’t see where I expressed disbelief. Researchers are free to report whatever data and analysis they like. But I’m free to then interpret their claims accordingly. I do think it’s bad that they wrote a paper about peak fertility but used the wrong dates for peak fertility.

  4. “but I just happen to be a middle-aged guy with a middle-aged wife so I noticed something funny”

    You made me laugh :-b

    Re forking paths what I like to say is that one of the hardest things in science is creating the conditions for Nature to speak, unhindered by the researcher. That is what a great research design does. This spans more than randomization. It includes sound concepts, measures, instruments, design, implementation, analysis, and inference.

  5. Naive question: What’s the fundamental reason a within-person study is better at this specific task than a between-person study? Are those measurements inherently less noisy? Or are there fewer confounding factors? Or….?

    Are there any noteworthy hypotheses where a within-person study gave grossly different (better?) results than previous between-person studies? I.e., are we only theorizing that within-person studies are the fix we are looking for, or is there empirical evidence for this?

    Maybe this reflects my ignorance: Is it widely accepted in the field that within person studies are the gold standard in such cases?

    • The hypothesis is specifically about within-person behavior change as the fertility cycle changes. Therefore, the ideal data would be to observe a time series of behavior as a function of ovulation cycle. Only under specific conditions (homogeneity of units, and stationarity of the process) can population between-person covariance structure be assumed to reflect the within-person covariance structure. For this particular example, once you imagine an ideal dataset of N individual time series of length T, you immediately realize that the process must be widely heterogeneous across women (remember the 27 year age span of the sample if you want to think of stationarity, and individual differences in tastes and wardrobes and mating strategies if you want to think about heterogeneity).

      Reflecting on the hypothetical within-person time series data makes you realize (a) that the study as designed must have very low power, (b) that the potential moderators are many and person-specific, and (c) that observing an effect size of 3:1 odds due to fertility cycle in any individual time series would be implausible, let alone claiming that effect as the *population* effect in the initial study. In this debate, the relevance of mentioning within-person designs is to highlight the mismatch of the level of analysis and the thought experiment of what a plausible effect size would be.

    • Take the simplest case: two groups, or two time points, being compared. In a between-persons analysis you’re comparing two samples (group one vs. group two) to each other, which means you have two sources of variability (and two corresponding standard deviations) that contribute ‘noise’. In a within-persons design each person serves as their own control, and the analysis asks whether the sample of difference scores (the differences between each person’s score at time 1 vs. time 2) differs from zero, which means you have one source of variability (and one corresponding standard deviation) that contributes ‘noise’. Consequently, within-persons designs are more powerful, i.e., positioned to have a higher signal-to-noise ratio, assuming there’s a signal to detect.
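
      The comparison above can be sketched as a simulation. The model here is an assumption for illustration: each person has a stable baseline (spread `sd_person`), each measurement adds independent noise (spread `sd_noise`), and the effect is the same for everyone. All parameter values are invented. The advantage of the within-person design in this sketch comes from the baseline cancelling in each difference score, so it shrinks if stable person-to-person differences are small or if the effect itself varies a lot across people.

```python
import random
import statistics


def simulate(n=40, effect=0.3, sd_person=1.0, sd_noise=0.5,
             n_sims=2000, seed=3):
    """Spread of the effect estimate under between- and within-person designs."""
    rng = random.Random(seed)
    between, within = [], []
    for _ in range(n_sims):
        # Between-person: two separate groups of n people, one measurement each.
        g1 = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
              for _ in range(n)]
        g2 = [rng.gauss(0, sd_person) + effect + rng.gauss(0, sd_noise)
              for _ in range(n)]
        between.append(statistics.mean(g2) - statistics.mean(g1))

        # Within-person: the same n people measured twice; the baseline
        # appears in both measurements and cancels in the difference.
        diffs = []
        for _ in range(n):
            base = rng.gauss(0, sd_person)
            t1 = base + rng.gauss(0, sd_noise)
            t2 = base + effect + rng.gauss(0, sd_noise)
            diffs.append(t2 - t1)
        within.append(statistics.mean(diffs))
    # The standard deviation of the estimates across simulated studies
    # approximates each design's standard error.
    return statistics.stdev(between), statistics.stdev(within)


se_between, se_within = simulate()
print(f"between-person standard error: {se_between:.3f}")
print(f"within-person standard error:  {se_within:.3f}")
```

      With the same number of people per condition, the within-person estimate is less variable here because only the measurement noise remains in each difference score.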

  6. “I think any researchers in this area really need to focus first on studying large effects (where possible) “

    I’m curious why you added the “in this area” qualifier? Is it more necessary for researchers in some areas to restrict themselves to studying large effects than in other areas?

    Is this related to the significance (not statistical) of the findings?

    e.g. If aspirin or wine have a tiny effect on cardiac mortality is it ok to study such effects or not?

    • Rahul:

      I say “large effects,” but, sure, if they’re interested in the topic and there are no large effects, then they can study small effects. That’s why I qualified with “where possible.” Sometimes there are no large effects to be found. But it makes sense to try to look for the large effects first. My main point was that if you mismeasure your x-variable, that will make everything that much more difficult.

      • Well, if the effects are truly large, a lot of these pesky problems never arise. It is mostly when hunting for small effects that things get tricky.

        I read your advice as saying, please don’t study small effects.

        • A wise statistician (I’m thinking it was perhaps Jimmie Savage; maybe Andrew remembers) recommended the IntraOcular Traumatic Test: plot the data, and if the result hits you between the eyes, it’s significant.

        • All the grouchy old (industry) engineers I ever worked with swore by that principle. And often I’m tempted to agree. Most everything else is an attempt to produce significance where none exists.

        • Rahul:

          No, I’m not saying “please don’t study small effects.” Small effects can be important! What I’m saying is, if you’re going to study small effects, you have to be very careful about measurement and statistical inference.

  7. I have a couple of observations here:

    1. In the Tracy and Beall paper that Andrew originally criticized, why did they have in their data a column for “red” and a column for “pink” (0,1, where 0 means not wearing red/pink, 1 means wearing red/pink), when the analysis they reported was not on red alone or pink alone but on “red or pink”? If they had a prior guess that “red or pink” was the key thing, why not have a single 0,1 response for “red or pink”? There may be a good reason for this coding, but it’s not clear from the paper. [I have the raw data from the original study; that’s how I know this.]

    2. As far as I can tell, Andrew never says this, but I think one should concede that it is possible that Tracy and Beall are right. Even if everything they did is wrong from Andrew’s perspective (and I agree), they could still be right. I.e., they may well be chasing a signal and not noise. The way Andrew presents his objections makes it sound like he’s very, very certain about his conclusion. Coming from a Bayesian, that’s pretty odd.

    3. From this post I finally understood what Andrew’s point is: take better measurements and look for large effect sizes. I appreciate the point about taking better quality measurements. But what about situations where the effect sizes *are* small but theoretically important? One example is parafoveal on foveal effects in reading: there is (or maybe was) a controversy in reading research as to whether one picks up linguistically relevant information about word n+1 or n+2 when fixating on word n while reading. These effects are inherently small. You can design an experiment to magnify the effect size, but we are still going to get relatively small effects. Should one not even study such phenomena? That seems crazy to me, because (to quote someone, I think Ray Jackendoff) that’s like searching for your lost keys on the pavement only where the light’s shining most brightly, rather than where the keys are more likely to have fallen. In linguistics, many effects are very small, but have important implications for theory development. Some people in psycholinguistics do take Andrew’s position seriously and only study huge effects using a particular method (which I will not name), but what they then come up with is trivially obvious results that I don’t even need to do an experiment to establish. The subtlety of the effect cannot be a criterion for deciding whether a problem is worth studying.

    • With regard to point (2) – this is exactly the point. Even granting that the researchers are “correct”, the evidence and the way it is presented provides very little support. I would be happy to grant a substantial prior probability that there exist cyclical behavior changes in women associated with the fertility cycle. In that sense the researchers are chasing signal, as it were. I also think there are sensible limits on the magnitude of what effect might be exhibited. And for the original paper, given the quality of the data and the magnitude of the reported effect, I think that we are left basically with very little information to update what we already believed. That’s where the chasing noise part comes in because the essence of the research claim is that having achieved “significance”, there is now confirmation that the effect is “real”.

      The second study, in my opinion, is not helpful either for the authors for several reasons. It’s basically a classic example of the reasoning process mentioned in the Meehl quote a few posts back. The best contribution of the second study is to ratchet down the hype about the effect size. The downside of the second study is that the logic, interpretation, and data portend very poorly for the future positioning of this discovery. When all the moderating effects and conditional statements, and study design elements are exhausted down the road, we will find ourselves essentially where we started – with our prior belief that “yeah, something like this must be going on at some level in the population.” That is the basis of the advice to go after the signal and not the noise.

    • Shravan:

      1. In their original paper, Beall and Tracy report tests for several different colors. Their prior guess might have been “red or pink” but other significant findings would’ve fit their general research hypotheses too. As Eric and I noted in our paper:

      the authors found a statistically significant pattern after combining red and pink, but had they found it only for red, or only for pink, this would have fit their theories too. In their words: “The theory we were testing is based on the idea that red and shades of red (such as the pinkish swellings seen in ovulating chimpanzees, or the pinkish skin tone observed in attractive and healthy human faces) are associated with sexual interest and attractiveness.” Had their data popped out with a statistically significant difference on pink and not on red, that would have been news too. And suppose that white and gray had come up as the more frequent colors? One could easily argue that more bland colors serve to highlight the pink colors of a (European-colored) face.

      2. As Eric and I write in our article:

      We are not saying the scientific claims in these papers are necessarily wrong. Maybe there really do exist large ESP effects. Maybe there really are consistent differences in the colors that women wear during different parts of their cycle. And so on. What we are saying is that the evidence in these research papers is not as strong as stated. . . . The scientific hypotheses in the papers at hand are general enough that they could have some validity even if the particular claims in the published claims do not hold.

      So I think we are in agreement here.

      3. If you’re interested in studying a small effect (and I agree that small effects can be important) then I think you need large sample size and very careful measurement. Careful measurement is required not just for concerns of statistical power (to make the signal show up amidst the noise) but because if the effect of interest is small, you have to make sure that your systematic errors are even smaller. Beyond all this, you have to be aware that effects can be small, and so patterns found in a sample will not necessarily tell us much if anything useful about the population.

      • “I think you need large sample size and very careful measurement.”

        Great! Maybe I need to have that sentence framed in shining red (or pink) and have it blinking annoyingly on my home page.

        I’m looking forward to the day that psycholinguists and linguists start running high powered studies with very careful measurement procedures, and stop drawing bold conclusions from null results in low-powered studies. But I’m not holding my breath.

        • Yes, that’s why I made the point about a PhD thesis taking 2 years or more. Who said that discovering enduring truths about nature should be easy, just requiring a bunch of 5-minute questionnaires and the computation of a few p-values?

      • As Andrew says, small effects can be important, so you need large sample size and very careful measurement.

        “Very careful measurement” is the difficult part. I’m an astronomer, and we chase very small effects all the time. My work with the Hubble Telescope involved measuring angles as small as 0.01 seconds of arc, less than one one-hundred-millionth of a circle. When measuring very small effects it is essential that systematic effects be carefully measured and accounted for, otherwise your “effect” might just be a systematic error of some sort. The rub is that every experiment is different, and accounting for these systematics can be quite tricky.

        • Bill:

          One important detail that you sort of hit on is the problem of calibrating your measuring device. In astronomy, you have well understood reference objects in the sky as well as (I presume) in the lab, to make sure you can discern the separation of objects with a given luminosity by milli-arcseconds. Calibration also is a useful test of the analysis procedure in that you know where-abouts the “right answer” is.

          It’s not at all clear to me how one would do the equivalent calibration experiment for the type of research found in Tracy&Beall. Though that is more likely a mark of my ignorance than anything else.

        • West, you are absolutely correct. In fact I think that measuring these physical things with instruments that you can understand in physical terms may make our job as astronomers easier (though I can think of numerous examples where systematic error has led incautious researchers astray).

          The Tracy & Beall kind of research is as you say much more problematic. The most powerful tool to deal with this sort of question is the RCT idea; but it’s not clear to me that they used it in this paper we’re talking about.

        • The only thing that comes to mind is to run an analysis comparing “wear pink” or “high fertility” to something completely unrelated. But there are so many confounding variables that I can’t see how one could get a baseline for “what I am measuring is noise.” Without that, I have a hard time believing any sort of significance statement.

      • You’re criticizing them because of something you found on “It was on the internet” is enough to impugn the knowledge of serious researchers? “Don’t Beall and Tracy care … that they might have gotten the dates of peak fertility wrong” .. 15 minutes with Google and really you feel you know the biology and the literature of their field better than they do and that you can judge that by not using they don’t care? You feel that they need to discuss it with you without you having to read whatever the literature is on that? Did you consider the possibility that the “data” might actually be based on some other assumptions or ideas, i.e. trying to create a confidence interval around predicted date of ovulation (mainly for people who are having trouble conceiving)? Is ovulation plus or minus a fixed number of days your definition of peak fertility? Why? Is it everyone else’s? That really does not help your argument and seems pretty patronizing to boot. I would take that whole entire section out of your paper since it just is a distraction at best and seems ignorant and patronizing at worst. There is lots to criticize in that research but this is just a tangent unless you have real evidence that they went fishing or that the results would change if they did.

        • Anonymous:

          1. We linked to the U.S. Department of Health and Human Services website which gives days 10-17. We also noted just to indicate that this information is not hard to find from other sources. And, no, the numbers are not ovulation plus or minus a fixed number of days: the stated days of peak fertility give more days before ovulation than after.

          2. I never said that Beall and Tracy needed to discuss anything with me. I just think they made a mistake. It’s not about my definition, I’m going with the medical experts unless I see good evidence otherwise.

        • They used a “typical” idealized cycle.

          They are essentially saying that it is reasonable to assume any mis-categorizations washed out in the averaging process. However, I don’t think we have enough information on the accuracy of the self-reports and distribution of cycle-lengths to say.

        • So on the basis of two websites for women trying to predict date of ovulation (not “peak fertility” whatever that is defined as) you conclude that they “made a mistake.” Well on that basis I think you “made a mistake” of the kind that is pretty common when assuming that a field other than your own is something you can discuss seriously based on reading some internet sources for lay audiences that may or may not be using the same definitions. I’m not going to give you a high school discussion of human reproduction, about whether you are better off inseminating before or after ovulation etc but the whole thing just makes your argument about the larger issue weaker by seeming shallow and patronizing or worse (claiming they don’t care). And to find the “medical experts” I’d consult the best literature review you can find, not a web site calculator that you assume you understand the structure behind. Both of those sites as well as Beall and Tracy know that a within-subject design where you are taking basal temperatures or testing the mucus would be much better. That should really be the main point, not some side issue discussion that you do not have expertise in. It’s a good example of how adding one thing too many can end up making the entire argument seem suspect.

        • Anon:

          I don’t get you here? Are you saying that the days of peak fertility are days 6-14? You might think I’m shallow to be relying on a website from the U.S. Health and Human Services (which confirmed what my wife’s gynecologist had told us), but that’s what I’m going with until you or Beall or Tracy or whatever can point to something better.

        • Anon:

          To put it another way: I’ve done some research. I tracked down the references in Beall and Tracy’s article, I looked on the HHS website and various other websites. That’s not a lot of research but it’s something: You can learn a lot from reading references and looking at websites. And what I found was no evidence that days 6-14 were the days of peak fertility, and lots of evidence that days 10-17 was a reasonable consensus choice (I also came across numbers like days 12-17 and days 9-14). If you have some better information on this, feel free to share it.

        • What is your definition of peak fertility and why are you assuming that it maps exactly to the confidence interval around a point estimate of ovulation date? For example a quick search through Google Scholar brought up which says count backwards 14 days from start of menses 2 to estimate ovulation then 0-7 days before that are considered high fertility. is the article B&T cite for the 7-14 also using the reverse counting approach. This pair of articles discusses the issue in some detail (see in particular the figure on page 46) and the pair would be a much better citation for you because it highlights the fact that forward counting, backward counting and other ways of determining the window as well as the decision about the size of the window will potentially have serious impact on results.

          No one in the world could make me spend time researching this topic, but the fact is this whole side issue of whether you know better than the people who do this research about what the right way to operationalize “peak fertility” is an issue that is not important to (and is a distraction from) the argument of the article; what is important in your article is the issue of forking paths. The fact that you have gone down this rabbit hole is a complete distraction from it and makes it seem personal. Now that you seem to have done some reading, if you don’t want to take that section out maybe you should rewrite it to talk about how there is actual discussion of this operationalization in the literature and that itself leads back to the problem the paper is about.

        • I am not assuming that peak fertility “maps exactly to the confidence interval around a point estimate of ovulation date.” That is all coming from you, indeed in a comment earlier I noted that those dates are not ovulation plus or minus a fixed number of days: the stated days of peak fertility give more days before ovulation than after. I’m taking the definition from the U.S. Health and Human Services site.

  8. This might not be the ideal venue to ask, but what exactly should researchers do with respect to future design with data/studies like this? Is it better to come up with the most direct replication possible for the sake of sticking as close as possible to the research, or to address the flaws and do a well designed investigation?

    In any case, a preregistered multisite replication is underway at Michigan State University and the University of Amsterdam. So, we should get some new data in a couple of years. University of Amsterdam is using a within-person design as well.

    An academic citation for fertility/ovulation:
    Wilcox AJ, Dunson DB, Weinberg CR, Trussell J, and Baird DD. 2001. Likelihood of conception with a single act of intercourse: providing benchmark rates for assessment of post-coital contraceptives. Contraception 63(4):211-215.

    • Nadia:

      The study of fertility and clothing choices doesn’t really interest me, but if we want to speak more generally about studies about biological cycles, hormones, and social behavior, I’d recommend much more accurate measurements of the biological x-variables of interest, and recording of many different y-variables. You could pre-register part of your study but I’d recommend looking at lots of data in your analysis and having a more open-ended focus.

      Instead of naively thinking that the science is settled and that the next paper will offer definitive proof, I’d recommend a large exploratory study with careful measurements and a within-person design, with the goal being to learn and form hypotheses which could then be confirmed in a pre-registered replication. To me, a big problem with the Beall and Tracy papers is that they’re trying to jump straight to confirmation without first being clear on exactly what they are looking for.

      • Well, if the next independent paper again shows the same conclusions linked with weather I’d tend to believe these results much more.

        I think that’s a move in the right direction. I don’t see why we need to make the whole picture fuzzier by dragging in a demand for an exploratory study now.

        Let’s not hedge. Let’s at least be prepared to trust independent replication, by a disinterested third party? At least now what they are looking for seems clearly pre-advertised with all this discussion. The forthcoming studies Nadia mentions seem to be assuaging both the main criticisms by (a) pre-registration & (b) within person designs.

        Fine, you can say, one more study isn’t enough, skeptical-me wants 4 more studies. That’s ok. But already positioning oneself to be skeptical of all not-yet-finished independent replications seems too pessimistic.

        • Rahul:

          I don’t disagree with you in principle. But in this particular case I don’t see evidence that there’s anything there at all. That’s why I said that, if someone wants to study this, they should do a within-person study with careful measurements. The existing data are so noisy that I think that any new study would essentially be starting from scratch. It’s not that I’m “demanding” an exploratory study; I’m just recommending that anyone who’s serious about the research start from there. To take these two published studies as good data and then think that one more study of that kind is going to nail things down . . . hmm, to me, that would be like taking the lineup of the 1984 Clippers and thinking that with a few good breaks you’d have a shot at the title. Sorry to be pessimistic but that’s how it looks to me. Again, I think one problem here is the expectation that a researcher should expect a good chance of discovering eternal scientific truth from a noisy study of a poorly-understood phenomenon.

        • Andrew:

          Interestingly, I think these Tracy / Beall conclusions are bogus too but it makes me welcome all the replications that I can get. Because my priors make me think these replications will fail & that only emphasizes the point.

      • Thanks. There has been some controversy over p-hacking and multiple comparisons in this subfield, and some investigators have argued that a continuous measure of fertility is ideal because investigators do not have to choose dates. Also, you argued in your other post that psychologists should analyze all of their data, and the continuous method allows for that.

        One of your postscripts mentions stats and science practice. Cumming at Psychological Science talks about the new statistics, and that might touch upon topics relevant for discussion.

  9. Andrew, they write on their blog:

    “Despite repeated requests from us, Gelman and Loken are unwilling to provide us with any information about their paper’s publication status that would allow us to ensure that we are included in the review process (e.g., they have refused to inform us of the name of the editor handling the manuscript, the journal where the paper is under review, and even whether it is in fact currently under review).”

    Is all this true? What’s the reason for all the secrecy? I would release all the information they ask for. I can’t see any reason to withhold this information. They should get a chance to provide a rebuttal to your paper (maybe accompanied with an open release of their own data on their website so that others can dig into it).

    • Shravan:

      As noted above, we will inform the editor of Beall and Tracy’s concerns. But I don’t think it’s appropriate for them to be involved in the editorial process. Similarly, if they or anyone else wants to publish an article discussing something I wrote, I don’t think that I should have the right to interpose myself and contact the editor of their paper before publication. Once the paper is scheduled for publication, it would be fine for them to discuss it. I’d say the same for Daryl Bem and the authors of the other papers discussed in our article. In the meantime, I’ve linked here to Tracy and Beall’s response, so they can get the same readership that I get.

      • Hmm…I’d disagree. I think it’s the editor’s call whether they ought to be involved or not.

        It’s perfectly fine for *you* to hold an opinion that they ought not to be involved. But I’d still be happier had you pointed them to the Editor & then he made that final call.

        In principle I don’t see how their offering their side of the story or arguments to the Editor or referees tramples upon your rights in any way. If you trust the Editors & Referees as upright, knowledgeable people, why *wouldn’t* you want them to hear all the facts (or viewpoints) when they critique your paper?

        • Rahul:

          Maybe so. I’ve just never seen it done this way before, where the author of a paper being discussed goes and intervenes in the publication process. But maybe you (and they) are right. In any case, as noted above, we’re planning to tell the editor about this (all this has happened in the past week or so), so the editors will be able to do whatever they think is best.

        • I have much less experience than you at this, but I have a paper draft that discusses some methodological problems in someone else’s paper. I felt it necessary to be in touch with the authors (well, one of them) when I first found some weird results in my re-analysis. And I made it clear I would keep them in the loop on what I was going to do with the paper (that just seemed polite – no reason they should wake up some day and find my paper in a journal without knowing it would be there). I’ll even send them a draft beforehand: they know the data really well, and the nuances of the study, so why wouldn’t I want their constructive input?

          But I think I’d feel pretty weird if they wrote me and said “Now that you are submitting this paper, please inform us where it is being submitted and who is assigned as editor.” I’d probably assume they were trying to poison the process (this is not me implying anything about any particular interlocuting authors – mine or yours – just me implying things about myself and how I’d likely respond to the situation). Sharing data, drafts, thoughts, comments, etc – that is open discourse and open science. Demanding authors you disagree with give you special privileges in the evaluation of papers critical of your work – that seems like something else. If the editor wants their input, s/he can ask for it.

        • How is it a privilege to tell them what Journal you are submitting to? It might be a privilege if the Editor decides to hear them out, sure. But whether to do that or not is the editor’s privilege not the submitting author’s.

          But to think that one does them a favor by telling them where you are submitting seems a stretch. That isn’t exactly the most highly secret information. Or it shouldn’t be at least.

          If you assume they can poison the process, one must question what opinions one has of the Editor / Referees, if one thinks they can be so easily influenced.

        • Rahul:

          Usually I do not publicize where I will be publishing a paper until it is through the revision process. So it would be special treatment for me to involve the authors of a paper we discussed in the process at this stage.

        • Andrew:

          Yes, sure. But not burdensome to the point where it deserves being labeled a “privilege”?

          e.g. Usually I don’t announce to my students whether I use vim or emacs either. But if someone asked, I see no reason not to tell. Nor would I tell him he’s getting a special “privilege” here.

        • Rahul:

          The privilege is not their knowing the name of the publication in which the article is scheduled to appear. It’s in their request to be involved in the editorial process before the revision process is over. That is what is unusual.

        • I have been involved in a situation similar to that described by jrc, where we had a paper that revealed wrinkles in a published paper and which we raised with the authors of the original paper before publication. In the end we had a peer review process where — although anonymous — we could be fairly sure two of the reviewers were authors on the original paper. I think this worked out well for us, but I can see that providing a privileged role to the original authors could produce greater barriers for publishing criticisms of published work as compared to the original research. As discussed on this blog previously, this is probably not a good thing, and having survived the original peer-review should not immunize a paper from criticism.

          To Rahul’s suggestion that the editor can work it all out: maybe, but I can see (1) the editor might feel less of an expert in a particular field than both of the parties in a particular dispute and so find it difficult to adjudicate, and (2) the editor is probably operating with a value function that includes more arguments than just adjudicating what is right or wrong for this particular paper — they have a lot of papers to work on, so if there is an unresolved dispute maybe it is just easier to take a pass on this particular paper.

        • Given that Gelman & Loken cite the paper, the editors can decide whether they want to contact them. Isn’t that how such things would normally work?

          If I write a paper building on or critiquing another paper, one of the authors might be invited to be a reviewer. Or might not. In some cases, the submission process solicits suggestions for who would be a good reviewer and who should not be invited as a reviewer (though neither of these suggestions need be honored).

          Additionally in this case, it seems like the reviewers of the Gelman & Loken paper should be experts in relevant methods or statistics — not necessarily the researchers whose work they happen to use as examples.

        • I think having Tracy / Beall as a reviewer would be a bad idea. But there’s no harm in allowing them to submit their viewpoint to the actual referees.

        • Eh?? The referees are there to evaluate the submitted paper on the merits of the work or arguments presented; are they supported by data, results etc. It would be wholly inappropriate for Tracy or Beall to put their point forward to the referees at that stage. Assuming the paper is accepted for publication, I would expect any reasonable journal/editor to allow Tracy & Beall to submit a response to Gelman & Loken, which would also be reviewed (without intervention from Gelman or Loken).

          Once an editor determines who will act as referees for a given paper, we can’t allow other interested/invested parties to interject in that process, unless the review process is entirely open and transparent with an open discussion alongside formal review, as practised at some journals.

        • >In principle I don’t see how their offering their side of the story or arguments to the Editor or referees tramples upon your rights in any way..

          In many domains of life & business, deliberations & negotiations are conducted in private. “How about that smoke coming out of the Vatican City chimney?” People in general must be very sensitive & serious about this, since legal action can be taken if one side feels that a 3rd party has interfered in a way that has caused economic harm.

          I’m not at all suggesting or implying a legal angle to this little controversy. I’m simply saying it’s IMO unfair to expect Andrew to behave differently than anyone else would in a similar situation. FWIW, I think Andrew is absolutely taking the smart path by holding some of his cards close to his chest. “Always Be Closing” ;)

        • As an Editor I think this is a _horrible_ idea. The editor gets to pick reviewers to send the article to and is free to choose the authors of papers criticised or cited (and sometimes does). Adding a layer where those authors are adding directly to the review process in addition to the reviewers is a nightmare scenario (given that peer review is already quite a long process).

          It is slightly different for commentaries (which this paper isn’t), but even then it can be hard to get legitimate criticisms published when they are reviewed by the original authors of a paper.

  10. For me, a key quote in Andrew’s post was, “their work has lots of science-based theorizing (arguments involving evolution, fertility, etc.) and lots of p-values (statistical significance is what gets your work published), but not a strong connection between the two,” which I think is a pretty apt description of a lot of the Psych Science type of work.

    In general, I think Tracy and Beall have a good attitude toward parts of their scientific approach. They find an interesting result. They check on it and discover it does not hold up, so they refine the interpretation. That’s what scientists _should_ do. The problem is that their inference method (p-values) does not allow them to behave in this (proper) way.

    One sees the same issues in discussions about replication (such as the recent replication efforts reported in the journal Social Psychology). I am not so concerned that many of the replications failed, but I am bothered that the field took the original studies seriously. Many of those studies seem to have the disconnect between methods and scientific intent mentioned in the above quote. Replications (failed or successful) are not going to fix these problems.

    • One thing I don’t get in this and other discussions like these on this blog, which dump on p-values, is: if we take the idea of running high power studies seriously, then the likelihood is going to dominate in the Bayesian analysis. The conclusion that you’d draw (the decision you make based on the data) from a p-value vs the HPD or credible interval is going to be the same. My understanding is that with high power studies this statement is true. If so, it’s misguided to attack p-values per se; the problem is low power and measurement precision, as discussed earlier. I’m happy to be corrected about this.
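
A minimal sketch of the claim in the comment above, under assumptions that are not in the thread: a normal mean with known standard deviation, a N(0, 1) prior, and made-up values for the sample mean and sigma. The function names `intervals` and `max_gap` are hypothetical. It compares a 95% confidence interval with a 95% credible interval from the conjugate normal-normal update, for a small and a large sample.

```python
import math

def intervals(ybar, sigma, n, prior_mean=0.0, prior_sd=1.0, z=1.96):
    """Return (confidence interval, credible interval) for a normal mean, sigma known."""
    se = sigma / math.sqrt(n)
    ci = (ybar - z * se, ybar + z * se)
    # Conjugate normal-normal update: precisions add, means are precision-weighted.
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    post_sd = math.sqrt(post_var)
    cred = (post_mean - z * post_sd, post_mean + z * post_sd)
    return ci, cred

def max_gap(ybar, sigma, n):
    """Largest absolute difference between matching endpoints of the two intervals."""
    ci, cred = intervals(ybar, sigma, n)
    return max(abs(ci[0] - cred[0]), abs(ci[1] - cred[1]))

# Illustrative (assumed) numbers: sample mean 0.5, sigma 2.0.
for n in (10, 10_000):
    ci, cred = intervals(0.5, 2.0, n)
    print(f"n={n}: CI={ci}, credible={cred}, gap={max_gap(0.5, 2.0, n):.4f}")
```

With these assumed numbers the two intervals differ visibly at n = 10 and nearly coincide at n = 10,000, because the data precision swamps the prior precision. This only illustrates the single-test case; it says nothing about what happens when the analysis itself was chosen after seeing the data.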

      • Shravan,

        As I understand it, p-values work just fine for a single test. By “work” I mean that the reported p-value corresponds to what was intended. The problem is that typically there is not a single test, so the p-value no longer means what was intended. Given that scientists want (and should) explore their data, there is an unknown number of possible tests (whether actually performed or not), so a p-value cannot mean what was intended.

        I agree that if you have high-powered studies, then you often reach the same conclusion as for other methods, but the articles we are talking about are generally not high-powered studies (although they almost always find a successful outcome).

        The deeper issue, which I think is what you were getting at, is about the decision being made. I guess the Tracy and Beall conclusion is given at the end of the abstract of their Psych Sci article, “Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue.” I don’t think such a conclusion/decision can be interpreted without some additional context. In particular, I think it is missing a “so…” clause that indicates how this conclusion might influence anything else that depends on the data.

        For example, suppose a scientist thought their findings were relevant to clothing retailers. The final clause might be “…so retailers could sell more shirts by marketing pink or red shirts to women during peak fertility.” Clearly, the retailer needs a lot more information (how to identify peak fertility, the definition of peak fertility, how many more red/pink shirts to manufacture, the uncertainty of the measures, and so forth). To make this kind of application meaningful one needs a lot more than a decision that there _is_ an effect. One needs to know the magnitude of the effect, the uncertainty of the measurement, and many other details.

        As another example, suppose a scientist thought their findings were useful to identify women at peak fertility and thereby exclude them from some other study. The final clause might be “…so researchers can exclude data from women wearing red or pink because such women are more likely to be at peak fertility.” Again, anyone wanting to put that idea into practice needs a lot more information (how big is the effect, how effectively wearing red/pink identifies peak fertility – which seems to be an inference in the opposite direction, and so forth).

        One problem with these kinds of studies (indeed with many studies in psychology) is that the conclusions just sort of sit there without any implication for much of anything. If we think about how these kinds of findings might actually be useful, then we realize that the decision/conclusion we derive from a study is often just a short description of the data. For any practical application, we have to consider many more properties of the data and the context within which the data will be used. In particular, as Andrew notes, Tracy and Beall (or anyone else) would care quite a bit about the definition of peak fertility and how it is measured.

        This is a long way of saying that I don’t think decisions/conclusions of this type are useful without appropriate context. To just say that an effect “exists” hardly requires a p-value and hardly seems like a useful statement to make. Such conclusions seem inherent in the use of p-values, but may not be inherent in other data analysis methods.

        • I understand your point (and interesting looking article; I will read it carefully; I think that we have the same problems in psycholinguistics that you point out in this article).

          For people like me, there is no practical implication of the type you suggest, but there is a practical implication in terms of the cognitive architecture and mental processes that we should assume/implement given data. I am assuming that for these authors there are similar theoretical implications that constitute the decision that must be made. For their theory development, they have to either act as if this red/pink thing is true, or not.

        • I think you are correct that they (meaning many psychologists) use decisions based on p-values as a means of theory development: if p<.05, then include the term in the theory/model, otherwise leave it out. Unfortunately, that's a horrible way to build a theory/model. Piecewise decisions, which control Type I (or II) error for individual tests, accumulate potential error across multiple decisions (both in terms of including terms that do not really have an effect and in excluding terms that really do have an effect).

          I don't have all the answers, but I think there has to be a better way to build a model/theory than the piecewise decisions of hypothesis testing. That's true for both frequentist and Bayesian approaches.
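
A rough sketch of how piecewise decisions accumulate error, under the simplifying assumption (not made in the comment above) that the tests are independent and share the same per-test error rate. The function name `prob_at_least_one_error` is hypothetical. The chance of at least one error across k such decisions is 1 - (1 - e)^k.

```python
def prob_at_least_one_error(per_test_error, n_tests):
    """P(at least one error) across n_tests independent decisions.

    Assumes independence and a common per-test error rate; real analyses
    are usually correlated, so treat this as an illustration, not a bound.
    """
    return 1.0 - (1.0 - per_test_error) ** n_tests

# Ten terms with no real effect, each tested at alpha = 0.05:
false_inclusion = prob_at_least_one_error(0.05, 10)
# Ten terms with real effects, each tested at 80% power (Type II rate 0.20):
false_exclusion = prob_at_least_one_error(0.20, 10)

print(f"P(include at least one null term)  = {false_inclusion:.3f}")
print(f"P(exclude at least one real term) = {false_exclusion:.3f}")
```

Under these assumptions, ten decisions at the conventional thresholds give about a 40% chance of including at least one spurious term and about an 89% chance of dropping at least one real one, which is the accumulation the comment describes.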

      • I think the ‘garden of forking paths’ criticism applies with nearly equal force to typical uses of Bayesian posteriors; it’s just that Tracy & Beall happen to have used p-values in this particular case. The only way to be confident that you haven’t informally sniffed through your data and made your ultimate statistical analysis dependent on the particular properties of your sample is something like pre-registration of your hypotheses prior to looking at the data (or pre-registration of a genuinely formal & deductive theory, such that the hypotheses teased out of it were implicitly pre-stated—but this level of formality is very rare in psych).

    • Greg:

      There’s also this weird thing that they don’t seem bothered at all by getting the dates of peak fertility wrong. This says a lot to me, not about Beall and Tracy in particular—as I’ve said repeatedly, I have no reason to think they’re not conscientious researchers—but about the attitude that all that matters is statistical significance, and it doesn’t matter so much how you got there.

      • Perhaps I don’t understand this, because it’s more complicated than it seems – because to me, it seems really, really simple. Say you want to investigate birds’ tweeting behavior at dawn, which is between, say, 05:00-07:00. But actually, you observe between 03:00-05:00 (because there is an error in the automatic clock that activates recordings). Then, however breathtaking your eventual results – they don’t tell you anything about birds’ tweeting behavior at dawn. Or take a chemist who intends to measure the activity of a protein at pH 6.5-7.5, but erroneously (say, a calibration error of the pH electrode) shifts the range up or down: she will have a result, but not the one she thinks she has.

        Now, contrary to my fictitious examples (at least the first one), there is an overlap between their peak fertility window and the one you are suggesting. But it seems obvious that they should really, really sort that out. As I understand it, this doesn’t even have anything to do with “statistics” in any but the most trivial way – ?

        • Here’s what I don’t understand. There are two different possibilities here: the first is that women dress differently when they are actually at peak fertility, whether they know when that is or not, presumably from some deep sociobiological imperative. If that’s true, then they should dress differently at their own actual fertile time, not from some estimate of some window of averages. The other possibility is that in some direct psychological way women dress differently when they think they’re fertile. But in that case actual fertility windows are irrelevant, unless someone shows women are accurate about this, which seems belied by the need for medical advice of the sort Andrew discussed.

        • Martin:

          But this does have to do with statistics! Measurement is central to statistics. I think we have a problem in statistics education, that we spend very little time on measurement issues. We typically assume that the measured “x” is the same as the “x” we care about, and the measured “y” is the same as the “y” we care about, and measurement-error models are typically in some small section tucked away in the middle of the textbook. (That’s how it is in my textbook, I’m sorry to say.)

          Some central issues of statistics:
          – Interpolating and extrapolating a function
          – Generalizing from sample to population
          – Generalizing from the thing you measured to the thing you care about.

          All of these are important, but I fear that in our teaching and textbooks, we understate the importance of measurement, which leads to problems of the sort discussed here.

        • Andrew,

          Sorry, I think there is a misunderstanding (and indeed, I put it in a way that forced it).

          I do not doubt that this is an important issue in statistics, just that I do not see how one has to know any statistics, at all, to see the specific problem you raise. Perhaps there are less clear-cut examples. But as I understand the point here, the authors claimed to have found an effect during peak fertility, but they got the time window wrong. I have no doubts that a good statistics education has to emphasize such points, I just think that in this specific example the point is blatantly obvious to anybody who has a high school degree (in terms of understanding it; to spot it is another issue).

          In fact, it made me think of a mistake made by a colleague of mine. He once had to revise months of chemistry work because he got the stoichiometry wrong: he had a reaction between two molecules A and B, but he calculated the mass of A incorrectly (he omitted the counterion, a silly mistake, but it happens). So, instead of having a 1:1 reaction w/r/t numbers of molecules, he had something like a 0.78:1 reaction. Not all was lost: he did have a reaction result (new molecules that were characterised, and showed catalytic activity), but the interpretation had to be changed considerably (and of course, the actual ratio was completely unmotivated). Here again: of course getting the stoichiometry right is part of any chemistry course – but everybody who has learned how to add and multiply numbers should understand the obvious mistake here. This was not some intricate issue concerning reaction kinetics, where a series of scholars with detailed knowledge about this type of chemistry had to argue about the problem: once it was spotted, it was a simple mistake, it was obviously not intentional, and anybody could understand (as well as commit) it. (Btw. he was lucky: redoing the work with the correct ratio, he got the same products, just better yields. Or perhaps he was unlucky, because if it had yielded different products, he’d have had another result.) Had he just published it, the actual numbers would have been deeply buried in an “experimental section”, and no reviewer checks those anyway.

        • “Had he just published it, the actual numbers would have been deeply buried in an “experimental section”, and no reviewer checks those anyway.”

          That would be dereliction of duty on part of the reviewer. I check everything that’s written when I review a paper.

        • Well, yes. That’s bad, but is it surprising? I won’t go as far as to suggest that you are the exception, but given the mistakes to be found in experimental sections, it is certainly not universal – three reviewers cannot all make the same multiplication error as the author, if they really checked the calculations (rather than reading through them for overall plausibility). Perhaps it’s a feature of the field: it’s an open secret that you have to redo those calculations, not as a principle, but because they are often wrong. Also, these are typically sections where masses of a compound weighed at two significant decimal places miraculously transform into 4 after carrying out the calculations, without changing the mass unit.

          There are also other issues: for example, people often report melting points of compounds after the evaporation of a solvent. But one really should report the melting points of crystallized compounds for reproducibility. It’s an open secret that many of these numbers are garbage. Or sometimes people report “yields” based on gas chromatographic analysis of the reaction mixture – meaning that they integrate the product signal in the mixture. Of course, this doesn’t tell you anything about the isolated yield, i.e. what you can actually get out of these procedures, and thus if it is a useful process or not. Or, would you be surprised that in this patent a yield of over 100% is reported?

        • I wouldn’t even call it laziness, it’s just very adapted to the immediate needs of the subfield in question. It’s a pity, because some cheap quality control that does not involve much work would make all this more reliable.

          For example, the melting point issue is understandable in a very narrow sense: a synthetic chemist who wants to redo a synthesis and needs just very general evidence if the reaction has worked, will be happy enough about some sorta-kinda available melting point (she’ll do an NMR anyway – here, the solvent signal is often not corrected, so there…) – and if she follows the procedure reported, there is a chance that she’ll have at least a similar melting point. But a physical chemist studying, say, phase transitions cannot use those data, as it is not clear if the melting point is an actual solid-liquid transformation, or rather a glass-liquid transition of a compound containing some solvent and silica residues. That’s a pity, as better practice could make these data much more useful.

        • Martin:

          I think measurement is central to statistics just as sampling is central to statistics. If someone does a highly non-random and non-representative poll and gets bad inference for the population, we’d say their survey has a statistical flaw. Similarly, if someone measures A and makes a claim about B, I’d call this a statistical flaw too.

          Here’s the problem: Statistics training at all levels makes clear the problems of nonrandom sampling, and that’s great. But statistics training does not always emphasize the importance of accurate measurement. Instead, sometimes people get the impression that if the p-value is less than 0.05, that you have statistical significance and you’ve won. People forget that just cos you reject a null hypothesis, this does not mean you have strong evidence in support of your favorite alternative.

        • “sometimes people get the impression that if the p-value is less than 0.05, that you have statistical significance and you’ve won.”

          What do you mean sometimes? And in a way they really have won: it’s the only ticket to publication glory today.

        • +1

          And it is not only measurement; research design is often not dealt with explicitly either. Most textbooks and courses focus on estimation and mathematical statistics, yet these methods won’t typically save you from a bad design.

  11. Even without considering a “garden of forking paths”, we still have a multiple comparison problem due to the multiplicity of researchers. Where does it stop? How does one define the number of hypothetical comparisons? It’s such a big deal in doing NHST “correctly”, and yet is so ill-defined. Worse yet, it has no relationship to the “probability of being correct”, which is what most researchers _really_ want to get at when they apply multiple comparison adjustments in the first place.

    To me the bottom line is that the logic of familywise error rates + multiple comparison adjustments is an epistemological dead end.

  12. Is it really that important to get the days of max fertility right? I mean, it is downright embarrassing not to know the answer for such an extensively researched subject and, as it appears that there is a fair amount of variation between women, not to even attempt an individualized account (that is, assess what the max/min period is for a particular participant and do the calculation from there). But possibly there are some sorts of precursors for max fertility days, and whatever the suspected mechanism is that allegedly makes women prefer a certain coloring under certain weather, it might depend on these precursors. Evolution is not a perfect mechanism.

    The main problem that I (far from being a knowledgeable commenter) don’t get about such research is not the poor statistical support or implausible hypothesis, but the very low added value that a positive result might bring. Suppose Tracy and Beall proved their point beyond doubt, so what? There are so many things that can mediate menstrual cycle and fashion preferences that we would have no idea where to start. And if you are interested in those mechanisms (which would be a fascinating scientific question) why not study them in a more direct way?

    • D.O.:

      Regarding your first paragraph, “evolution is not a perfect mechanism,” sure. That’s why we said we could equally imagine pink being the popular color in certain days and not red, or gray being more popular during fertile periods, etc. Beall and Tracy discussed the idea of different behavior during different weather, and one can easily imagine variation based on age, marital status, religiosity, income, etc etc etc. Once you open the door to these other possibilities, you get the multiple comparisons problems that got all this started. So, I think it’s ok to be completely open about which days are the important ones, but if you want to do this, I think the analysis should be more open-ended. Once you accept the complexity of the threads linking evolutionary pressures to choices of clothing, I don’t think it makes sense to put down a single choice.

  13. One point I’m confused about is that this garden of forking paths seems almost ubiquitous in most published papers. Most studies I read make implicit decisions & even those that are pre-registered don’t enumerate things to the level of detail which precludes this multiple comparisons problem entirely.

    (a) Does Andrew have papers in mind (not exploratory ones) which have effectively removed this forking paths problem & (b) Do we only apply this criticism to those papers whose conclusions we don’t like?

    • Rahul:

      This question deserves a separate post, but, just very briefly, I’ve published many applied statistical analyses and I don’t think I’ve ever used a preregistered replication. So you can take a look at my published papers to get a sense of what I’m talking about.

      • Are you saying your studies are somehow immune from the multiple comparisons problem? How?

        Maybe so, but I’d love some more general tests to know whether or not a certain study is indeed guilty of the forking paths pathology. Or are you saying that this is a pathology mostly associated with p-values so the way to avoid multiple comparisons is to avoid p-values?

        Otherwise, what’s the magic ingredient if not pre-registration even?

        • Rahul:

          I do think p-values make things worse. Jennifer and I have a paper on how you don’t need to worry so much about multiple comparisons if you do hierarchical models.

        • I remember the paper you mention here, on hierarchical models and multiple comparisons. How would you apply that strategy to the situation Beall and Tracy are in, which is to say the situation that most empirical scientists are in?

          I remember feeling like your solution was a great end-run around the multiple comparisons problem for a subset of multiple comparisons problems that had basically no connection I could see to any problem I was likely to encounter. I think the same holds true here, sadly, but am happy to be shown wrong.

        • Erin:

          In this case I’d do something similar to what Aki and I did in our birthdays example (see the cover of BDA3). Instead of pre-choosing a window, I’d estimate the outcome as a smooth function of day of cycle. Of course you’d need a lot more data for that, but that’s the way it goes. If you want to learn, sometimes it takes some work.
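
As a rough illustration of estimating the outcome as a smooth function of day of cycle rather than pre-choosing a window, here is a minimal sketch using a simple kernel smoother. This is a stand-in chosen for brevity, not the model from the birthdays example; the function name `smooth_rate`, the 28-day cycle length, the bandwidth, and the toy data are all assumptions made for illustration.

```python
import math


def smooth_rate(days, wore_red, cycle_length=28, bandwidth=2.0):
    """Kernel-weighted proportion wearing red for each day of the cycle.

    days:      cycle day (1..cycle_length) reported by each participant
    wore_red:  1 if that participant wore red or pink, else 0
    The 28-day cycle and the Gaussian kernel are illustrative assumptions.
    """
    estimates = {}
    for target in range(1, cycle_length + 1):
        num = 0.0
        den = 0.0
        for d, y in zip(days, wore_red):
            gap = abs(d - target)
            gap = min(gap, cycle_length - gap)  # treat the cycle as wrapping around
            w = math.exp(-0.5 * (gap / bandwidth) ** 2)
            num += w * y
            den += w
        estimates[target] = num / den if den > 0 else float("nan")
    return estimates


if __name__ == "__main__":
    # Hypothetical toy data, not from any study discussed here.
    days = [3, 8, 12, 13, 14, 20, 25]
    wore_red = [0, 0, 1, 1, 0, 0, 0]
    for day, rate in smooth_rate(days, wore_red).items():
        print(day, round(rate, 2))
```

With only a handful of observations per day the resulting curve is driven almost entirely by the bandwidth choice, which is one way to see why this approach calls for much more data.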

        • Huh? I mean, sure, I think that’d be a great way to approach looking for a phase effect, in general! I dislike categorizing continuous variables as much as the next girl. But I don’t see how it solves the problem that you’re still choosing one path out of many. OK, maybe you’ve dealt with the “fertility window” dimension of the problem. But all the other data-dependent analytical choices are still there. I don’t see how any kind of model shrinkage can deal with those. What am I missing?

        • Erin:

          If I were particularly interested in peak fertility, I’d try to measure it directly or else I’d use the standard definition. But ultimately I think this sort of study is more open-ended, there are lots of different possible theories relating to different parts of the menstrual cycle (presumably this is one reason why Tracy and Beall appear to be sticking with days 6-14 despite these not being the days of peak fertility according to the sources I’ve seen), also different patterns of dressing would fit their general hypothesis (as Eric and I noted in our paper, one might argue that gray colors would be better to highlight the pink of a face, and one might also argue for negative effects for college students who don’t want to get pregnant, also one could imagine opposite effects for partnered and unpartnered women as in that other study we discussed, etc.). Put this all together and you have a lot of potentially interesting patterns so I don’t think it would make sense to choose one path. I’d prefer to fit a multilevel model allowing all sorts of variation, the same sort of thing we do when we analyze opinion poll data that varies by ethnicity, income, state, etc. Again, to do this in any serious way would require better data (in particular, multiple measurements on individual people) and also a much larger sample size.

        • @Erin @Andrew

          There is a new and very exciting method one can use in these sorts of situations: Simplicity.

          If you want to test a theory, as opposed to doing exploratory analysis, make the theory as detailed as possible, back out the single sharpest and most discerning theoretical prediction, choose the most relevant measures and accurate measurement instruments, and design the simplest most powerful test you can. (You can adapt this advice to estimating quantities of interest too).

          “Simplicity is not a given. It is an achievement” William H Gass, quoted in Rosenbaum P.R. “Design of Observational Studies” pg VII.

          “Transparency strongly encourages the use of the simplest methods that will be adequate” D. Cox, op.cit. pg 147

          “A main objective of research design is thus to simplify the conditions of observation”, M Susser, op.cit. pg 148.

          Which is to say: If in the course of a (non-exploratory) investigation you find yourself facing a multiple comparisons problem, forking paths, etc. then you likely came up with a bad research design.

    • Yes, exactly. The forking path problem *is* ubiquitous, so why single out Tracy and Beall for public shaming? Their paper is hardly an egregious example, and might not be an example of this problem at all. Gelman and Loken present no evidence that Tracy and Beall made any analytic decisions contingent on their data. It would have been trivial to find papers, such as most of mine, that *admit* to it, e.g., “there was no significant effect, but after controlling for age, sex, ethnicity, whatever, there was a strong positive correlation (p < .01)” etc. etc.

      This leads me to suspect that Rahul is correct: Gelman and Loken are applying this criticism to papers whose conclusions they don't like.

      • Ed:

        No shaming going on here, and no singling out. Indeed, if you click through to our paper, you’ll see a discussion of several papers that have similar multiple comparisons problems. I’ve also discussed many many other examples on the blog. Again, no shaming. I think Beall and Tracy made some mistakes. These are mistakes that others make too. I’ve made other mistakes, we all make mistakes.

        If the identifying of a mistake is thought of as “shaming,” that’s, well, that’s a shame, indeed that’s the kind of attitude that I suspect can lead to a lot of defensiveness. People sometimes point out my mistakes. I don’t say they’re shaming me, I go and look and sometimes disagree and sometimes go back and make serious corrections. I don’t see Tracy and Beall saying that I’m shaming them. My impression is that they see Loken and me as disagreeing with them. They disagree with our disagreement, which is fair enough. They did feel that we’d mischaracterized some of their work and so I edited the post (see P.P.P.S. above) to address that.

        Anyway . . . in my consideration of Beall and Tracy’s paper, I realized an important source of confusion that I think is poorly understood. The key point is that multiple comparisons can occur without “cheating” or “fishing.” A researcher can do a unique analysis on a particular data set, but had the data been different, a different analysis could’ve been done. When we discussed multiple comparisons, one thing some people said was: Hey, that’s not me! I only did a single analysis on these data! And the point of this paper is for Eric and me to say: We’re not questioning your integrity! We’re not saying you fished for statistical significance! We’re saying that, had your data been different, you well could’ve done a different analysis. (A point that is borne out in the Beall and Tracy case because they did do a new study with different data, and they did do a different analysis there, looking at an interaction.) I don’t think this sort of data-contingent analysis is necessarily bad science, but it does invalidate p-values. We’re making a statistical point, and we used many examples to illustrate it. The examples came somewhat haphazardly: people pointed them out to me, often after they’d received lots of publicity.
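
The point that a single, data-contingent analysis can invalidate a p-value can be sketched with a small simulation. Everything below is an illustrative assumption rather than anything from the Gelman and Loken paper or the Beall and Tracy data: shirt color is generated independently of fertility, the sample size of 124 is borrowed from the two samples mentioned in this thread, and the data-dependent choice is crudely modeled as reporting whichever of red, pink, or red-or-pink shows the clearest difference. That is cruder than the forking-paths argument itself, which does not require the researcher to have actually tried all three.

```python
import math
import random


def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=2000, n=124, p_fertile=0.3, seed=1):
    """Return (rate_fixed, rate_forked): share of null datasets with p < 0.05.

    rate_fixed  : the red-or-pink comparison is always the one reported
    rate_forked : the reported comparison depends on how the data came out
    """
    rng = random.Random(seed)
    colors = ["red", "pink", "other"]
    weights = [0.1, 0.1, 0.8]  # assumed color frequencies, unrelated to fertility
    fixed_hits = 0
    forked_hits = 0
    for _ in range(n_sims):
        fertile = [rng.random() < p_fertile for _ in range(n)]
        shirt = rng.choices(colors, weights=weights, k=n)
        n1 = sum(fertile)
        n2 = n - n1

        def p_value(targets):
            x1 = sum(1 for f, c in zip(fertile, shirt) if f and c in targets)
            x2 = sum(1 for f, c in zip(fertile, shirt) if not f and c in targets)
            return two_prop_p(x1, n1, x2, n2)

        p_fixed = p_value({"red", "pink"})
        p_forked = min(p_value({"red"}), p_value({"pink"}), p_fixed)
        fixed_hits += p_fixed < 0.05
        forked_hits += p_forked < 0.05
    return fixed_hits / n_sims, forked_hits / n_sims


if __name__ == "__main__":
    fixed, forked = simulate()
    print("pre-chosen analysis:", fixed)
    print("data-contingent analysis:", forked)
```

By construction the second rate cannot be lower than the first; how far it exceeds the nominal 5% depends on the assumed frequencies and on how many alternative analyses are considered plausible.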

        • What stage of careers are Beall and Tracy in? Just curious. Often that can influence how one takes criticism.

        • Associate Professor and graduate student, I believe. If I’m not mistaken the graduate student once moonlighted as a contestant on Big Brother…

        • Argh! I don’t believe there is any confusion here – I believe you are not describing what you claim. The example you give with the followup study by Tracy and Beall was NOT a single analysis. The second analysis followed the original analysis explicitly because of a lack of significance. It WAS “fishing” and the authors acknowledged that. Which is why they designed a followup study to confirm the hypothesis that had been generated by this “fishing”.

        • Anon:

          Eric and I wrote our article before Beall and Tracy’s second paper. We were discussing the many ways (most obviously, in the rules for data exclusion) where their analysis was contingent on data. I was just mentioning their second paper as another example of this.

  14. Regarding the fertility window used by Beall and Tracy. Here is some post hoc reasoning (with the huge caveats that I know next to nothing about cycling and fertility, nor have I digested any of these papers). According to figure 2 in this paper:

    female swellings in chimpanzees seem to mostly lead (not lag) the day of ovulation. Since reddish shirts are hypothesized to be the analogs of chimpanzee swellings, perhaps they would be worn mostly prior to ovulation. Given their hypothesized role in advertising fertility and attracting mates, and that mating doesn’t happen instantly, it would make sense that they would be worn more often in the days leading up to ovulation than in the days afterward.

    • Ed:

      1. I’m not saying this reasoning is wrong, but if Beall and Tracy want to go this route, they’d have to back off a bit from claiming that the specific form of their hypothesis is purely theoretically motivated, since in their paper they just talked about “peak fertility.”

      2. Also, the 10-17 day window is for peak fertility which already is mostly before the expected day of ovulation. Just for example, this ovulation calculator gives day 15 as the expected date of ovulation (given a 28-day cycle) and days 12-17 as top fertility times. So the dates 10-17 are mostly pre-ovulation (in expectation). I don’t see how you could stretch this to days 6-7, but I guess it would be possible to construct a reasonable theory to do it. I wouldn’t recommend constructing such a theory based on the Beall and Tracy data, though: the measurements are so noisy and sample size so small that I don’t think you’ll get much from this.

      Again, let me return to the point that I think many researchers have a naive attitude (abetted, I’m afraid, by the mini-success stories that appear in statistics texts, including my own) that leads them to vastly overestimate their chance of discovering an enduring scientific truth from a simple low-cost survey or experiment.

      • For argument’s sake, say researchers had a gloomier outlook about their low-cost studies: it still shouldn’t alter the fundamental truth of whether they have discovered an enduring scientific truth or not.

        I think the attitude, optimism & other meta traits ought to be orthogonal to whatever tests we use to vet a discovery.

        My problem with this whole discussion is it is often way too fuzzy. We are focusing too much on imputed motivations, assumed attitudes, personalities etc. without as much clear statement of what makes a result acceptable and what not.

    • It may be post hoc reasoning for you, but from the reading I did yesterday it’s well established. The Wilcox et al article that Nadia cited seems to be the standard reference and it represented a change from the earlier ad hoc methods.

    • If you comment and it does not appear right away, it was caught by the spam filter. Please try again. I think it helps if you supply the same name and address as in earlier, successful, comments.

  15. I like to think of this as a test question, and I doubt that many reviewers would have gotten it fully wrong.

    Tracy and Beall set out to understand if there is a relationship between displaying red colors and fertility. They know that female great apes’ skin becomes more red during their fertile days, and theorize that human beings follow this natural pattern in their clothing choices. They create a research design, tracking the clothing choices of about 150 women, and end up with a dataset that indicates day of cycle and the color of clothing that women wore.

    They assume some distributed response among the women to their hormonal changes, bound to their subconscious knowledge of their fertility. They calculate the number of days from peak fertility, supposing that propensity to wear red is normally distributed around peak fertility, and check the odds of wearing red based on the absolute value of distance from the day of peak fertility. This relationship is not statistically significant (but the coefficient is the right sign! and it’s, perhaps, “marginally” significant at p=.11)

    “Well,” Dr. Tracy says, “my measurement is a bit noisy–after all, perhaps the impetus to wear red is quite small, compared to what mood a subject is in,” recognizing that our tools are not always perfect. The researchers decide to lump the days of peak fertility together, because perhaps the choice structure distribution is uniform rather than normal across fertile days–they want to see if on any day that sex can result in conception, a woman would be more likely to wear red. They look at some papers from the subfield, and those articles appear to say that days 6-14 are fertile days. They run the regression and find that the result is similar in magnitude, and now even closer to their chosen alpha level (p<.08! Tantalizingly close!). They look at the data and think: well, red and pink really are the same color when it comes down to it, and decide that it is eminently reasonable to include pink. The results come in: statistical significance!

    Can they reasonably conclude that in the population at large, "women [are] more likely to wear red or pink at peak fertility?" Why or why not?

    Extra Credit: The authors repeat the study, but this time find no significant association between fertility and wearing red. However, when they correct for weather, they find that during February and March, women who are in days 6-14 are more likely to wear red. Does this replication provide evidence for a "cold-weather effect" explained by women's ability to signal fertility through scanty clothing during the warm months (and their resorting to wearing red when fertile only during February and March)?

    Extra Extra Credit: Come up with an abstract for a theoretically-motivated alternative explanation for the observed data. The best abstract will be submitted to 4 upcoming conferences in the field; acceptance guarantees an automatic A in the course.

    • I’ve posted a few times now on this post (many but not all of the Anonymous comments are mine). The post above perfectly illustrates what has bothered me so much about it and so I feel compelled to reply one last time.

      1. I agree in principle with everything Gelman says about the problems of “forking”. In fact, the post above suggests that most reviewers would recognize the problem with forking vis-a-vis p-values and inference. I agree with this interpretation.

      2. At least in my discipline (Ecology), the dangers of forking are well known. Anyone doing the sort of analysis described by Robert above and Andrew in his post *would* be accused of cheating or fishing. The claim by Andrew that he is not accusing the authors of cheating or fishing seems hollow at best. If you know you shouldn’t be doing something and you do it anyway – what else can you call it?

      3. I have read the paper by Tracy and Beall and I can say unequivocally that there is absolutely no evidence suggesting the authors did anything like what Andrew implies or Robert explicitly outlines in his post. The part described in the Extra Credit perfectly illustrates this since they used that analysis to generate a hypothesis (not to make inference) which they then tested with a followup study. In fact the available evidence suggests the authors have done everything they should have done!

      4. If Andrew and Robert would like to frame this discussion on hypothetical grounds, fine. But to accuse the authors of something they have no evidence for strikes me as libelous – especially in the context of point 2 above. A researcher’s reputation is the only thing they have – lose that and all your published work goes with it.

      I’ve always read this blog closely and have recommended it to others. I agree fully with the principles articulated in this post. But dragging these authors’ names through the mud based on assertions of what could have happened had the data looked different is inexcusable.

      • And I’m another anonymous commenter and I agree. There is no basis for attacking these researchers in such a personal and accusatory manner, not only on the topics of most of the discussion here but also on whether they know the literature of their field.

      • Anon- I don’t believe the process I outlined to be “fishing” in any way! That’s probably how I would analyze the dataset if it were my own experiment. In fact, I think that those steps would have been regarded as reasonable in a pre-registration of the study in question (test for different distributions of response, examine red and pink). The issue is not one of ethics, but of statistical understanding; the data and research design do not provide conclusive (or really even reasonable) evidence to affirm the hypothesis given the likely size of the effect and the noise involved across subjects. The authors told us that they did test for red only, and found the association not to be significant. I guessed that they also tested for a direct relationship between distance from fertility and wearing red, but they may not have. I’m not suggesting anything other than straightforward, honest, and reasonable data analysis. It’s simply that the results discovered do not provide anything like conclusive evidence of an answer to the stated research question.

        Even if the steps above had been outlined in the pre-registration of the study, would the effect observed be enough to affirm the conclusion of the paper? If the researchers had performed only one analysis (and hadn’t even seen that when checking red alone, the difference was not significant), would those results be enough to make the claim that women are three times more likely to wear red or pink when fertile? Does frequentist statistical testing allow that claim to be validated based on the data collected?

        The extra extra credit question is an indictment of the general problem here, not an accusation: these data do not exclude a multitude of alternative hypotheses. In fact, it could be that women are more likely to wear red or pink during the week after their period (day 5-12, let’s say), because they have recently seen blood. I lament that people keep assuming that this is an accusation of an ethical violation rather than one of misunderstanding of statistics even at the high level of a top journal in a field that is generally regarded as understanding statistics well.

        • “The issue is not one of ethics, but of statistical understanding; the data and research design do not provide conclusive (or really even reasonable) evidence to affirm the hypothesis given the likely size of the effect and the noise involved across subjects.”

          Wow – talk about shifting goal posts. I thought the issue was one of forking – now it’s about effect sizes that you somehow know a priori are “likely” small.

          I don’t have any evidence that the authors did tell you they first tested for red but 1) this certainly can’t be true of the initial paper by Gelman because the authors posted on a blog that Gelman had NOT contacted them and it is not in their Methods and 2) it is common practice in my field to back up such statements with something like pers. comm. as a citation.

          I prefer to be generally right rather than precisely wrong and I think in this case too the point of the paper is that there is an increase in the likelihood that women wear shades of red during periods of high conception risk (i.e. odds >1 vs. odds = 3). Finally the fact that the data are consistent with a multitude of explanations is true for any design, experiment or statistical approach – there are even the unknown ones that Donald Rumsfeld can tell you all about.

        • I don’t really think this argument is getting anywhere, and I’m not a big fan of conversing with Anonymous on the Internet, so I’m going to just leave it here.

          I think that the evidence presented is not that strong (and that Gelman and Loken do an admirable job demonstrating the ways in which the p-values are invalidated). I’m not trying to drag anyone’s name through the mud. I misremembered and thought that the authors had written in the article that they had tested red alone, but that was just bad memory–I read the article several months ago. Whether they did or didn’t doesn’t affect the problem (this is, in my view, the entire point of the Gelman/Loken article). I hope that many more “large, well-powered studies” (Gelman and Loken, forthcoming) take place, and we get to the bottom of this important question, eventually.

          Works cited:

          Gelman, Andrew and Eric Loken. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Unpublished. Online at

        • If the red/pink issue isn’t important, why does it repeatedly get raised in this discussion – including in the original paper? In fact it *does* matter because it was one of the specific criticisms raised in the Gelman paper. And to casually assert that the authors progressively chose which colors to use in a search for significance – even responding to my post incorrectly asserting that the authors specifically told this to Gelman – is again what bothers me so much about this discussion. Because regardless of how you choose to frame its importance for Gelman’s argument now, the red/pink scenario that you and Gelman did describe is both without evidence and likely to undermine these authors’ research credibility.

        • Anon:

          Here is what Eric and I wrote:

          Beall and Tracy had several inconsistencies in their rules for which data to use in their analysis. For their second sample, 9 of the 24 women didn’t meet the inclusion criteria of being more than five days away from onset of menses, but they were included anyway. 22% of their first sample also didn’t meet that criterion but were included anyway. And even though the first sample was supposed to be restricted to women younger than 40, the ages of the women included ranged up to 47. Out of all the women who participated across the two samples, 31% were excluded for not providing sufficient precision and confidence in their answers (but sufficient precision will vary with time since menses; it is easier to be certain ±1 day for something that occurred 5 days ago as opposed to 22 days ago).

          In addition, the authors found a statistically significant pattern after combining red and pink, but had they found it only for red, or only for pink, this would have fit their theories too. In their words: “The theory we were testing is based on the idea that red and shades of red (such as the pinkish swellings seen in ovulating chimpanzees, or the pinkish skin tone observed in attractive and healthy human faces) are associated with sexual interest and attractiveness.” Had their data popped out with a statistically significant difference on pink and not on red, that would have been news too. And suppose that white and gray had come up as the more frequent colors? One could easily argue that more bland colors serve to highlight the pink colors of a (European-colored) face.

          Regarding evidence, Beall and Tracy reported significance tests which are valid only under the hypothesis that they would’ve performed and reported the identical analyses had the data been different. They are the ones making the affirmative claim and it is up to them to demonstrate the plausibility of this claim. They did not pre-register their data coding and data analysis choices, and various alternative analyses would also be consistent with their research hypotheses. As Eric and I wrote in our paper, their analysis (as well as that of Bem, etc.) can be contingent on data without their doing any fishing etc on their particular dataset.

          Framing this in terms of “accusation,” “cheating,” “libel,” etc. does not help (see my PPPS above)—but it does reveal that you and your namesake are not getting the point of my article with Eric. Analysis contingent on data is not “cheating” etc., it’s just what researchers (including me) do when they don’t pre-register. I don’t think there’s anything wrong with analysis contingent on data. The mistake is reporting p-values that don’t recognize this problem, and going on to make general claims about the population based on noisy fluctuations in a sample.

        • Oops. I was focused on the red / pink issue for which I couldn’t find anything to support the claim. I can see now that the authors did manipulate the data in a way inconsistent with their methods (esp. footnote 1) and so while it’s not cheating (they tell us about it in the footnotes), it’s sloppy. I am surprised the reviewers let it through. Perhaps the reviewers missed it, like me, because the information was in the footnotes (we don’t use footnotes in my field and I have a bad habit of ignoring them).

      • >to accuse the authors of something they have no evidence for strike me as libelous

        Wow, is there actually going to be a legal angle in this after all? (cf. my previous comment above)

        Andrew, I’ve gone carefully through your “forking paths” PDF. If I were reviewing, I’d have to press you hard on the same point that I raised last fall when you first mentioned Tracy & Beall’s work, and started talking about “researcher degrees of freedom” — namely, where are the specific calculations for those DOF’s? Where is the graphic showing how the paths fork?

        It’s one thing to name-check Feynman & his sum over all possible histories, but if you’re not supplying your own integrals, I’m sorry, but IMHO you’re just hand-waving.

        Per Chekhov’s gun: if you’re going to mention “forking paths” in the title of your *mathematical statistics* paper, it shouldn’t just be a metaphor! You should be providing a quantitative model that can be directly applied to your three examples, with graphs as well, showing precisely how things spin out of control quickly.

        This is what David Bailey et al. did rigorously in their recent paper, which I sent you a few weeks ago.

        When you name-and-shame people as you have in GOFP, your case simply has to be air-tight, rock-solid & black-and-white.

        • I got waylaid by the text parser.. I put “joking!” in angle brackets after my first sentence, but it got stripped out! (Is there an escape character here?) My comment was supposed to convey gentle irony:

          Wow, is there actually going to be a legal angle in this after all? (cf. my previous comment above) ((joking!))

        • Brad:

          1. In our article, Eric and I discuss several possible forking paths in the Beall and Tracy paper and in the three other papers we discuss there.

          2. We are engaging in scientific criticism, which is a central part of science. With regard to “naming,” it would be odd to make the criticism without pointing to the papers being criticized, thus the names of the authors cannot be a secret. With regard to “shaming,” we’re not trying to shame anyone. We’re saying that Beall and Tracy, along with Bem and the others we mention, have made some mistakes. It is not shameful to make a mistake, indeed the key point of our paper is that it is easy to make that particular mistake without realizing it (which is why we prefer the term “forking paths” to “fishing” and “p-hacking,” which to me implies intent).

          3. I disagree with your statement that our analysis needs to be “air-tight, rock-solid & black-and-white.” It is Beall and Tracy, and the others who are making strong statements based on highly noisy data. As Eric and I wrote, “without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.” But this p-value is based on a strong assumption for which Beall and Tracy, Bem, etc. provide no evidence: an assumption that the data processing and analysis would be the same had the data been different. It is their duty to make this case; all we are doing is pointing out that (a) their conclusions require this assumption of a pre-chosen analysis, and (b) for various reasons (for a general discussion, see item 2 on page 12) this assumption does not seem to us to be well-founded.

          They’re the ones publishing a strong claim about human nature based on weak evidence; I think it’s their duty to justify the assumptions that support their strong claim. I really really really object to the idea that a criticism of a published paper should be held to higher standards than the paper itself. This to me represents the idea of privileging the publication process that has led to so many problems in science understanding and science reporting.

        • > really really really object to the idea that a criticism of a published paper should be held to higher standards than the paper itself

          One would think that should not be hard to get across but …

          From David Bailey et al., referred to above:

          “We would feel sufficiently rewarded in our efforts if this paper succeeds in drawing the attention of the mathematical community to the widespread proliferation of journal publications, many of them claiming profitable investment strategies on the sole basis of in-sample performance.”

          If in 2014 a scholarly article can be published (and applauded) whose primary objective is to get across a message that has been in almost every introductory statistics course since the mid-1970s (you can’t reliably learn using just in-sample performance), then certain types of criticism, these “you can’t get what you want so easily” lessons, are very very very hard to get across.

        • Andrew, thank you for your response.

          >With regard to “shaming,” we’re not trying to shame anyone.

          Perhaps not, but as you say in the paper, people sometimes do things they don’t intend to..

          >I really really really object to the idea that a criticism of a published paper should be held to higher standards than the paper itself.

          But you’re not just criticizing a single paper. You’re criticizing a whole school of scientific thought & practice. With respect, your objection doesn’t seem well-founded. If established beliefs within a community could be so relatively easily discredited, wouldn’t that be chaotic? I know you like to quote the classic dictum on “extraordinary claims”. Keep in mind though, those you criticize may be applying this dictum themselves to *your* claims! To the extent their positions represent an accepted status quo, the fair onus is on you to go that extra mile. I think you have it in you, you can do it! :)

        • Brad:

          I don’t know if this helps, but I’m not the only one criticizing this sort of work. Look up some of the literature cited on this blog, for example the work of Nosek, Button, Ioannidis, Simonsohn, etc. And of course my own work is available for criticism; indeed, I have many times recognized valuable criticism of my published and unpublished work.

          Finally, let me repeat that the work of Bem, Beall and Tracy, etc., does not stand on its own. It all relies heavily on statistics. These people are not Jean Piaget, doing one-on-one psychology with careful observations of individuals. They’re putting together noisy datasets, finding statistical significance, and declaring victory. If they wanted to operate entirely outside the world of statistics, Uri Geller style, that would be one thing. But that’s not what they are doing. To the extent that this sort of work represents “established beliefs within a community,” that is indeed chaotic. But that’s not my fault, it’s theirs. In particular, it’s their fault for having a naive view that sloppy data collection can routinely yield enduring insights about human nature. I’m not the only person who thinks that this approach to science is ridiculous.

          Finally, you write: “I know you like to quote the classic dictum on ‘extraordinary claims.’” I don’t know that I’ve ever in my life quoted this dictum (I did a quick search and didn’t find the phrase on this blog), so I don’t know what you’re talking about.

  16. In fact, I’d like to amend the question to imagine that even before beginning, the authors knew that they were going to use whether a certain day was a fertile day as a binary variable to look for a uniformly distributed response during fertile days, and knew that they would include both red and pink. It doesn’t change the answer to the question.

  17. I’d love Andrew / Eric to post some more examples of published studies of this sort:

    (a) Using Bayesian methods (no p-values) yet guilty of the forking paths pathology

    (b) Using p-values yet not guilty of the forking paths pathology

    That might clarify some of their arguments more.

  18. Andrew, thank you for your further perspective and recommended reading.

    >the classic dictum on ‘extraordinary claims.’” I don’t know that I’ve ever in my life quoted this dictum..

    Yes, oops, I’m so embarrassed. A few weeks ago, I started composing a comment on a different posting of yours. My comment hinged on that quote. Before committing, I thought to check myself and see how “original” my perspective was in the context of your blog. So I googled “Andrew gelman extraordinary claims”.

    As you can see, the first three results indicate “posted by Andrew”. I was too lazy to click through to verify; rather, I totally ran with my confirmation bias. I was remembering my superficial, incorrect conclusion in this latest dialog. Google should do a better job parsing out blog comments for attribution, but I shouldn’t have blindly trusted them. Apologies for my confusion.

  19. Pingback: The inclination to deny all variation - Statistical Modeling, Causal Inference, and Social Science
