
When you believe in things that you don’t understand

This would make Karl Popper cry. And, at the very end:

The present results indicate that under certain, theoretically predictable circumstances, female ovulation—long assumed to be hidden—is in fact associated with a distinct, objectively observable behavioral display.

This statement is correct—if you interpret the word “predictable” to mean “predictable after looking at your data.”

P.S. I’d like to say that April 15 is a good day for this posting because your tax dollars went toward supporting this research. But actually it was supported by the Social Sciences Research Council of Canada, and I assume they do their taxes on their own schedule.

P.P.S. In preemptive response to people who think I’m being mean by picking on these researchers, let me just say: Nobody forced them to publish these articles. If you put your ideas out there, you have to be ready for criticism.


  1. RJB says:

    Maybe I skimmed too quickly, but it looks like they ran one study on Amazon Mechanical Turk, used a researcher degree of freedom to split by weather when they didn’t find the overall result, and found something consistent with a reasonable explanation: women go red when it’s too cold to go skimpy. If they stopped there, I can see why you’d claim “data snooping” and call it a day.

    But they followed up with a controlled experiment (collecting some data on a cold day and some on a warm day) that replicated the weather-by-ovulation interaction. Sure, you could ask for a larger sample size and more sophisticated statistics (and from your Slate article, a better definition of ovulation), but I assume by your focus on “predictability” your primary criticism is that the paper is post hoc data snooping.

    Doesn’t the second experiment address that concern to soothe Karl Popper just a little (ignoring that they designed a study to test for support, not falsification)?

    • question says:

      I don’t see how Popper would be soothed. There is nothing to be falsified other than “one group of ovulating women wear red/pink shirts exactly the same percent of the time as a different group of non-ovulating women”, which appears to be a strawman. Also, shouldn’t something in that chart sum to 100%? I couldn’t make sense of what it was showing.

      • Nick Menzies says:

        Regarding the graph: it shows percent ovulating for 4 different groups (red vs. not red crossed with warm vs. cold), so okay not to sum to 100%. I actually dislike it when I see both halves of a binary classification reported/graphed — there is the let down when you realize you are getting 1/2 as much information as you thought.

        What I find strange about the graph is that the error bars represent the std error of the mean, if you read the fine print. Is this common practice in this field? If I see error bars I immediately think it is some kind of 95% interval. If their approach is not standard in the field, this opens up a nice new avenue for lying with statistics, or statistical graphics anyway.
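
        To make the gap between the two kinds of error bars concrete, here is a minimal R sketch. The proportion and group size are hypothetical (they are not taken from the paper), and the interval uses the normal approximation, which is only rough for small groups.

        ```r
        # Hypothetical values for illustration only; not taken from the paper
        p_hat <- 0.18   # assumed observed proportion
        n     <- 50     # assumed group size

        se     <- sqrt(p_hat * (1 - p_hat) / n)          # standard error of a proportion
        se_bar <- p_hat + c(-1, 1) * se                  # what +/- 1 SE error bars show
        ci95   <- p_hat + c(-1, 1) * qnorm(0.975) * se   # normal-approximation 95% interval

        width_ratio <- diff(ci95) / diff(se_bar)         # equals qnorm(0.975), about 1.96
        print(rbind(se_bar, ci95))
        print(width_ratio)
        ```

        Under this approximation a 95% interval is roughly twice as wide as SE bars, so SE bars that appear not to overlap can still correspond to overlapping 95% intervals.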

        • question says:


          It is often as childish as choosing the narrowest error bars made available by your graph software (which will be SEM). There has been some research concluding that people do not really bother to make the distinction anyway.

          “We conclude that very many researchers whose articles
          have appeared in leading journals in psychology, behavioral
          neuroscience, and medicine have fundamental and severe
          misconceptions about how CIs and SE bars can justifiably
          be used to support inferences from data.”

  2. Dwayne Woods says:

    Don’t fret. Be mean and come clean. As you said, if they put it out there, then expect to be confronted.

  3. Rahul says:

    I’ll try extending the protocol: Next step is to replicate with another cold weather study. If unfortunately we find that ovulating women were no more likely to wear red/pink, well, no problem. Now we split the data set into married & unmarried. Aha! Effect persists in the unmarried ones.

    Replicate with a cold weather study using only unmarried females. Again, no go. Don’t give up. Split them into pretty & non-pretty women (all self identified, of course). Aha! There it appears again. Robust effect indeed.

    Eventually, we might conclude: Pretty, unmarried, vegan, athletic women who vote Democrat are more likely to wear pink while ovulating. Only in cold weather, of course.

    • Anonymous says:

      Here is my irrefutable law: Ovulating women wear red, except when they don’t.

      • Andrew says:

        And on day 7 of their cycle, they’re probably not ovulating.

        • Anonymous says:


          Using the word “probably” hurts my theoretical ego. I only deal in irrefutables.

        • Dan Wright says:

          But they would argue that if a woman on Day 7, who was wearing red, got classified as ovulating when in fact she was not ovulating, this would lower the size of the effect. Well, at least according to their reply to you from the Psych Sci paper on

          “Furthermore, if our categorization did result in some women being mis-categorized as low-risk when in fact they were high risk, or vice-versa, this would increase error and decrease the size of any effects found.”

          At least this is how I read this. If they had meant the expected effect goes down with random error they would have said this.

          • Andrew says:


            Yes, they argued that but they are mistaken in the implications of this fact. Eric Loken and I are writing a paper about this. The short answer is that misclassifications lower the expected effect size which in turn makes it less likely that any comparison that happens to be statistically significant is actually telling us anything about reality in the larger population. In short, the probabilities of Type S and Type M errors become very high. Their problem is that they are taking statistical significance as a signal to take their effect as true.

            To put it another way, they found that, in their data and using their definitions, women in those day-of-cycle categories were three times more likely to wear red or pink shirts. Do they or anyone else believe that if they’d defined the day-of-ovulation categories correctly, they’d have found even larger effects? No. What they are doing is capitalizing on chance, and each source of measurement error just disconnects that chance one more step from the underlying phenomenon of interest.
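
            The argument above can be illustrated with a small design calculation. This is only a sketch under stated assumptions: the estimate is treated as normally distributed around a positive true effect with a known standard error, the function name `design_errors` and all numbers are hypothetical, and it is not the analysis from the paper mentioned above.

            ```r
            # Sketch: power, Type S rate, and exaggeration ratio (Type M) for an
            # estimate ~ Normal(true_effect, se), with true_effect > 0 assumed.
            design_errors <- function(true_effect, se, alpha = 0.05) {
              z    <- qnorm(1 - alpha / 2)
              a    <- z - true_effect / se     # standardized upper cutoff
              b    <- -z - true_effect / se    # standardized lower cutoff
              p_hi <- 1 - pnorm(a)             # P(significant with the correct sign)
              p_lo <- pnorm(b)                 # P(significant with the wrong sign)
              power <- p_hi + p_lo
              # Expected |estimate| given significance, from truncated-normal means
              mean_abs_sig <- (true_effect * (p_hi - p_lo) +
                               se * (dnorm(a) + dnorm(b))) / power
              c(power = power,
                type_s = p_lo / power,
                exaggeration = mean_abs_sig / true_effect)
            }

            # Hypothetical numbers: misclassification halves the true effect
            # while the standard error stays the same.
            full       <- design_errors(true_effect = 0.50, se = 1)
            attenuated <- design_errors(true_effect = 0.25, se = 1)
            print(rbind(full, attenuated))
            ```

            Comparing the two rows shows the direction of the problem: with the standard error held fixed, shrinking the true effect lowers power and raises both the Type S rate and the exaggeration ratio among the results that reach significance.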

            • Rahul says:

              Say I have a y vs x correlation. Now I find measurements of x & y both were very noisy but with uncorrelated errors. Doesn’t that mean the true correlation is actually stronger than what was measured?

              Isn’t what you are describing similar? If so, why are they wrong?

              • Andrew says:


                Your intuition would be correct here if you had observed the correlation between these measurements in the population. The problem is that, in the study in question, the researchers observed the correlations in a small sample. The presence of large measurement error makes it less likely that these correlations in the sample correspond to anything relevant in the population.

              • Rahul says:


                Ah! I see what you mean (I think).


            • question says:

              Something like this?

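              set.seed(1)  # for reproducibility
              # Assumed settings: the data-generating lines are missing from the comment as posted
              n <- 20; delta <- 0.2; noise_sd <- 2
              p <- matrix(nrow = 10000, ncol = 2)
              eff <- matrix(nrow = 10000, ncol = 2)
              for (i in 1:10000) {
                # Without error: two groups whose true means differ by delta
                x <- rnorm(n, mean = delta); y <- rnorm(n, mean = 0)
                p[i, 1] <- t.test(x, y, var.equal = TRUE)$p.value
                eff[i, 1] <- mean(x) - mean(y)

                # With error: the same draws plus independent measurement noise
                x <- x + rnorm(n, sd = noise_sd); y <- y + rnorm(n, sd = noise_sd)
                p[i, 2] <- t.test(x, y, var.equal = TRUE)$p.value
                eff[i, 2] <- mean(x) - mean(y)
              }

              mean(p[, 1] < 0.05) # Share significant w/o error
              mean(p[, 2] < 0.05) # Share significant w/ error

              # Estimated differences among the significant results only (true difference is delta)
              hist(eff[p[, 1] < 0.05, 1], xlab="Mean(x)-Mean(y)", main="Sig Results No Measurement Error")
              hist(eff[p[, 2] < 0.05, 2], xlab="Mean(x)-Mean(y)", main="Sig Results With Measurement Error")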

      • question says:

        You may be surprised to find out that, according to the common statistical logic, just because you reject the hypothesis that an ovulating woman is wearing red, this does not mean you can accept the hypothesis that she is not wearing red.

        • Anonymous says:

          @ question:

          This is why my laws are deterministic and irrefutable.

          Trust me, I’ve seen it time and again. All you have to do is torture the data enough, and it will tell you what you knew all along to be the truth.

          Statistics is for losers.

          • question says:


            H0 == A == Ovulating woman is wearing a red shirt
            H1 == B == Not A == Ovulating woman is NOT wearing a red shirt
            Transposing the Conditional: The probability A is true given B != the probability B is true given A.

            Conclusion: Therefore, if we reject H0, this does not imply H1 is true. We would need to know the prior probabilities of H0 and H1 to say.

            • question says:

              Addendum: If we fail to reject H0, this does not imply H0 is true. However, it does indicate there is insufficient evidence to claim H1 is true.

            • Anonymous says:


              You are being pedantic.

              My prior that A is true is 1.

              The likelihood that A is true is 1. Why? Because I know how to torture data such that it confesses my priors. Ergo A is true bc I say so.

              Mental onanism about NHST and Popperian stuff won’t get you anywhere. Start torturing the data.


              • question says:

                You’re right. I forgot about the difference between theory and practice. In theory, scientists use objective methods to interpret their data. In practice, they measure their collective opinion.

      • Daniel says:

        I refute your law with a counterexample: an ovulating woman who wears a red skirt while she does not wear a red blouse. Formulating really irrefutable laws is difficult, Anon.

  4. Noah Motion says:

    We should expect no such interaction in cultures where there is never an option to wear less clothing, right? But then, maybe in such cultures, red clothing is also more rare. If so, now we have an explanation!

  5. Eric Loken says:

    Oh my goodness. I really would not have expected them to double down like this. It’s difficult to know where to begin because their will to believe is so resolute, and yet they show so little self-awareness. I guess we could start at the beginning with the data. In the Psych Science article it was just barely possible to reconstruct the actual 2 by 2 tables on which they carried out their calculations. This time it’s more difficult as their experiment has a 3-way structure and they are not clear about the margins. But go figure, the distribution of 22 cases of wearing red, in the four relevant cells of their N = 208 table, has once again confirmed their discovery and also shown an important boundary condition. I wonder if they know the boundary condition on their findings? They keep powering their studies so that the 3:1 effect size just ducks under p < .05. They are at best 2 for 4 by my count (I discount the first N = 25 replication), but even that's pending a look at the data.

    • Dean Eckles says:

      Yeah, as far as I can tell, it’s not actually clear that the simple contrast of conception risk vs non-CR women for the cold day is even significant. (This is hard to figure out, since as you say, we don’t get the actual contingency table. But for most possible values of the number of CR women on the cold day for which there are Y values that round to 18% and 5%, this comparison is not going to be significant.) Rather, the interaction is significant because the difference for the warm day is in the other direction.
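
      The kind of check described here can be sketched in R. Everything numeric below is an assumption: the cold-day sample size is not given in this thread, so `n_cold` is a placeholder, and the sketch reads 18% and 5% as the shares wearing red or pink among conception-risk and other women on the cold day. For each possible split it keeps the nearest whole-number counts whose percentages round to 18% and 5%, then runs Fisher's exact test on the resulting 2 by 2 table.

      ```r
      # Hypothetical: the cold-day sample size is NOT reported in the thread
      n_cold <- 100

      candidates <- list()
      for (n_cr in 10:(n_cold - 10)) {
        n_non   <- n_cold - n_cr
        red_cr  <- round(0.18 * n_cr)    # candidate count wearing red/pink among CR women
        red_non <- round(0.05 * n_non)   # candidate count among non-CR women
        # Keep only splits whose percentages round to the reported 18% and 5%
        if (round(100 * red_cr / n_cr) == 18 && round(100 * red_non / n_non) == 5) {
          tab <- matrix(c(red_cr, n_cr - red_cr, red_non, n_non - red_non), nrow = 2)
          candidates[[length(candidates) + 1]] <- data.frame(
            n_cr = n_cr, red_cr = red_cr, n_non = n_non, red_non = red_non,
            p_value = fisher.test(tab)$p.value
          )
        }
      }
      results <- do.call(rbind, candidates)
      print(results)
      print(mean(results$p_value < 0.05))  # share of consistent tables with p < .05
      ```

      With the real cold-day total in place of `n_cold`, the last line would give the share of tables consistent with the reported percentages in which the simple contrast reaches significance.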

  6. Entsophy says:

    Papers like this remind me that academia needs a shake-up: a wild-west period where everything is uncertain and respect is won, not given out by committees (thesis or tenure). I saw this by Freeman Dyson today. Notice the part about WWII:

    Interviewer: You became a professor at Cornell without ever having received a Ph.D. You seem almost proud of that fact.

    Dyson: Oh, yes. I’m very proud of not having a Ph.D. I think the Ph.D. system is an abomination. It was invented as a system for educating German professors in the 19th century, and it works well under those conditions. It’s good for a very small number of people who are going to spend their lives being professors. But it has become now a kind of union card that you have to have in order to have a job, whether it’s being a professor or other things, and it’s quite inappropriate for that. It forces people to waste years and years of their lives sort of pretending to do research for which they’re not at all well-suited. In the end, they have this piece of paper which says they’re qualified, but it really doesn’t mean anything. The Ph.D. takes far too long and discourages women from becoming scientists, which I consider a great tragedy. So I have opposed it all my life without any success at all.

    I was lucky because I got educated in World War II and everything was screwed up so that I could get through without a Ph.D. and finish up as a professor. Now that’s quite impossible. So, I’m very proud that I don’t have a Ph.D. and I raised six children and none of them has a Ph.D., so that’s my contribution.

  7. Rahul says:

    In general, how good are women at accurately remembering the date of “the first day of their last period of menses” when asked about it casually in a survey, without warning? Is this something women tend to remember? Just curious.

    In the context of this study even a one-day error would be huge, right?

    • Andrew says:


      The standard advice is that days 10-17 are most fertile, so based on day alone, a 1-day error is not such a big deal. But really the whole thing is hopeless. When studying effects that are so tiny and so variable, you need much more precision in every way, and it’s a fatal mistake to try to study this via a between-person design. On the other hand, if a researcher just wants to come up with an endless string of statistically significant findings that can be taken to confirm a vaguely specified theory, the design they’re using seems to work just fine.

      • Rahul says:

        If you wanted to design a study from scratch to verify this assertion how would you do it?

        • Andrew says:

          Within-person design (that is, interview people repeatedly over a series of months), measure ovulation as accurately as possible, look at other outcomes too, not just what shirt you’re wearing. Maybe preregister your analysis and research hypothesis too. One of the difficulties is that the hypothesis is so vague. Maybe it would make sense to formally have a two-stage design where the first stage is exploratory and the researchers use the analyses from these data to formulate a specific hypothesis, then they could test in the second stage. Think seriously about effect size and do a corresponding design analysis. It’s not easy. But it is easy to do the sort of study that they did do, and as long as such studies get published in top journals you can see where the incentives go.
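
          As one concrete piece of the design analysis mentioned above, here is a rough sample-size sketch for the simpler between-person comparison of two proportions, using R's `power.prop.test`. The baseline rate and the candidate effect sizes are hypothetical, chosen only to show how quickly the required sample grows as the assumed effect shrinks. A within-person design would need a different calculation.

          ```r
          # Hypothetical baseline and effects; none of these numbers come from the study
          baseline <- 0.10                        # assumed share wearing red/pink outside the fertile window
          effects  <- c(0.02, 0.05, 0.10, 0.20)   # assumed increases in that share

          n_per_group <- sapply(effects, function(d) {
            # Sample size per group for 80% power at the two-sided 5% level
            power.prop.test(p1 = baseline, p2 = baseline + d,
                            power = 0.80, sig.level = 0.05)$n
          })

          design <- data.frame(effect = effects, n_per_group = ceiling(n_per_group))
          print(design)
          ```

          Reading the table from the largest assumed effect to the smallest shows why thinking seriously about effect size comes before data collection: a sample that is adequate for a large assumed difference can be far too small for a modest one.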

        • Anonymous says:

          Here is a test. Go out to a disco or wherever it is young people go to hook up these days.

          Identify the women wearing red, and those not wearing red.

          Toss a coin and, depending on the outcome, try to seduce the woman wearing red or not.

          Repeat. You may want to control for the rounds of drinks.

          According to the theory you may have more success with those wearing red.

          1. The more attractive you are the less power you’ll need.
          2. Ladies: Apologies if this comment sounds sexist, but I had no time to wordsmith it better.

        • Anonymous says:

          PS Another implication is that, evolutionarily speaking, men ought to be more attracted to women in red.

          So before I am accused of being sexist, women can repeat above experiment. Just toss a coin to decide whether to wear red or not, and then tabulate the number of (fooled) men making a move. If you are very attractive you’ll have a lot of power even if the sample size is small.

        • D.O. says:

          At the very least, they might have tried to figure out whether women wear less clothing in warm climates when ovulating than when not ovulating. That’s their explanation for the effect after all.

  8. Rahul says:

    Out of any random sample of (young) women how likely is it that half (50%) are in the fertile window?

    Their data shows 51 women among 100 surveyed as ovulating (i.e. high conception risk)? My naive calculation says 32% would be the expected number. Isn’t 51% a bit too high? Or is that variability to be expected?
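
    A quick way to check this is a binomial calculation. The sketch below assumes a 28-day cycle and a 9-day high-risk window, which is one combination that reproduces the 32% figure (the window length is an assumption here, not something stated in this comment), and it treats the women as independent draws spread uniformly over the cycle.

    ```r
    cycle_days  <- 28   # assumed cycle length
    window_days <- 9    # assumed length of the high-conception-risk window

    p_window <- window_days / cycle_days   # expected share in the window, about 0.32
    n        <- 100
    observed <- 51

    # Probability of seeing 51 or more of 100 in the window if p_window is right
    tail_prob <- pbinom(observed - 1, size = n, prob = p_window, lower.tail = FALSE)

    print(round(p_window, 3))
    print(tail_prob)
    ```

    If the tail probability is very small, sampling variability alone is an unlikely explanation under these assumptions, and other explanations (irregular cycles, recall error, who chose to answer the survey) become more plausible.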

  9. David E says:

    Tax day in Canada isn’t until April 30th, but I imagine a 95% CI around that date would cover the 15th! Given the sorry state of funding for the humanities and social sciences in Canada, though, maybe it’s better that we just keep this little piece of science out of the purview of elected officials looking for justification to cut what’s left.
