Too Good To Be True: The Scientific Mass Production of Spurious Statistical Significance

Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here’s how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.

The paper I’ll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession. . . .

In focusing on this (literally) colorful example, I don’t mean to be singling out this particular research team for following what are, unfortunately, standard practices in experimental research. Indeed, that this article was published in a leading journal is evidence that its statistical methods were considered acceptable. Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns. . . .

Full story here.

38 thoughts on “Too Good To Be True: The Scientific Mass Production of Spurious Statistical Significance”

  1. Correction: APS was *supposed* to be the serious, research-focused (as opposed to therapeutic) end of the psychology profession.

    Recently Psychological Science has been more akin to the New York Post.

  2. If the correct statistical procedure were rigorously followed for studies currently published in journals like these, almost no paper would be accepted for publication, and this includes work where researchers do not go overboard with their researcher degrees of freedom. I’m not making an excuse for this kind of research, just making an observation.

    This is perhaps why even scientists who know they should correct for multiple comparisons just go ahead and publish at alpha=0.05.

    What’s to be done? Would it be better if one fitted Bayesian models and started reporting the probability (given the data) that the parameter in question is >0 or <0?
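
    In case it helps make that suggestion concrete, here is a minimal sketch of the kind of report being proposed, assuming invented counts rather than anything from the paper: conjugate Beta-Binomial models for the proportion wearing red/pink in the hypothesized fertile window versus the rest of the cycle, with the posterior probability that the difference is positive.

```python
# Hypothetical sketch: posterior probability that a difference in proportions
# is > 0, using conjugate Beta-Binomial models. Counts are invented, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

red_fertile, n_fertile = 8, 22    # assumed: red/pink wearers among women coded "fertile"
red_other, n_other = 10, 102      # assumed: red/pink wearers in the rest of the cycle

# Flat Beta(1, 1) priors; the posterior for each proportion is Beta(hits + 1, misses + 1).
p_fertile = rng.beta(red_fertile + 1, n_fertile - red_fertile + 1, size=100_000)
p_other = rng.beta(red_other + 1, n_other - red_other + 1, size=100_000)

# Posterior probability (given data and priors) that the "fertile" proportion is larger.
print("P(p_fertile > p_other | data) =", (p_fertile > p_other).mean())
```

    Nothing about this fixes the multiple-comparisons or sampling problems discussed above; it only changes what gets reported.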

    • I don’t think better modeling alone will help. Essentially journal editors and universities need to take a step back from headline chasing with novelty and titillation. Unlikely to happen, I know, but there it is.

      • Editors and institutions (and perhaps scientists themselves?) setting up some new benchmarks/a higher standard is about the only thing they can do for anyone to take them seriously. I mean, seriously, look at all these issues in science today. It’s almost a satire. I wonder how many journals, institutions, and scientists are really scientific?

        I will pre-register the hypothesis (for a highly powered study of course) that 20% of these parties aren’t really scientific at all and are just in it for the money or whatever, 20% are genuinely scientific, think critically about things, and try to better science, and 60% are too lazy or ignorant to do anything, or to really think for that matter.

        This pre-registered hypothesis would provide some data (be they significant or not: no file-drawer here) for “the theory of ‘science’ around the year 2000”, which posits that journals, institutions, and scientists at this point in history hold themselves to be so sophisticated that they seem to have forgotten what science is about, that there could even be some sort of responsibility involved in it (sort of like being a doctor), and that they could collectively take a few simple actions for everything to possibly be much, much better (or at least try and investigate this, which could basically be seen as them just doing their job).

        It seems so simple, but that’s probably not a very sophisticated thing to say, nor is it probably very responsible.

        • It’s not at all new (e.g. http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/ ) – it has just become much more noticeable given current technology, while methodologists have woken up to the seriousness of the problem.

          My favourite is JG Gardin, who was funded to do AI in archeology and developed a pattern-search program. Realizing how this could pollute scientific research in archeology, he decided not to make it available. His funders forced him to, and he published it in protest. In 1985 he mentioned at a meeting that he had tracked 17 PhDs that had been granted in archeology to people who had run his program and colourfully interpreted the likely unreal patterns it had found. My old business policy professor told me two years ago that this is exactly what junior people in his field do now, given “data mining”.

          As for the change, you used to have to walk to the library and read a print copy and then manually try to find similar papers in other journals (in other libraries perhaps). When did even a mention of how research data is generated and processed by communities, and the challenges this creates in discerning real patterns (as is addressed, for instance, in _meta-analysis_), make it into mainstream statistical texts and education (post 2010)?

  3. This was a fine write-up, but the implicit take-home message is rather nihilistic (don’t trust anything where there *could* be multiple comparisons). Putting aside when peak fertility occurs (a shame that this was messed up), how would you suggest that the authors analyze this data? What is the constructive side of the critique?

    • Jb:

      Setting aside the error on the fertility dates, and setting aside the nonrepresentativeness of the sample, I don’t think much can be learned from n=124 here. This project could be a fine term project for a class, but I don’t think we’re going to get much enduring knowledge out of it.

      As I’ve suggested before, I think the best place for such studies would be in a new journal, “Speculations in Psychological Science,” where researchers could present speculations and raw data without having to try to be conclusive.

      • Perhaps I misunderstand, but putting aside an unfortunate error and a (typical, but limiting) non-representative sample, your suggestion would be to collect more data? If so, how do the multiple comparisons fit in here?

        • Jb:

          In answer to your original question, “How would you suggest that the authors analyze this data?”, I’d suggest a hierarchical model, considering all the different outcomes that were measured. (I have no idea what was on the survey form, but I assume they asked about other clothing than shirts, for example.) But I don’t think any amount of modeling will yield much here, between the small sample size, the nonrepresentative sample, and the measurement problems.

        • But they might not have measured other things, so it is a bit of a stretch to make it a primary concern of the paper without having any evidence.

          Putting that aside, how do you then recommend people interpret political science papers that use datasets with huge numbers of variables like the ANES and GSS? I understand these data sets have better samples (large and representative of US) but I don’t quite see how this eliminates any kind of multiple comparison issue?

        • Shrinkage between estimates of the subgroups under consideration that may have an interesting effect (eg “people wearing red clothing”).

          The more variables one includes, the more precisely one can estimate the distribution of outcomes across subgroups. The prior distribution across subgroups shrinks the estimates and attenuates the population contrasts compared to doing independent ML estimates for each subgroup.
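
          A minimal numerical sketch of that shrinkage, using invented subgroup estimates and standard errors (nothing from the actual study), with a crude method-of-moments step standing in for the full hierarchical fit Andrew suggests above:

```python
# Minimal empirical-Bayes sketch of partial pooling across subgroups.
# Subgroup estimates and standard errors below are invented for illustration.
import numpy as np

y = np.array([0.40, 0.10, 0.26, 0.18])   # assumed unpooled subgroup estimates
se = np.array([0.10, 0.09, 0.07, 0.08])  # assumed standard errors of those estimates

# Precision-weighted grand mean across subgroups.
grand_mean = np.average(y, weights=1.0 / se**2)

# Crude method-of-moments guess at the between-subgroup variance tau^2.
tau2 = max(np.var(y, ddof=1) - np.mean(se**2), 0.0)

# Partial pooling: noisier estimates (large se relative to tau) shrink more
# toward the grand mean, attenuating the contrasts between subgroups.
shrink = tau2 / (tau2 + se**2)
pooled = grand_mean + shrink * (y - grand_mean)

print("unpooled:        ", y)
print("partially pooled:", pooled.round(3))
```

          The point of the sketch is only the mechanism: the most extreme subgroup contrasts get pulled in, which is exactly what a striking "3 times more likely" estimate from a small subgroup needs.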

        • jb: You can’t really put aside errors in facts and the use of samples of convenience. This is a shaky foundation that heightens concerns about multiple comparisons, etc.

          I don’t think that Andrew is randomly picking a survey and then guessing all of the possible ways that multiple comparisons could occur — the nihilist option. Rather, I think he’s looking at a series of poor choices, a lack of clear documentation, and sensationalist findings in a particular study and quite reasonably saying that (undocumented) multiple comparisons also seem quite likely.

          The answer is to gather a reasonable sample, get your facts straight, clearly document your instrument, and then publish. OR to not take your study seriously, as Andrew suggested, but to offer it as something fairly speculative. You’re asking him to specify the lipstick brand and color to put on this pig.

        • This is not quite what I am trying to do. I’m starting with two premises: 1. Most researchers in the social sciences cannot get representative samples to test their specific hypotheses and 2. most will probably not make the fertility error (most aren’t studying anything related to this). So, then we are primarily left with the multiple comparison issue.

          Given that multiple comparisons are not magically fixed by representative samples and clearly documented studies, I think it is fair to ask how a researcher would be expected to handle this issue (practical advice is a good thing after all) and to figure out why this issue would apply to some psych science studies, but not political science studies (some of which we know have multiple possible comparisons).

          I like seeing this kind of data driven writing on Slate, I’m just trying to probe these issues further.

        • Andrew would have in mind some sort of hierarchical model (as he mentioned above); there are also bootstrap procedures that can address these problems; there’s a recent summary by Romano, Shaikh and Wolf (2008, Econometric Theory, http://ideas.repec.org/a/cup/etheor/v24y2008i02p404-447_08.html has a link to a working paper version; it probably also turns up in Google Scholar).

          These concerns are also what motivates a lot of the partial identification literature (i.e. Manski’s stuff: http://faculty.wcas.northwestern.edu/~cfm754/)
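
          For something concrete, here is a small sketch of the classical Holm step-down adjustment (not the Romano-Shaikh-Wolf bootstrap itself, just the simpler familywise-error control it improves on), applied to invented p-values:

```python
# Classical Holm step-down adjustment for a family of p-values.
# This is the simple familywise-error baseline, not the Romano-Shaikh-Wolf
# bootstrap; the p-values are invented for illustration.
import numpy as np

def holm_adjust(pvals):
    """Return Holm-adjusted p-values (monotone in the raw p-values, capped at 1)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):           # rank 0 = smallest raw p-value
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Five hypothetical comparisons; at alpha = 0.05 only the first survives adjustment.
print(holm_adjust([0.004, 0.03, 0.04, 0.20, 0.50]))
```

          The bootstrap methods in that literature do the same job while exploiting the correlation between test statistics, so they are typically less conservative than this baseline.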

        • The issues that AG identifies in his excellent piece seem to me to be a big problem in many areas of psychology.

          “1. Most researchers in the social sciences cannot get representative samples to test their specific hypotheses.” Answer: work harder to get representative samples. Before complaining about the time and effort, look at those of your colleagues who are actually doing this. If you can’t do it yourself, collaborate. Use what you know to interest collaborators who can help you to answer your questions.

          2. “Most will probably not make the fertility error (most aren’t studying anything related to this.)” I review about a dozen psych articles a year for top-ranked psychology journals: plenty of shocking errors. Lots of publishing in this field, but I get the sense not much reading. Slow the f* down.

          3. “Multiple comparison issue.” Try harder to destroy your results. Explain your best attempts. If you can’t avoid small samples, use past knowledge and write Bayesian models. Don’t pretend those models are more than models. Embrace uncertainty.

          Finally: publish your best research in Frontiers or PLOS One. Let your work speak for itself, not the badge.

        • jb: “Most researchers in the social sciences cannot get representative samples to test their specific hypotheses”. Then they should become pundits and publish op ed pieces.

          It’s not worth worrying about multiple comparisons with poor data. That’s like worrying about whether one should use reduction or souring when cooking manure.

        • If you can’t get representative samples that are large enough then you simply are not going to be able to come to any very convincing conclusions. No statistical hocus-pocus can solve the “garbage in, garbage out” problem. You of course _can_ go through the same analytical motions with a small, non-representative dataset as you would with a large, properly representative dataset, but what you are doing is “cargo-cult” science.

          http://en.wikipedia.org/wiki/Cargo_cult_science

        • To me most of the rest is being driven by the search for the sensationalist findings. Psychologists ARE taught about these things but the search for bling-bling has created a race to the bottom in terms of hygiene for the review system and for data analysis in the first place.

          Part of it is also driven by the sheer VOLUME of publication in some areas of psych. In a hardcore area like developmental or animal behavior it’s normal to publish one article a year. Those articles go through a TOUGH review process and it’s not unusual for reviewers to ask for an entirely new set of experiments to be run. Unsurprisingly those articles tend to be pretty good. (I think they are sometimes excessively conservative about some of their procedures.) Ditto in areas like psychometrics or mathematical psychology, where the expectations are generally much higher, the review process is similarly tough, and the system is filled but not overwhelmed.

          In other areas, expectations for publication are much looser and numbers are higher. New PhDs often have more articles than a senior faculty member in those other fields. There’s just no way those articles are being adequately reviewed. It truly can’t be done.

        • “Psychologists ARE taught about these things but the search….”

          Perhaps it would be interesting to really investigate this. Go to different universities and really look closely at the curriculum. Sure they will probably be taught how to analyse data using SPSS, but would they for instance also be taught about the possible implications concerning QRP’s, confirmatory vs. exploratory analyses, having adequate power, etc. ?

          I think that would make a very interesting research topic: investigate the curriculum of different universities, and assess the knowledge students have gathered (or should have gathered), and whether they adhere to this knowledge and if not: why not.

        • “Perhaps it would be interesting to really investigate this. Go to different universities and really look closely at the curriculum. Sure they will probably be taught how to analyse data using SPSS, but would they for instance also be taught about the possible implications concerning QRP’s, confirmatory vs. exploratory analyses, having adequate power, etc. ?”

          And after you have done that, you could read the following article (especially page 647), and compare things with “planet F345”.

          http://pps.sagepub.com/content/7/6/645.full.pdf+html

  4. It’s great to see a column like that on Slate!

    It would probably be too much work to do regularly, but a very nice trick would be to do the “correct” (well, more correct) statistical analysis for the study too—if not on the original data, then on representative randomly generated data. Walking the reader through some of the details even more (with graphs!) could make the point even stronger. Plus, it’s not something that many other columnists could consider.
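
    One hedged way to do that with randomly generated data: simulate two groups with no true differences on a handful of outcomes, test each at alpha = 0.05, and count how often at least one comparison comes out "significant". The sample sizes and number of outcomes below are assumptions for illustration, not the study's actual design.

```python
# Simulating the "machine for producing random patterns": two groups with no true
# differences, ten outcomes each tested at alpha = 0.05. Sample size and number of
# outcomes are assumptions for illustration, not the study's design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_outcomes, n_sims, alpha = 62, 10, 2_000, 0.05

studies_with_a_finding = 0
for _ in range(n_sims):
    a = rng.normal(size=(n_per_group, n_outcomes))  # pure noise, group A
    b = rng.normal(size=(n_per_group, n_outcomes))  # pure noise, group B
    pvals = stats.ttest_ind(a, b, axis=0).pvalue
    if (pvals < alpha).any():
        studies_with_a_finding += 1

print("Share of null studies with at least one 'significant' outcome:",
      studies_with_a_finding / n_sims)
```

    With ten independent outcomes the share is roughly 1 - 0.95^10 ≈ 0.40, which is the "machine for producing random patterns" in miniature.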

  5. This underscores one of my pet peeves: that the magazine put out by ASA and RSS is called “Significance.” Fisher did a lot of nice things, but appropriating this word was not one of them.

  6. Andrew,
    I’m a big fan of your blog. A few thoughts on your critique:

    1) Does the sample need to be representative? Perhaps the online participants wear red 10% of the time and college students 50% of the time, but if each group experiences a similar increase in wearing red around the time of ovulation then it seems to support the hypothesis, despite any lack of representativeness.

    2) I think your “clincher point” downplays the role of theory. If the researchers had no theory to back up why wearing red would be associated with ovulation, then it would seem that they were indeed mining the data for any possible finding. From an evolutionary perspective, it would make sense why women would wear bright, vibrant colors close to ovulation. What if the researchers had found that women are more likely to wear frumpy, gray colors close to ovulation? Even though this would be a finding based on researchers’ degrees of freedom, it seems unlikely that any journal would publish it because there is no theory to support why it is relevant.

    3) Finally, because of its inherent limitations, social science research needs to be replicated numerous times before it is viewed more concretely. I think you are a little too quick to shoot the article down, when in reality it should be seen as a first step (perhaps a small one considering the methods). The media will say that women WILL wear red closer to ovulation, but of course it is more accurate to say that research SUGGESTS that women MAY wear red closer to ovulation. If future studies back up this finding, then perhaps we can adopt more definitive language. Sadly, if researchers don’t find this, journals will probably be less likely to publish it as a non-finding.

    • Phil:

      I agree with your first point. Testing a population level theory on a convenience or purposive sample can be informative for the population. This is the model in behavioral economics. It has internal validity but I would not trust the sample estimates as estimates of population parameters (causal magnitude). However, the evidence may increase my belief that causal direction will replicate over the population on average (causal direction).

      I completely disagree with your second point. If you carry out a rigorous experiment, register hypotheses in advance, and it turns out women dress in black on ovulation day then that is super informative. Presumably your theory is wrong. Instead you are asking the following: Can I torture the data enough to make it confess my prior? With enough degrees of freedom data _always_ confess so the procedure is 100% uninformative. (It might be somewhat informative if we could not get the data to confess no matter how much we tortured it. Indeed, the amount of torturing can be informative but is seldom reported)

      On your third point, the problem is replication does not take place, and is typically left unpublished. Instead it appears there’s a lot of “first steps”, data torturing, and media confessions going on. Good for careers, not so good for science.

      • “Good for careers, not so good for science.”

        This always makes me wonder about the scientists with these kinds of careers. I mean, do they not see the possible problems with publishing less rigorous studies, or what? If this is “normal” I wonder what that says about today’s scientists, the people hiring them, journals publishing this stuff, or maybe even the entire system?

        It just amazes me: why even bother with all of this, what’s the point? You might as well just make up a story (or data for that matter) and send it to a glossy magazine, maybe that would have the same “scientific” value. You can call your story “groundbreaking” and the editor of the magazine can say things like “our glossy magazines are self-correcting in the end”, so then it doesn’t matter to have some higher standards.

        I am thinking what I would think if I were a scientist doing things like that, or even being a part of a system like that. I don’t think I could say stuff like: “well, everybody does it so I have to” or “well, there is nothing I can do. I just have to follow the rules to keep a job”. If I were a scientist, I would cringe at the thought that I would “put something out there” which could be simply useless or false. I mean, what’s the point then even? Why then even write a fancy article? Why not just write a book or go and work at a grocery store?

        I think if I were a scientist, I would try and do everything I could to make sure that the information I put in my article would make sense and have some value, or else I would have a hard time with my conscience. I wouldn’t be able to handle the idea that other scientists could waste money and effort by building on my work, or using it in some way or form, when this information was not optimally gained and therefore possibly useless.

        Now, this particular article is just social psychology, but if I were a scientist doing cancer-related research for instance, I would have an even harder time doing things less than optimally. I would hate to imagine that this stuff (“career scientists”, “groundbreaking research”, “top tier journals”, etc.) also applies to institutions, journals, and scientists involved in, I don’t know, cancer-related research, or research into psychological disorders or important things like that. I would expect that those journals, institutions, and scientists would have a higher standard and a better system, and/or rules concerning that type of research.

        • One million or so journal articles published in health each year. We still don’t know why people get fat. Why?

          My hypothesis is that running a prospective experiment is expensive, takes a long time, and presumably there is no profitable drug at the end of the line. Perhaps only behavior change and prevention (unless you come up with Atkins-type meals).

          So what do you do as an academic who has to publish 10 papers a year (hence the millions mentioned above)? You take observational data from the nurses study and run regressions. One after the other. Until you get the headline “A spoonful of flax seed a day reduces weight”.

          The data is not there to test but to back you up.

          PS Ok I’m being cynical but not too much.

        • “You can call your story “groundbreaking” and the editor of the magazine can say things like “our glossy magazines are self-correcting in the end”, so then it doesn’t matter to have some higher standards.”

          That’s the beauty of science !! These types of arguments can always be used, for nearly anything!! It’s really great once you get the hang of using it. Here are two great ones, you could use for nearly everything:

          1. “Higher standards will hinder exploratory research (or use a fancy word like “groundbreaking”)” & “Higher standards will hinder science evolving”

          (I always have a hard time grasping this. Why, and how then? If I were a scientist, I could have a nice test-study and "explore" as much as I want to. I could then re-do the experiment more rigorously (using pre-registration, more people, etc.) and publish my "exploratory" study and be all "groundbreaking" and "help evolve science" and stuff like that, or am I wrong? I mean, why not then lower standards (use p < .20 as "significant" or use a maximum of 10 people, etc.), because that way you could publish even more potentially "groundbreaking" and "exploratory" studies and help "evolve science". Come to think of it: why even gather any data and do any statistical analyses? Just write a "groundbreaking" story with only hypotheses. Done!! This will surely help "exploratory" research, and evolve science. In the end science will be self-correcting of course, so it really doesn't matter)

          2. "Higher standards will not 100% solve problem X, so why change" & "Nothing is perfect, so why change"

          (Also have a hard time grasping these types of arguments. It seems to me that nothing will solve everything, but shouldn't it be that scientists would try and BETTER things, or at least investigate this using logical argumentation and facts? So then they would/could look at things like "higher standards would/could probably help solve problem X better than the way things are going now", or am I wrong?)

        • Done: Enjoyed your comment.

          But as Queen once nicely put it, “doesn’t really matter, any way the noise goes, to them”.

          I have provided a link to something serious I did some while ago in another comment here.

        • People choose careers for different reasons than they stick with them, especially with mortgages, kids to feed, and the desire to keep health care. In academia, those with less than the targeted number of yearly publications do get pushed out (if untenured) or pushed aside.

          Also, our brains have evolved remarkably for self-deception (recall the “doing your job depends on not understanding certain things” comment). And the issues discussed in this post are not yet as widely grasped as they could be.

          As for cancer researchers, I was involved in some fairly well thought out prospective research (Jeremy Grimshaw’s group) and it appeared that they are so highly motivated, even desperate one might say, to find things that are truly helpful that they seem to overcall evidence (discount uncertainties) and the same trap of accepting noise as real is set (by the good intentions).

    • Phil:

      1. I’d be happy for a study such as this to be published in a journal of Speculations in Psychological Science with a disclaimer that for statistical reasons we should not expect these findings to replicate. But I’m not so happy for it to be published and presented as scientific fact in a leading journal.

      2. Given that the researchers got the dates of max fertility wrong, I don’t think the theory counts for much. Indeed, the fact that they found a pattern that fit their theory—even while getting their theory wrong—indicates the power of researcher degrees of freedom, the ability to find patterns in data to confirm one’s preconceptions.

    • I would think that theory would PREDICT frumpy gray colors, since humans, I thought, have evolved concealed ovulation from ancestors that advertised their fertility with beautiful swollen genitals. But probably that is the point. It is really, really easy to come up with a plausible model after finding any association.

  7. One penguin turns to the second and asks, “Are you wearing a tuxedo?”

    The second penguin says, “How do you know I’m not?”

  8. I think the biggest problem with the fertility window stuff is that it really is HIGHLY variable, so days 6-14 are basically just as probable for ovulation as days 14-21 or so.

    Examples from a quick Google search: http://www.ncbi.nlm.nih.gov/pubmed/11082086

    The broadness of the fertility window makes it pretty difficult to figure out anything without actually controlling for ovulation using some kind of method (such as body temperature measurements or hormone level measurements or something).

  9. I’m surprised that the journal “Medical Hypotheses” hasn’t been mentioned – maybe a social psychology version would be a good place for these types of headline-grabbing studies that almost certainly can’t be replicated.

    [Medical Hypotheses used to not even be peer reviewed; papers were accepted by the editorial board. Apparently that has now changed. Also, it’s an Elsevier journal. I’ve found it to be a great source of papers to use in class — the weird and wonderful subject matter gets the students’ interest, and helps keep it while you go through the statistical analysis of the data presented.]

Comments are closed.