PPNAS: How does it happen? And happen? And happen? And happen?

In the comment thread to today’s post on journalists who take PPNAS papers at face value, Mark asked, in response to various flaws pointed out in one of these papers:

How can the authors (and the reviewers and the editor) not be aware of something so elementary?

My reply:

Regarding the authors, see here. Statistics is hard. Multiple regression is hard. Figuring out the appropriate denominator is hard. These errors aren’t so elementary.

Regarding the reviewers, see here. The problem with peer review is that the reviewers are peers of the authors and can easily be subject to the same biases and the same blind spots.

Regarding the editor: it doesn’t help that she has the exalted title of Member of the National Academy of Sciences. With a title like that, it’s natural that she thinks she knows what she’s doing. What could she possibly learn from a bunch of blog commenters, a bunch of people who are so ignorant that they don’t even believe in himmicanes, power pose, and ESP?

P.S. Let me clarify. I don’t expect or demand that PPNAS publish only wonderful papers, or only good papers, or only correct papers. Mistakes will happen. The publication process is not perfect. But I would like for them to feel bad about publishing bad papers. What really bothers me is that, when a bad paper is published, they just move on, there seems to be no accountability, no regret. They act as if the publication process retrospectively assigns validity to the work (except in very extreme circumstances of fraud, etc.) I’m bothered by the PPNAS publish-and-never-look-back attitude for the same reason I’m bothered that New York Times columnist David Brooks refuses to admit his errors.

Making mistakes is fine, it’s the cost of doing business. Not admitting your mistakes, that’s a problem.

35 thoughts on “PPNAS: How does it happen? And happen? And happen? And happen?”

  1. I feel your pain today Andrew. Let’s see if the “major airline” implements any new policies motivated by today’s revelations. My guess is not. The paper won’t fix air rage, it will only induce blog rage.

    My regression class just finished today. I’m conflicted as to what to tell them. Something like “Please promise me you’ll never use what we learned to write a paper like this…..unless it’s important to you to get a prestigious publication.” It’s kind of useful to them that it happens, and happens, and happens. It makes a market.

    • Eric:

      I wonder about the airlines. Airline executives read the newspaper too and they really don’t want air rage. So I could see them considering a change in boarding policy as a relatively inexpensive solution.

      There’s also the indirect route, by which Malcolm Gladwell or Jeff Jarvis or some other business guru puts this piece of inspiring advice in a bestselling book, and then the executives take it from there.

      Maybe not this particular study, as anyone who googles it might end up seeing this series of posts—but some other Psych Sci or PPNAS paper that we didn’t bother to discuss on this blog.

        • Rahul:

          I don’t want the airlines to change policy based on this paper. I’m just saying I could see how they might do so. I’m pushing back against Eric’s assumption that these silly studies have no real-world consequences. They might; I have no idea. I don’t know how decisions get made at these companies.

        • +1

          Crappy p-value-based results may fool journalists, but where there’s real money at stake the decision makers in industry can be incredibly astute & not so easy to fool (MBAs excepted).

  2. >>>Making mistakes is fine, it’s the cost of doing business. <<<

    If you focus on quantity but devote marginal resources to Quality Control you will keep making lots of mistakes.

    The big problem with current publication processes is that we have consciously prioritized quantity over quality.

    Some mistakes may be inevitable, but we should not automatically write off the current high rate of mistakes as just the default "cost of doing business".

    • Rahul:

      I think part of it is that the expectations are too high. It’s not enough to collect some cool data and suggest some hypotheses; the convention is that something has to be proved (that is, p less than .05, or .001, or whatever your standard is). Consider the division of labor: maybe whoever it is who has the organizational ability to get a dataset of air rage incidents isn’t a person with good statistical judgment. Too bad the researcher who gets the data can’t just publish the data and then allow others to take the next step. But a paper with no strong claims, that could never be published by PPNAS. It would have to go into some obscure journal with a title like Air Transportation Studies, and nobody (including us) would ever hear about it. So, sure, yeah, this paper should’ve been rejected. But I see the big problem being that this is the kind of paper that journals are looking for, and that media organizations salivate over. (And the news media are part of the game; it’s pretty clear that the people running these journals really do value the publicity.)

      • But what you call “obscure journals” are exactly where such papers should be published. What’s wrong with publishing an analysis of Air Rage in an Airline Industry Journal?

        The industry insiders & decision makers are more likely to read it there than in PNAS.

        But the fundamental problem is that DeCelles / Norton / Fiske hardly care about solving the specific problem. What they want is fast-track academic career progression. For that, a PNAS paper, NPR publicity & NYT coverage all matter hugely; an industry journal publication that only industry insiders will read and act on does not.

        • Let’s formulate the issue another way.

          Getting the best jobs requires publishing in high-status journals like PNAS. It might be fine, as far as the paper’s impact goes, for these kinds of papers to be published in Air Transportation Studies or whatever, but researchers are heavily incentivized to publish in a high-status journal.

          You might think that makes sense. If you want the best jobs you need to produce the best quality work that deserves to be published in PNAS. This, however, is where the overly high expectations come in.

          The difference between really high-caliber work that will be accepted by a journal like PNAS and high-caliber work that will only be published in Air Transportation Studies is almost entirely a matter of luck. You can be just as good a researcher, with just as plausible a hunch, and find (without playing dirty) that the data are merely suggestive of what you believe rather than conclusive in the way that would get a paper published in PNAS. So of course researchers have a crazy strong incentive to tweak things until they get this kind of publication.

          In other words the overly high expectations ultimately come from hiring committees who only want to hire researchers who successfully show a strong effect rather than researchers who show equally strong judgement and research ability but don’t happen to get a big result.

  1. I was involved in an effort to get PNAS to at least issue a correction for the Fredrickson et al. “Different forms of well-being are associated with different gene expression” paper. They sent it to a statistical consultant who reported back to them that s/he “really didn’t know what to make” of the fact that we had demonstrated that any possible combination of the 14 psychometric data items into two factors gave similar results to the originals.

    My conclusion is that PNAS is not actually interested in whether what they publish is correct. This is somewhat worrying.
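
    As a purely schematic illustration of the kind of check being described (simulated data and made-up variable names only; this is not the Fredrickson et al. gene-expression analysis), one can split 14 questionnaire items into two composites in many different ways, refit the same regression each time, and see whether the answer depends on the particular split at all:

```python
# Schematic sketch only: simulated data and hypothetical variable names, not
# the Fredrickson et al. dataset or their actual regression model.
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 200, 14

items = rng.normal(size=(n_people, n_items))              # simulated item scores
outcome = items.sum(axis=1) + rng.normal(0, 2, n_people)  # simulated outcome

coefs = []
for _ in range(1000):                      # random sample of the possible splits
    mask = rng.random(n_items) < 0.5
    if mask.all() or not mask.any():
        continue                           # need two non-empty composites
    f1 = items[:, mask].sum(axis=1)
    f2 = items[:, ~mask].sum(axis=1)
    X = np.column_stack([np.ones(n_people), f1, f2])
    coefs.append(np.linalg.lstsq(X, outcome, rcond=None)[0][1:])

coefs = np.array(coefs)
print("spread of the two composite coefficients across splits:", coefs.std(axis=0))
```

    If the spread is small, the particular two-factor split used in the original paper is doing essentially no work, which was the substance of the critique.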

    • > s/he “really didn’t know what to make” of the fact that we had demonstrated that any possible combination of the 14 psychometric data items into two factors gave similar results to the originals.

      Nick: That does seem like a failure of the statistical consultant, not PNAS – PNAS hired an expert statistician, and the statistician confirmed that the benefit of the doubt should be given to the paper, as they really didn’t know what to make of the technical criticism.

      Maybe I am not getting what you are reporting on, but it seems a failure of the statistician to understand/communicate what PNAS needs to _hear_.

        • It happens often in many contexts, as does hiring experts that give less than competent advice.

          It would be nice to know the percentages for the relevant set of journals from various fields seeking statistical consultation.

          (My very first paper was held up for over a year as the journal sought expert statistical advice – http://www.ncbi.nlm.nih.gov/pubmed/3300460 – allowing our _competitors_ to publish first.)

  4. May I respectfully say that one thing that you should at least consider, Andrew, is that you may be a tiny bit wrong occasionally.

    I’ve been reading your blog for years and am generally a fan. I am also completely on board that there are major problems with many of the papers that are coming out in some of these top journals due to garden of forking paths, etc.

    However, your rhetoric seems over the top to me. First, the majority of papers coming out in Science, Nature, PPNAS, etc. are actually quite good. You seem to be sensationalizing/overgeneralizing a bit in many of your posts. Second, even your “take-downs” of certain papers are at times a bit lacking. For example, you spent 75% of your time in the most recent discussion of the airline paper (e.g. your first 3 bullet points) complaining about the reporting of too many decimal places – annoying I know, but hardly damning.

    It may be nice to point out that at the very least, this paper is trying to do something with more than N=20 or something.

    As an applied economist who gets frustrated with this stuff too, I think what you are doing is generally good. But, a bit more humility on your part I think will make you more relevant, not less.

    • Jack:

      You write that I should consider that I “may be a tiny bit wrong occasionally.” Of course—I’m wrong all the time, and not just a tiny bit! Twice I issued published corrections of my papers because they were so wrong that their main conclusions were invalidated. I proved a false theorem once! And the other time I’d reverse-coded a variable, making the rest of the paper meaningless.

      You write, “the majority of papers coming out in Science, Nature, PPNAS, etc. are actually quite good.” That could well be—I have no idea and have never claimed anything about the majority of papers in these journals. I have the impression that most of the papers published there are biology papers, and I don’t know much biology.

      You write that the poor data display of the air-rage paper (too many decimal places) is “annoying . . . but hardly damning.” That’s fine, I see no requirement to only or even mostly write “damning” things. My goal here, and elsewhere, is not to “damn” but to understand. In this case I find it relevant because the silly numbers (including the estimate that’s 1 s.e. from 0 but is reported as statistically significant) indicate some innumeracy on the part of the paper’s authors and reviewers.

      You write that I should point out that N is greater than 20. I actually did report N in my post. I guess I could’ve also stated that they are “trying to do something” but I thought that was obvious.

      Finally, you’d like to see more humility. I don’t really see the point in appending to each post the phrase, “I may well be wrong, I’ve been wrong before.” But if you’d like you can mentally do this appending yourself, as it applies to just about everything that I, and everyone else, writes.

      • “In this case I find it relevant because the silly numbers (including the estimate that’s 1 s.e. from 0 but is reported as statistically significant) indicate some innumeracy on the part of the paper’s authors and reviewers.”

        People say I’m nitpicky when I complain about this sort of thing, but I think it’s a huge indicator of how trustworthy a source is. When I’m reading something written by a non-expert, I need a way of quickly determining if that person is capable of taking information from an expert and passing it on with a high degree of fidelity. Failing to understand significant figures is a major red flag.
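
        The arithmetic behind that red flag is easy to check for yourself. A minimal sketch with hypothetical numbers (not the values from the published table): a coefficient that sits one standard error from zero implies a two-sided p-value of about 0.32, nowhere near conventional significance.

```python
# Hypothetical numbers, not the paper's actual estimates: check whether a
# reported coefficient / standard-error pair is consistent with a claim of
# statistical significance.
from scipy import stats

estimate = 0.71   # hypothetical reported coefficient
std_err = 0.71    # hypothetical reported standard error (estimate is 1 s.e. from 0)

z = estimate / std_err
p_two_sided = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.2f}")   # z = 1.00, p = 0.32
```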

      • Reporting non-significant results as significant, with obviously wrong p-values, is a very common occurrence. At some point wishful thinking just takes over and we turn to the random number generator.

        • Shravan:

          Well put. Along with this, I think the authors of these papers feel that they know ahead of time that their hypotheses are true, so it’s just a matter of gathering the relevant evidence.

        • Yes, I once reviewed a 1/2 million Euro funding proposal in which the PI wanted to *determine* that X is true. In the particular problem they were working on, a judicious choice of experimental items would pretty much guarantee that X would come out as true. This is in psychology. I think that people don’t even stop to think about what’s driving them; it’s the desire to confirm what they already believe. Maybe this is what the phrase “confirmatory hypothesis testing” actually means?

          One frustrating thing about made-up p-values is that people don’t take these kinds of mistakes seriously (perhaps OK if you are just being a cynic and know that under the null the p-values have a uniform distribution, so any number between 0 and 1 is fine). I don’t think any paper is going to face consequences for doing this kind of thing, let alone be retracted. It’s just too small an error, not like the gay-attitudes study. But if you start making small mistakes like this, how many other mistakes happened? Sloppiness in one thing implies sloppiness in other (possibly bigger) things.
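
          For anyone who has not seen it, the parenthetical point about the null distribution of p-values is easy to demonstrate by simulation. This is a generic sketch, unrelated to any particular paper:

```python
# Generic simulation: when the null hypothesis is true, p-values from a valid
# test are uniformly distributed on (0, 1), so any single made-up value looks
# "plausible" in isolation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 30

pvals = np.empty(n_sims)
for i in range(n_sims):
    a = rng.normal(size=n_per_group)   # both groups drawn from the same
    b = rng.normal(size=n_per_group)   # distribution, so the null is true
    pvals[i] = stats.ttest_ind(a, b).pvalue

# If the p-values are uniform, roughly 10% should fall in each decile.
print(np.histogram(pvals, bins=10, range=(0, 1))[0] / n_sims)
```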

    • The paper may have been trying to do something with N>20 but surely we can expect someone who uses regression methods to be familiar with suppressor effects.
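
      Since suppressor effects come up repeatedly in this thread, here is a minimal simulated illustration of what one looks like (toy data only, not a reconstruction of the air-rage regression): a predictor that is essentially uncorrelated with the outcome can still change another predictor’s coefficient dramatically by soaking up its measurement noise.

```python
# Toy simulation of a suppressor effect; no connection to the published model.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

signal = rng.normal(size=n)
noise = rng.normal(size=n)

x1 = signal + noise                      # x1 measures the signal with error
x2 = noise                               # x2 measures only the error component
y = signal + 0.1 * rng.normal(size=n)    # outcome depends on the signal alone

def ols(y, predictors):
    """Return intercept and slopes from an ordinary least-squares fit."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("corr(x2, y):", round(np.corrcoef(x2, y)[0, 1], 3))   # roughly 0
print("y ~ x1:     ", np.round(ols(y, [x1]), 2))            # x1 slope around 0.5
print("y ~ x1 + x2:", np.round(ols(y, [x1, x2]), 2))        # x1 slope ~1, x2 slope ~ -1
```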

    • Jack, you wrote, “the majority of papers coming out in Science, Nature, PPNAS, etc. are actually quite good.”

      I think it is difficult to hold this position in light of analyses such as

      http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114255

      at least for multi-experiment studies in Science, 83 percent of the articles seem “too good to be true”. Maybe the rest of the articles in Science (on different topics) pull the quality back up; but at least for psychology, Science often publishes pretty poor science. This kind of analysis has not been done for Nature or PPNAS, but my impression is that they have similar problems. (I may well be wrong, I’ve been wrong before.)

  5. Two points:
    1. While court cases are FAR from perfect, they are superior to the publication process in my mind. There is cross examination – you can get an expert to say anything, but they have to defend it in court in the face of cross examination and other “expert” testimony.
    2. People keep referring to problems created by emphasizing quantity over quality, reputation building, etc. The root of all these problems is that academics reward publications and steer clear of evaluating the quality of the work. Why? Because they lack the integrity and personality to say what they really think, because most academics are really quite insecure, because the litigious and accountable society we live in makes it risky to engage in anything that might be called “subjective,” etc. We can fault all sorts of people and institutions, but in the end, we need to be willing to say what we really think about people’s research and use this in hiring, promotion, and evaluation. For sobering insights into how our education system contributes to these dysfunctions, I found
    Disciplined Minds: A Critical Look at Salaried Professionals and the Soul-battering System That Shapes Their Lives, by Jeff Schmidt (2001),
    to be thought-provoking (though I certainly did not agree with all of it, the book does cut through a lot of how the system self-perpetuates these bad habits).

  6. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2769645
    Here is a study out of the UC Berkeley law journal that hits all the buttons.
    – Narrative justification that hits all the political buttons (via Glenn Greenwald)
    – Fun with Mechanical Turk
    – Single discontinuity in Poisson rates for REASONS!
    – No alternative models.
    – No model testing with data where one does not expect a *significant* result.
    – No discussion of sampling or model parameter uncertainties.

  7. @#%&. That graph. It haunts me.

    “Hey, I have an idea! Let’s fit our data to some arbitrary function – or, better yet, a couple of them! Why, you ask? Because we can. And who doesn’t love a good polynomial fit?”

    I’m reminded of Eli Rabett’s observation a while back: “Data without a good model is numerical drivel. Statistical analysis without a theoretical basis is simply too unrestrained and can be bent to any will.”

    (There’s also Tukey’s observation: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”)
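
    The complaint is easy to make concrete. A small sketch with simulated numbers (no relation to the actual figure in the paper): a high-degree polynomial will happily “explain” pure noise in-sample while saying nothing about new data.

```python
# Illustration with simulated noise: an arbitrary high-degree polynomial fit
# looks impressive in-sample and generalizes to nothing.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 20)
y = rng.normal(size=20)                  # pure noise, no structure at all

coeffs = np.polyfit(x, y, deg=9)         # an "arbitrary function" of degree 9
fitted = np.polyval(coeffs, x)

in_r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2 of a degree-9 fit to noise: {in_r2:.2f}")

y_new = rng.normal(size=20)              # fresh draws from the same process
out_r2 = 1 - np.sum((y_new - fitted) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
print(f"same curve against new draws (typically negative): {out_r2:.2f}")
```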

  8. It’s possible that Andrew himself has succumbed to the journalist’s need to provide something newsworthy. If it’s true that many of these journals are producing good, solid work, and maybe most of it is good, then mostly reporting on the bad stuff is sensationalism. But then who would care if there was a blog post saying that someone wrote a carefully reasoned paper in PNAS or whatever?

    Also, there might be a bias here. I would be surprised if there weren’t journals in statistics that produce garbage papers. Either this happens almost never, or Andrew underblogs on the extent to which this happens. I do see the attraction in mocking other fields; it *is* fun. It also feeds into the academic’s innate desire to see himself (and it’s usually a himself) as superior to others.

    I do concede that Andrew’s posts have created a lot of awareness of the problems in data analysis; that is a big contribution in itself. I have also learnt a lot from all these criticisms. But I can understand the frustration of some of the people who feel Andrew goes too far sometimes.

    • Shravan:

      Lots of useless stuff gets published in statistics journals. It just doesn’t make its way into the news, get on Gladwell and NPR, become the 2nd-most-watched Ted talk of all time, etc.—so I don’t hear about it.

      • Good to know that statistics research is not immune. But I guess you do read statistics journals. It would be educational to see you highlight some of the air-rage-equivalent bad research some day. Or maybe in those cases you just write an article about it.

        I don’t read PNAS, so it would be cool to see some discussion of good statistically driven research coming out of there. If they publish daily, then it could be that only about 0.5% of their papers are air-rage bad. After all, NAS members are established scientists with remarkable achievements behind them.

        • In this case, however, the NAS member who served as the action editor on this paper had published with one of the authors and served on his dissertation committee. She also appears to have remarkably poor statistical training if she is unfamiliar with suppressor effects in regression analysis and believes that you can use a t-test to compare mean scores on two different variables, as she appears to do in this paper (written with Amy Cuddy): http://onlinelibrary.wiley.com/doi/10.1111/j.0022-4537.2004.00381.x/abstract

        • Wow (re t-test); you do mean comparing ratings on women being competent vs warm, right? Also, in the paper, sex and parental status are *manipulated* independent variables.

          Seeing an action editor who is close to the author is not anything new to me. I have seen papers in which the AE worked down the hall from the first author. It’s probably a conflict of interest; journals need to take care of that. Although I must say that I’d rather be reviewed or action-edited by someone I have a good relationship with than by someone who either actually hates me (for ideological or other reasons) or doesn’t know me at all. I’ve noticed that the last two categories tend to take out their latent aggression disproportionately in the review process. So neither knowing your AE nor not knowing your AE is good.

          The paper you mention has our good friend Amy Cuddy as first author, and I assume that she’s the one who doesn’t understand what she’s doing with the t-test. Somehow I doubt very much that Fiske went through the analyses in any detail. The claims are plausible even without any data gathering, so one could easily read the intro and discussion sections as a co-author (as many co-authors apparently do; remember the fake gay study?).

          Do Princeton and Harvard have statistical consultancy sections? Do they teach statistics to non-statisticians over there?

        • @Shravan. Yes – that is what I mean. I am stunned that someone with a PhD would make such an elementary error. I’d like to think that none of my undergraduates would do that, and it is odd that the paper has been cited around 500 times without the issue ever having been identified. I only read it because Fiske and Cuddy are both on it and I was wondering if their strange habits were manifest at an earlier date.

        • Mark:

          Yes, that is a bad paper. But I think you may be overrating your undergraduates. My impression is that students usually wouldn’t make these mistakes, not because the students know better, but because they don’t do such elaborate analyses. Student projects tend to be more focused; it’s a rare student who would have the creativity (if you want to call it that) to perform so many different analyses and compare so many different p-values. What happened here is that Cuddy, Fiske, and Glick seemed to have given themselves enough rope to do all kinds of weird things, a sort of parody of the garden of forking paths where they walk all over the place. The abstract alone reveals the sort of deterministic thinking that is such a disaster in statistical studies.

          Anyway, I don’t think that Cuddy, Fiske, and Glick are unusually incompetent empirical researchers. From the evidence here, they seem to suffer from standard-grade statistical incompetence, it’s just that we hear a lot about them because Cuddy is such a successful public speaker and Fiske is a successful academic and uses her perch at PPNAS to publish lots of that sort of stuff.

        • Actually, I wonder if there is any way at all to compare two sets of between-subject measurements that measure two different things. If you are getting ratings on being competent and ratings on being warm, on a Likert scale, and want to compare them using a t-test, just plugging the raw scores into the t.test function is not going to be valid. Would putting them on the same scale (say, ranging from 0 to 1) and standardizing them so that the variance is 1, say, be any help?

          I didn’t read the paper carefully, so I don’t know if these measures are within subject (but they are likely to be). In that case it’s just a multivariate analysis problem, analogous to measuring vase heights and widest points. One could certainly study the relationship between these two dependent measures.
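
          As a hedged sketch of that point (simulated ratings with made-up means, not a reanalysis of the Cuddy/Fiske/Glick data): if the two ratings really are within-subject and on the same response scale, a paired comparison at least respects the dependence between them, whereas treating them as two independent samples does not. Whether comparing means on two different constructs is meaningful at all is a separate question.

```python
# Simulated within-subject "competence" and "warmth" ratings on a 1-7 scale;
# made-up numbers, not data from the paper under discussion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects = 80

# Each subject rates the same target on both dimensions, with a subject-level
# tendency that induces correlation between the two ratings.
subject_effect = rng.normal(0, 0.8, size=n_subjects)
competence = np.clip(np.round(4.5 + subject_effect + rng.normal(0, 1, n_subjects)), 1, 7)
warmth     = np.clip(np.round(3.8 + subject_effect + rng.normal(0, 1, n_subjects)), 1, 7)

paired = stats.ttest_rel(competence, warmth)    # respects the pairing
naive  = stats.ttest_ind(competence, warmth)    # treats them as independent samples

print("paired:      t = %.2f, p = %.4f" % (paired.statistic, paired.pvalue))
print("independent: t = %.2f, p = %.4f" % (naive.statistic, naive.pvalue))
```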

        • Mark:

          I don’t know any of the people involved but my guess is that the editor (a) knew and liked the authors of the paper, and (b) found the conclusions of the paper to be congenial.
