Questions about “Too Good to Be True”

Greg Won writes:

I manage a team tasked with, among other things, analyzing data on Air Traffic operations to identify factors that may be associated with elevated risk. I think it’s fair to characterize our work as “data mining” (e.g., using rule induction, Bayesian, and statistical methods).

One of my colleagues sent me a link to your recent article “Too Good to Be True” (Slate, July 24). Obviously, as my friend has pointed out, your article raises questions about the validity of what I’m doing.

A few thoughts/questions:

(1) I agree with your overall point, but I’m having trouble understanding the specific complaint with the “red/pink” study. In their case, if I’m understanding the authors’ rebuttal, they were not asking “what color is associated with fertility” and then mining the data to find a color…any color…which seemed to have a statistical association. They started by asking “is red/pink associated with fertility”, no? In which case, I think the point they’re making seems fair?

(2) But, your argument definitely applies to the kind of work I’m doing. In my case, I’m asking an open-ended question: “Are there any relationships?” Well, of course, you would say, the odds are that you must find relationships…even if they are not really there.

(3) So let’s take a couple of examples. There are thousands of economists building models to explain some economic phenomenon. All of these models are based on the same underlying data: the U.S. National Income and Product Accounts. Tens of thousands of models get built, and only a handful are publication-worthy. So, by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?

(4) Another example: one of the things that we have uncovered is that, in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?

(5) A caveat: In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage (because we don’t have the resources to address every conceivable hazard in the airspace system). Perhaps fortunately, most of the people I deal with (primarily pilots and air traffic controllers) don’t understand statistics. So, the safety case we build must be based on more than just a mechanical analysis of the data.

My reply:

(1) Whether or not the authors of the study were “mining the data,” I think their analysis was contingent on the data. They had many data-analytic choices, including rules for which cases to include or exclude and which comparisons to make, as well as what colors to study. Their protocol and analysis were not pre-registered. The point is that, even though they did an analysis that was consistent with their general research hypothesis, there are many degrees of freedom in the specifics, and these specifics can well be chosen in light of the data.

This topic is really worth an article of its own . . . and, indeed, Eric Loken and I have written that article! So, instead of replying in detail in this post, I’ll point you toward The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

(2) You write, “the odds are that you must find relationships . . . even if they are not really there.” I think the relationships are there but that they are typically small, and they exist in the context of high levels of variation. So the issue isn’t so much that you’re finding things that aren’t there, but rather that, if you’re not careful, you’ll think you’re finding large and consistent effects, when what’s really there are small effects of varying direction.

(3) You ask, “by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?” My response: No, I don’t think that framing statistical statements as “true” or “false” is the most helpful way to look at things. I think it’s fine for lots of people to analyze the same dataset. And, for that matter, I think it’s fine for people to use various different statistical methods. But methods have assumptions attached to them. If you’re using a Bayesian approach, it’s only fair to criticize your methods if the probability distributions don’t seem to make sense. And if you’re using p-values, then you need to consider the reference distribution over which the long-run averaging is taking place.

(4) You write: “in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?” My response is, first, I’d like to see all the comparisons that you might be making with these data. If you found one interesting pattern, there might well be others, and I wouldn’t want you to limit your conclusions to just whatever happened to be statistically significant. Second, your finding seems plausible to me, but I’d guess that the long-run difference will probably be lower than what you found in your initial estimate, as there is typically a selection process by which larger differences are more likely to be noticed. (The simulation sketch at the end of this reply illustrates this selection effect.)

(5) Your triage makes some sense. Also let me emphasize that it’s not generally appropriate to wait on statistical significance before making decisions.
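
To make the selection effect in point (4) concrete, here is a minimal simulation sketch. The numbers are made up purely for illustration (a small true difference measured with a large standard error); nothing below is taken from the runway-incursion data.

    # Hypothetical numbers only: a modest true effect estimated with lots of noise,
    # looking at what happens when we focus on the estimates that reach p < 0.05.
    import numpy as np

    rng = np.random.default_rng(1)

    true_effect = 1.0     # hypothetical true difference
    se = 2.0              # standard error of each study's estimate (noisy data)
    n_sims = 100_000      # many hypothetical replications of the same study

    estimates = rng.normal(true_effect, se, size=n_sims)
    significant = np.abs(estimates / se) > 1.96   # the "p < 0.05" filter

    print("true effect:                  ", true_effect)
    print("mean of all estimates:        ", round(estimates.mean(), 2))
    print("mean of significant estimates:", round(estimates[significant].mean(), 2))
    print("share reaching significance:  ", round(significant.mean(), 3))

Only a small fraction of the replications reach significance, and those that do overstate the true difference several-fold, which is the sense in which an initial statistically significant estimate will tend to be larger than the long-run difference.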

25 thoughts on “Questions about ‘Too Good to Be True’”

    • Tom:

      Don’t forget that the recognition and visualization of uncertainty and variation are a key part of exploratory data analysis. EDA is not just about looking at the data, it’s about looking at the data in context.

  1. Andrew:

    Re your criticism #1, can you point to studies where you think that the analysis is not contingent on the data? Is pre-registration a sufficient condition to preclude this criticism?

    A necessary one? Can I / Must I apply the “garden of forking paths” criticism to every study that isn’t pre-registered?

    These are points that weren’t clear to me the last time I read that article.

    • Rahul:

      There are occasional preregistered studies where the analysis is not contingent on the data, but I agree this is rare. I’ve never done anything like that myself. I think that it’s best to recognize that our analyses are contingent and to move toward methods of data analysis that use more information. I like hierarchical models, but there are other approaches that could work too. For what I have in mind, you can take a look at the analyses in my books and research articles. The key is that I’m trying to do my best job at estimation in the presence of variation; I’m not trying to learn by statistically-significantly rejecting null hypotheses.

      • So, if I understand what you wrote, almost all journal studies (non-pre-registered?) can be criticized on the basis of your “garden of forking paths” criticism? If so, thanks for clarifying.

        To me then, that’s just an awfully broad, non-specific criticism of every study ever published that uses p-values. True? In other words, you’d have criticized the Beall-Tracy Red-Pink study in any case, so long as they used traditional significance testing?

        So for now, we can apply the garden-of-forking-paths critique selectively, to studies whose conclusions we do not like?

        I’m all for your alternative approaches, hierarchical models, etc., but I’m still lost about this forking paths business. To me it seems like a vague exercise, highly subjective and lacking any sort of quantitative or well-defined yardstick to decide whether a particular author/study has been guilty of committing this garden-of-forking-paths blunder.

      • Rahul:

        The forking paths thing is not a criticism of a study, it’s just a description of what happens. The criticism is of the interpretation of p-values computed in that way. A p-value is particularly vulnerable to the “forking paths” criticism because it (the p-value) is explicitly a statement about what would have happened had the data been different. Other statistical analyses are less affected by forking paths.

      • Rahul:

        P.S. I agree that my criticisms have a subjective element. Research is subjective and so is criticism, that’s just the way it is. I do not claim to have a machine that reads papers and spits out criticism.

        Regarding your statement, “you’d have criticized the Beall-Tracy Red-Pink study in any case, so long as they used traditional significance testing,” you have to define the counterfactual more clearly. What is “in any case” here? If they’d analyzed all their data directly and not claimed to have gained statistically significant general knowledge, then I would have had a lot less to criticize. I could still have criticized the study for having noisy and biased measurements, though. On the other hand, if they’d presented their work as simply being exploratory, the paper wouldn’t have appeared in Psychological Science, it wouldn’t have received all the press attention, and I wouldn’t have heard about it. This is the hype paradox that we discussed recently.

        In short: What happened is that some researchers got a potential career boost and lots of publicity based on a noisy study. Loken and I pointed out that the study is noisy and can in no way support the conclusions they made. The forking paths were how Beall and Tracy attained statistical significance. Without forking paths, statistical significance doesn’t come easy. But it would still be a noisy study, just a noisy study not accompanied by strong scientific claims.

        • Andrew:

          I think it is unfair to characterize my position as expecting a machine that spits out criticism.

          No doubt subjectivity creeps in, but in general I think we try to minimize subjectivity, ad hoc judgments, and vague pronouncements in the scientific enterprise. I don’t see any concrete guidelines about where to apply your forking paths criticism, and that’s what I’m pointing out.

          Your criticisms of noisy data or biased measurements are far more defensible. OTOH, at least those are criticisms that apply to specific studies, whereas your previous comment seems to suggest that forking paths isn’t even a specific criticism of any particular study, so it becomes hard to even discuss what might be a valid defense against such a criticism.

          The take-home message for me seems to be this: if you want to avoid being accused of committing the forking-paths blunder, (a) if you must use p-values, then pre-register your study in great detail, or (b) don’t use p-values, and you can walk the path unmolested, at least by the forking-paths monster.

        • PS. I entirely agree with you that the Beall and Tracy conclusions seem crappy.

          Where I disagree is with the forking-paths criticism as the reason why. That bit seems like a post-hoc rationalization. There are tons of respectable studies that use p-values and could have been criticized for exactly the same forking-paths reason, yet we don’t criticize them. True, they use less noisy measurements, better study designs, higher-powered studies, etc., and reach conclusions we agree with.

          This causes me to think that we selectively apply this forking-paths criticism to studies whose conclusions we don’t like, and that to me seems a wrong way of doing things.

        • Rahul:

          The forking paths argument is necessary because otherwise the authors of this and similar papers could say: “OK, sure we have noisy measurements and sure there’s some dispute about the dates of ovulation. But we got statistical significance! You can’t argue with p<0.05.” Beall and Tracy even said something about having two p<0.05 results, which gives p<1/400, and argued that there’s no way they could’ve done 400 separate analyses. The p-value argument is a sort of external reasoning that states that the results are important, even if they violate various substantive theories (in the ESP example) or seem too large to be plausible (as in the ovulation studies). (The simulation sketch at the end of this comment gives a rough sense of how forking paths change that arithmetic.)

          So it's necessary for me, and for other critics, to respond to the p-values. One response is to say, no, the p-values aren't real, they're the result of p-hacking. This is pretty much the argument that Loken and I are making, but in particular we're emphasizing that this can be an issue, even if the researchers in question did only one analysis of their particular data.

          And, once the p-value disappears, there’s nothing holding up the study, because all you have left is a small unrepresentative sample with biased and noisy measurements, along with an effect estimate that’s too large to be plausible. The p-value was the only thing holding the paper together, so a deadly criticism of the p-value is fatal to the paper. This is not the case for all published research. There’s a lot of research with high-quality measurements, large and representative samples, and plausible effect estimates.
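
          To give a rough sense of that 1/400 arithmetic, here is a minimal simulation sketch. The setup is purely hypothetical: four correlated analyses of the same null data stand in for the many defensible coding and exclusion choices; it is not a model of the actual Beall and Tracy analysis.

              # Purely hypothetical: no effect exists, but the analyst can choose among
              # k overlapping, defensible analyses after seeing the data.
              import numpy as np

              rng = np.random.default_rng(2)

              n_sims = 100_000
              k = 4        # number of defensible analyses the data could have suggested
              rho = 0.5    # the analyses share most of the data, so they are correlated

              cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)   # equicorrelated z-statistics
              z = rng.multivariate_normal(np.zeros(k), cov, size=n_sims)

              reported = np.abs(z).max(axis=1) > 1.96   # the analysis that "works" gets reported
              per_study = reported.mean()

              print("per-study rate of p < 0.05 under the null:", round(per_study, 3))
              print("chance two studies both come out significant:", round(per_study**2, 4))
              print("nominal chance if both tests were fixed in advance:", 0.05**2)

          Even this cartoon version puts two “significant” null results closer to 1 in 50 than to 1 in 400.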

        • In which case, the right approach (IMO) is just to indiscriminately attack all papers using p-values. Even a paper that did careful measurements with less noise, etc., but used p-values to make its point ought to be regarded as fatally flawed and stridently criticized.

          Even an excellent, meaningful conclusion drawn on the basis of p<0.05 should be considered invalid, because we say statistical significance itself is a metric so flawed that we refuse to use it. E.g., let’s refuse to review papers that use p-values. Or make it a journal policy to not allow significance testing. Or lobby to stop teaching it entirely. Or refuse to blog any study that uses p-values. That position I can understand.

          I’m perfectly fine with you criticizing p-values, but then let’s do that across the board. It’s the bit about invoking such arguments selectively, only when the conclusions are unpalatable, that disturbs me.

        • Rahul:

          Actually, I think p-values can be good data summaries, in the right context. The key is to think of the p-value as a data summary and go from there, rather than to think that p<0.05 or even 0.01 represents a demonstration that a theory is true. Much depends on context, in particular on the relative sizes of signal and noise.

        • Andrew:

          Now I’m utterly confused. In what way are p-values a good data summary? They might or (rather) might not be a good indicator of the generalizability of a particular claim based on the data at hand, but they are mostly useless for describing the actual relationship in the particular sample. I don’t understand in what sense p-values “can be good data summaries”. Do you just mean that they are functions of the sample size and the standard error and thus indirectly inform us about them? But in what way are they preferable to, well, the standard errors and the sample size? Maybe I’m just too tired, but I really don’t understand this remark.

        • Andrew:

          Thank you. I knew that paper already and I think I mostly understand it. Probably I just misunderstood you, since I thought you might mean something additional by “data summaries” here.

  2. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

    Why we (usually) don’t have to worry about multiple comparisons

    It would be nice to get a Cliff’s Notes version of the differences between these articles, because obviously A and not-A are not both true simultaneously. Therefore you are referring to different assumptions, conditions, techniques, etc. Wasserman was very good about producing very simple examples of the underlying idea of a concept and then referring the reader who wanted to know more to the appropriate paper(s). You tend to be prolix (I think it’s your infatuation with literature; the best poem, and palindrome to boot, is “Madam, I’m Adam”): KISS.

    • Numeric:

      I actually have a post scheduled on this. In any case, I think both papers are valuable and I think the (statistical) world would be a worse place with only one, or neither, of them. I stand by what I wrote in both papers.

      But the short answer is that multiple comparisons can be a problem but not so much if we are fitting hierarchical models, which is the point of that other paper of ours.

      • I would have preferred “If you shrink yer damn’ estimates: We (usually) don’t have to worry about multiple comparisons.”

        The title likely suggests, at least to the casual reader, that multiplicity does not matter and you usually don’t need to do anything non-standard (like hierarchical models).

    • Here’s my attempt at a Cliff’s Notes version of the difference.

      Garden of Forking Paths: p-values cannot be taken at face value when data analysis choices are data-dependent (even if in minor/subtle ways).

      No Worries about Multiple Comparisons: Shrink yer damn’ estimates!
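
      A minimal sketch of what “shrink yer damn’ estimates” looks like in practice, with toy numbers and a crude method-of-moments estimate of the between-group variance (nothing here comes from either paper):

          # Toy setup: many small true group effects, each estimated with a lot of noise.
          import numpy as np

          rng = np.random.default_rng(3)

          J = 20
          true_theta = rng.normal(0, 0.5, size=J)   # small true group effects
          se = np.full(J, 2.0)                      # large per-group standard errors
          y = rng.normal(true_theta, se)            # raw per-group estimates

          # Partial pooling: estimate the between-group variance, then pull each raw
          # estimate toward the overall mean, more strongly when noise dominates.
          tau2 = max(np.var(y) - np.mean(se**2), 0.0)
          weight = tau2 / (tau2 + se**2)            # 1 = keep raw estimate, 0 = pool completely
          theta_hat = weight * y + (1 - weight) * y.mean()

          print("raw estimates:     ", np.round(np.sort(y), 1))
          print("shrunken estimates:", np.round(np.sort(theta_hat), 1))

      The extreme raw estimates get pulled sharply toward the overall mean, which is why, once you are fitting this kind of model, the individual comparisons are not each crying out for a separate multiplicity correction.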

        • When drug companies agree with the FDA on how to do the analysis before the studies are conducted, and the FDA then gets all the data and redoes both the exact analysis plan and any modifications that were argued as necessary given what happened in the trials.

          So some data dependence is likely unavoidable, but it can be minor and its effect assessed.

  3. “In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage”

    which should include not only significance but effect size. In fact, primarily effect size.

    • Gotta be careful about that effect size thing. Recently we’ve been talking about a bunch of papers that report statistical significance with huge effect sizes. In this case the problem is not the popular textbook story, “statistically significant but not practically significant,” but rather “statistically significant but just noise, so the reported effect size is pretty much meaningless” (as discussed in this paper with John Carlin: http://www.stat.columbia.edu/~gelman/research/published/retropower20.pdf).
