The “scientific surprise” two-step

During the past year or so, we’ve been discussing a bunch of “Psychological Science”-style papers in which dramatic claims are made based on somewhat open-ended analysis of small samples with noisy measurements. One thing that comes up in some of these discussions is that the people performing the studies say that they did not fish for statistical significance, and then outsiders such as myself argue that the data processing and analysis is contingent on the data, and this can occur even if the existing data were analyzed in only one way. This is the garden of forking paths, of which Eric Loken and I give several examples in our paper (and it’s easy enough to find lots more).

One interesting aspect of this discussion is that researchers with multiple comparisons issues will argue that the comparisons they performed on their data were in essence pre-chosen (even though not actually pre-registered) because they are derived from a pre-existing scientific theory. At the same time, they have to make the case that their research is new and exciting.

The result is what I call the “scientific surprise” two-step: (1) When defending the plausibility of your results, you emphasize that they are just as expected from a well-established scientific theory with a rich literature; (2) When publicizing your results (including doing what it takes to get publication in a top journal, ideally a tabloid such as Science or Nature), you emphasize the novelty and surprise value.

It’s a bit like applying for an NSF research grant. In the proposal, you pretty much need to argue that (a) you can definitely do the work, indeed you pretty much already know what the steps will be, and (b) this is path-breaking research, not just the cookbook working-out of existing ideas.

Don’t get me wrong: It’s possible that a result can be (1) predicted from an existing theory, and (2) newsworthy. For example, the theory itself might be on the fringe, and so it’s noteworthy that it’s confirmed. Or the result might be noteworthy as a confirmation of something that was already generally believed.

But if the result is genuinely a surprise, even to the researchers who did it, this should suggest that the finding is more exploratory than confirmatory.

Sociologist Jeremy Freese picked up on this in a series of posts on the notorious recent study of himmicanes and hurricanes. Freese writes:

A common narrative is that something inspires a hypothesis, researchers conduct a study to test that hypothesis, and then, more than merely finding a result that supports their hypothesis, the researchers were shocked by how big the effect turned out to be. . . .

If the only way somebody could have gotten a publishable result is by finding an effect about as big as what they found, doesn’t it seem fishy to claim to be shocked by it? After all, doing a study takes a lot of work. Why would anybody do it if the only reason they ended up with publishable results is that luckily the effect turned out to be much more massive than what they’d anticipated?

Wouldn’t you hope that if somebody goes to all the trouble of testing a hypothesis, they’d do a study that could be published as a positive finding so long as their results were merely consistent with what they were expecting?

Let’s consider this for fields, like lots of sociology, in which a quasi-necessary condition of publishing a quantitative finding as positive support for a hypothesis is being able to say that it’s statistically significant–most often at the .05 level. Now say a study is published for which the p-value is only a little less than .05. Here it is obviously dodgy for researchers to claim surprise. They went ahead and did their study, but had the estimated effect been much smaller than their “surprise” result, they wouldn’t have been able to publish it.

Then Freese flips it around:

Of course, a different problem is that researchers may begin an empirical project without any actual notion of what effect size they are imagining to be implied by their hypothesis. . . . I [Freese] have taken to calling such projects Columbian Inquiry. Like brave sailors, researchers simply just point their ships at the horizon with a vague hypothesis that there’s eventually land, and perhaps they’ll have the rations and luck to get there, or perhaps not. Of course, after a long time at sea with no land in sight, sailors start to get desperate, but there’s nothing they can do. Researchers, on the other hand, have a lot more longitude—I mean, latitude—to terraform new land—I mean, publishable results—out of data that might to a less motivated researcher seem far more ambiguous in terms of how it speaks to the animating hypothesis.

I agree completely. It’s the garden of forking paths, in the ocean!
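
To put a number on Freese’s point about the .05 filter, here is a minimal sketch in Python (my own illustration, assuming a two-group design with 50 participants per group, not figures from any study Freese discusses). With a fixed sample size, only estimates above some minimum size can clear p < .05 at all, so a researcher who barely cleared the threshold could not have published anything smaller:

```python
# Minimal sketch: the smallest standardized effect that can clear p < .05
# in a two-group comparison, assuming n = 50 per group (an illustrative number).
from scipy import stats

n_per_group = 50
df = 2 * n_per_group - 2
t_crit = stats.t.ppf(0.975, df)                    # two-sided .05 cutoff
se_d = (1 / n_per_group + 1 / n_per_group) ** 0.5  # approximate SE of Cohen's d
min_publishable_d = t_crit * se_d

print(f"Smallest effect reaching p < .05 with n = {n_per_group} per group: "
      f"d = {min_publishable_d:.2f}")
# Roughly d = 0.40: any published estimate from such a study had to be at least
# this large, so "shock" at its size sits oddly with the design.
```

Run the same arithmetic with 20 per group and the minimum jumps to roughly d = 0.64, which is Freese’s point in miniature: the smaller the study, the more “shocking” any publishable estimate has to be.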

24 thoughts on “The ‘scientific surprise’ two-step”

  1. http://en.wikipedia.org/wiki/Myth_of_the_Flat_Earth#Irving.27s_biography_of_Columbus

    The roundness of the earth, along with fairly good estimates of its size, has been well known since ancient times. Columbus had his own mistaken estimates, which put Asia much closer to Europe and might have led to disaster, but he was saved by the existence of continents unknown to him.

    I call this “Columbian Progress” since it illustrates a fact familiar to most accomplished scientists, namely

    “make sure your errors come in twos: there’s a chance they’ll cancel.”

  2. Every experiment is a test of multiple theoretical statements — the central scientific question as well as various supporting theories, measurement/instrumentation theories, etc. So it’s important to ask whether the thing someone is claiming to be surprised by is the same thing they’re claiming was well supported. For example, it’s totally consistent to say that you were surprised by the relationship between X and Y, but that (a) standard practice in a field constrains the potential analytic choices in how you measure X and/or (b) previous research supports the validity of your measurement procedure of X (and likewise for Y). In fact, not only is that consistent, it’s highly desirable.

    Since you’re linking this issue to multiple-comparison issues, it is also important to recognize that (a) and (b) above are different things. Take the recent example about inequality and happiness (http://statmodeling.stat.columbia.edu/2014/07/21/skepticism-published-claim-regarding-income-inequality-happiness/). As a social/personality psychologist I can tell you that it is standard practice in my field to apply an equal-interval assumption when scoring rating scales. One of the criticisms De Libero made was that the equal-interval assumption of the happiness scale may be incorrect. It’s totally fair to question that assumption — but in order to make his point he reanalyzes the data in a way that opens up a massive multiple-comparison problem where a social-personality psychologist who felt constrained to follow their field’s standard practices would have had none. My reaction to that part of the critique was (1) I was completely unsurprised that it was possible to find a rescaling that changes the p-value, and (2) if I was an editor or reviewer on a manuscript where somebody did some kind of standard-practice-defying scaling of a rating scale and only justified it with “it didn’t seem reasonable,” I’d tell them they need to come back with something stronger than that — exactly because of the multiple comparison problem it creates.

    It is not surprising that, as an outsider, De Libero does not fully appreciate the strength of that precedent and the role it serves. Frankly it’s valuable to have outsiders come in and question assumptions that everybody in your field takes for granted. But it’s also frustrating when they miss the value in the way things are done by the domain experts. I have had similar frustration watching your back and forth with Beall and Tracy, because I think your forking paths paper makes a lot of good points but it also misses some of the value in their responses to your critique.

    • “if I was an editor or reviewer on a manuscript where somebody did some kind of standard-practice-defying scaling of a rating scale and only justified it with “it didn’t seem reasonable,” I’d tell them they need to come back with something stronger than that — exactly because of the multiple comparison problem it creates.”

      I understand the spirit of what you’re saying, but also think this view (which, as far as I can see, is very common in our field) is not without some big problems. I think it ends up being one of the many factors contributing to the slow rate of adoption among researchers of more modern and technically sound data analytic procedures. Basically researchers know that there are many reviewers/editors who are likely to say “Why did you use fancy-pants statistical method X? It must be because you tried the traditional methods first and the data did not favor your hypothesis, so you went fishing around using other procedures!” And hey, I’m sure in some cases this is probably true. But whether it’s correct or not in any particular case, the basic result is that researchers tend to get penalized for using better, more modern statistical procedures, and tend to be rewarded for using questionable but traditional ones. So a long-term side-effect of the sentiment you expressed is that the field as a whole advances statistically at a glacial pace, because researchers are afraid of using better methods for fear of upsetting reviewers/editors.

      Pre-registration of studies and analysis plans could help with this by allowing researchers to specify in advance all the fancy-pants procedures they are going to use in order to maximize their future chances of finding true effects.

      • In many cases though, the fishing for methods is very real. Also, oftentimes the benefit of the more modern procedures is marginal at best.

        The most robust or really relevant findings are often clear even when decades-old statistical procedures are applied (correctly).

  3. I’m all for using newer and better statistical procedures. And I’ve been on the receiving end of the kind of resistance you’re describing. But that is not what I was talking about. I’m talking about an author — or critic! — who wants to override standard practice (and give himself extra researcher df) without sufficient justification. “It didn’t seem reasonable” was an actual quote from the critique, and there were no fancy-pants statistics involved in the rescaling – he just doubled the interval between 2 response options based on a hunch and then re-ran the analysis.

    That’s why I said “they need to come back with something stronger.” I’d be happy with pre-registration. Or analyses that validate the alternative scaling in an independent dataset. Probably even a fancy-pants statistical procedure with a demonstrated track record. Or best, all three.
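
    As an aside, here is a minimal sketch (simulated 1–5 ratings with no true group difference, and made-up recodings, not De Libero’s actual reanalysis) of the forking-paths concern with ad hoc rescaling: each monotone recoding of the response options, such as doubling one interval, is another comparison, and the p-value can drift across recodings even when nothing real is going on:

    ```python
    # Minimal sketch: ad hoc recodings of a 1-5 rating scale as extra comparisons.
    # Simulated data with NO true group difference; the recodings are made up.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group_a = rng.integers(1, 6, size=100)   # raw responses 1..5
    group_b = rng.integers(1, 6, size=100)

    codings = {
        "equal interval":    [1, 2, 3, 4, 5],
        "double top gap":    [1, 2, 3, 4, 6],    # the "double one interval" move
        "double bottom gap": [1, 3, 4, 5, 6],
        "compress middle":   [1, 2, 2.5, 3, 4],
    }

    for name, codes in codings.items():
        scores_a = np.array(codes)[group_a - 1]
        scores_b = np.array(codes)[group_b - 1]
        _, p = stats.ttest_ind(scores_a, scores_b)
        print(f"{name:17s} p = {p:.3f}")
    # The p-values wobble across recodings even though nothing real is going on;
    # picking a recoding after seeing the data is one more researcher degree of freedom.
    ```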

    • Sanjay,

      You mention “a demonstrated track record”, and above “the value in the way things are done by the domain experts”. What is it that has convinced you that the method has been useful and led to correct inferences?

      For example, in medical research we have had a few reports claiming only 10-30% of results are replicated:
      http://www.newscientist.com/article/mg21528826.000-is-medical-science-built-on-shaky-foundations.html#.U9v5FLFQyoU

      I have not seen the people who support the “well established” methods used in this type of research do anything to address these issues, so we do not know whether the problem is on the Bayer/Amgen end or the academia end. However, if we accept that the majority of researchers are coming to false conclusions (as the publicly available info currently suggests), this would seem to invalidate any argument about track record, etc.

      Is there data on the track record of sociology? What approaches have people taken to assess this track record?

      • Question,

        As I said above, it’s important to distinguish between the multiple comparisons issue and the validity issue. When standard practice / precedent constrains researchers to do things one particular way, that reduces the multiple comparisons issue. It is still possible for a field to settle on a wrong way of doing things, or (more likely) a less-than-perfect way of doing things. But since departing from standard practice opens up the multiple comparisons problem, it is reasonable to expect some kind of support or justification when people claim that a novel approach is an improvement.

        Replicability is a complicated issue. Multiple comparisons and the Gelman & Loken “garden of forking paths” problem are definitely a major part of it. I have argued on my blog (http://wp.me/pt9Wa-yt) that we psychologists should do more to independently validate our methods, especially experimental manipulations (we have more of a tradition of independently validating measurements than experimental manipulations). That would help both the multiple comparisons issue and the validity issue.

        I don’t know about sociology, but in psychology there are efforts to study the replicability of findings in our major journals, such as the OSF Reproducibility Project: https://osf.io/ezcuj/wiki/home/

        • Sanjay,

          I agree that deviating from “well established” procedures often suggests multiple comparisons have occurred. I’d also say that along with attempts to “sample to a foregone conclusion”, this is probably the biggest p-hacking technique used by biomed researchers. However, I am wary of unreferenced claims that well-established procedures have been productive. Often this seems to be based only on what has been acceptable for publication. As another example from biomed, we can find reports that nothing has been accomplished on treating people with brain injuries for many years, e.g.:

          “Despite claims to the contrary, no clear decrease in TBI-related mortality or improvement of overall outcome has been observed over the past two decades.”
          http://www.ncbi.nlm.nih.gov/pubmed/23443846

          In that case, if there were a clear decrease in mortality, etc., then we would have some basis for believing that established procedures have led to practical results. This is not what has occurred, so I have trouble taking the claims made by that field seriously. A weaker type of evidence would be multiple independent replications demonstrating similar effect sizes (of course the experimental design being repeated may still be flawed in some way).

          Another thing, looking at: Investigating Variation in Replicability: A “Many Labs” Replication Project
          https://osf.io/wx7ck/osffiles/ManyLabsManuscript.pdf/version/2/download/

          They claim:
          “This research tested variation in the replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants. In the aggregate, 10 effects replicated consistently.”

          But figure 1 shows only 2/13 original effect sizes within the 99% CIs. The definition of replication used to get the 10/13 number appears to be a new, weaker definition. They did not demonstrate a stable effect.

        • Sanjay,

          A couple of instances come to mind where I (as an “outsider”) question choices that seem common in your field. I would be interested in your opinion on them, and on my reasons for questioning them.

          1. In stereotype susceptibility research, it seems to be the norm to consider a measure called “accuracy,” which is defined as “number correct divided by number attempted,” where number correct/attempted refers to questions on a math exam. This seems strange to me since, for example, it considers getting 2 correct out of 2 attempted as “better” performance than, say, 8 correct out of 12 attempted. (And also equivalent to 12 correct out of 12 attempted.) This seems pretty arbitrary to me – I would consider 8 correct out of 12 attempted to be better than 2 correct out of 2 attempted, and 12 correct out of 12 attempted to be better performance still. I also have concerns with “accuracy” from a statistical perspective: as a quotient of two random variables (number correct and number attempted), it may have a distribution with strange properties. I have not done a thorough investigation as to why this measure was chosen, but based on what little I have done (see http://www.ma.utexas.edu/blogs/mks/2014/06/22/beyond-the-buzz-on-replications-part-i for a discussion of this), I haven’t found a convincing rationale, so can’t help but wonder if it was chosen because the more natural measure (simply “number correct”) did not give the hoped-for results. (Why I wonder this: The papers I looked at use both measures, but in most cases, only “accuracy” gives statistical significance.) (A numerical sketch of this point and the next appears just after this comment.)
          2. Related to one of your examples in an earlier comment on this thread: I don’t see how one could ever justify the “equal interval” assumption in scoring rating scales. But I also don’t see why the assumption is needed, anyhow. My guess (please correct me if I’m wrong) is that it is made so “averages” can be interpreted. But this seems artificial to me. It would make more sense to me to say, “OK, we can’t really justify an “equal interval” assumption, but as long as the rating options are ordered, we can still make sense of the *median*, so let’s look at medians rather than means.” This would have a second advantage: If ratings turn out to be skewed (which I would guess they often do – floor and ceiling effects, etc.), then looking at the median makes much more sense than looking at the mean anyhow. (After all, the mean and median are the same for symmetric distributions, so whenever we do something based on a normal distribution assumption, we are just looking at the median. And I don’t know about you, but one thing I’ve always stressed when teaching intro stats is that for a skewed distribution, the “five number summary” is more meaningful than mean and standard deviation.)
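
          Here is the brief numerical sketch mentioned above (my own made-up numbers, not data from any stereotype study): point 1 is the ranking quirk of the accuracy ratio, and point 2 is how the mean and median of a ceiling-skewed rating distribution can tell different stories.

          ```python
          # Minimal sketch of both points, with made-up numbers.
          import numpy as np

          # Point 1: "accuracy" = correct / attempted ignores how much was attempted.
          for correct, attempted in [(2, 2), (8, 12), (12, 12)]:
              print(f"{correct:2d}/{attempted:2d}: accuracy = {correct / attempted:.2f}, "
                    f"number correct = {correct}")

          # Point 2: mean vs. median for a ceiling-skewed 1-5 rating distribution.
          ratings = np.array([5] * 50 + [4] * 25 + [3] * 10 + [2] * 10 + [1] * 5)
          print(f"mean = {ratings.mean():.2f}, median = {np.median(ratings):.1f}")
          # Output: accuracy scores 1.00, 0.67, 1.00, so 2/2 "beats" 8/12 and ties 12/12;
          # the skewed ratings give mean 4.05 but median 4.5.
          ```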

        • Martha,

          I don’t have an answer to your first question, as that is not my research area. I do research in interpersonal perception of personality that uses the concept of “accuracy” but we implement it quite differently. You’d have to ask someone who works on stereotype threat if they have a good rationale for that method.

          As for your second question, I think the field has settled on using an interval assumption for rating scales because it is easy to implement in a wide range of circumstances and it works well enough most of the time. Making that assumption means you can score items the same way whether they are single items or parts of multi-item scales and whether they are predictors or response variables; you can analyze them with any least-squares method; and you can compare scores to the large body of existing research that scores things the same way; and so on.

          My own observation is that statisticians (and quantitative psychologists) are often inclined to maximize the statistical correctness of a method, while researchers are willing to satisfice in order to get science done. An alternative way of working with rating scale data (or anything else) can’t just be better “on paper,” it has to be pragmatically better and by enough of a margin under a wide range of realistic circumstances that it’s worth abandoning the advantages of the current method. The psychometrics literature is full of little-cited articles that show some theoretical improvement and maybe demonstrate it in a simulation or toy dataset, but do not or cannot make a strong enough case that its application will lead to new scientific insights or solve previously unsolved problems. (And conversely, the articles that do do that tend to be transformative.)

        • Sanjay,
          Thanks for your reply. Your perspective seems quite different from what I am accustomed to hearing from the biologists and engineers I sometimes work with — they seem to be more oriented toward the belief that methods often need to be suited to particular circumstances, rather than being better under a wide range of realistic circumstances.

        • Martha,

          I do have the impression that at least some sociologists would think similarly. To be honest, I’m surprised that biologists belong to the “other side” here; my impression so far was that they think somewhat alike, in that they only use “non-standard” methods if those have already been shown to be of real practical use and to make a difference in a few similar areas. I think it has a lot to do with education. In all of these fields, even the applied quantitative researchers (not the methodologists) do not have a large amount of formal mathematical and/or statistical training. For them, learning and understanding new methods is a lot of work, and they don’t want to have to do that for every new problem. This is more of a hypothesis than a proven theory, but it’s based on some experience talking with people from those fields and coming from sociology myself (not sure if I would really call myself a sociologist…). Maybe others can add their impressions!

        • Daniel,
          I guess my phrase “the biologists … I sometimes work with” was ambiguous. I intended it to refer just to those with whom I sometimes work (mostly evolutionary biologists), not to biologists in general. Bayesian methods, bootstraps, principal components analysis, and graphical analyses are often used there. Different areas of biology often require quite different statistical methods.

  4. Here’s a piece by Chabris in Slate on replicability in priming experiments:

    http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html

    My theoretical question is: Shouldn’t social psychology’s ideological bias in favor of social constructionism make it less surprising when priming experiments don’t replicate? If I say that my experiment in inducing Behavior X is the result of the universal genetic inheritance of humanity, and then you can’t replicate my results, well that makes my genetic determinism look bad.

    But if you say that most behaviors such as Behavior X are socially constructed by contingent factors that are highly malleable, and then somebody can’t replicate that Behavior X several years later in a different part of the country, well, why does that make me look bad? Maybe there was just a fad for doing Behavior X a few years ago and now the fad is over?

    For example, 100 years ago, I could have easily primed college students to say the word “Skidoo.” How? By saying “23” to them. To college students in 1914, 23 and Skidoo were inextricably linked. “23” was a surefire stimulus for the response “Skidoo.” Why? It was some socially constructed fad that all the young folks were crazy over.

    In 2014, however, not so much.

    A problem in Social Psychology is that the researchers won’t admit the full implications of their Social Constructionist ideology. They want the dignity of Scientists discovering Truths about Nature, like Newton and Einstein. They don’t want to be thought of as less nimble marketing researchers churning up trivia that will be outdated soon.

    • Here’s a question: violence and the capacity for it surely must have a significant genetic component, given its relation to survival. So then how does a place like Iceland, which according to the Icelandic Sagas was one of the most violent societies on earth, go to being one of the most peaceful societies on earth?

      Was violence a social fad or did Icelanders evolve?

      Also you might have missed this:

      http://statmodeling.stat.columbia.edu/2014/07/23/world-without-statistics/#comment-182828

    • Whether cultural associations change over time and for different groups of people is not really what’s at stake at all with the social priming research. Obviously cultural associations change. The theoretical motivation of social priming research is not to catalog different associated pairs. It is more basically the idea that such high-level priming effects — i.e., brief stimulus exposures influencing behavior in subtle ways, most typically circumventing the intentions/awareness of the person in question — can occur at all.

      Naturally many of the associations that generally held 100 years ago (“23” and “skidoo”) no longer hold today, but the claim here would be that social priming as a general phenomenon worked then and it works now, albeit on different concepts. And it is this more basic claim that is being contested in recent years. So whether a particular social priming effect will or will not replicate in distant times and places is not particularly interesting to either “side” of the debate (although if the idea is true then we would at least hope that it replicated in similar times and places). What *is* interesting is the possibility that *none* of the effects replicate, in *any* time or place, and that the original results themselves are not the results of complex social constructions but, rather more simply, statistical false alarms.

      • How is “priming” different from “marketing?”

        A large part of marketing consists of trying to influence consumers to give you their money with quick-hitting stimuli that aren’t necessarily consciously processed.

        For example, the first time I stuck my head in a Chipotle restaurant, the stereo was playing New Order’s 1982 classic “Temptation” and the typeface on the menu above the counter was some kind of trendy sans-serif font that I associate with minimalist good taste:

        http://social-brain.com/tag/chipotle-marketing/

        In other words, Chipotle’s marketing gurus put in a lot of effort to reassure college educated not-so-young people like me that this is the kind of place that appeals to college educated not-so-young people like me.

        Most of the time, I can’t articulate the reasons for my impression of which demographic the marketing is aimed at as well as I could with Chipotle, but this kind of priming is a massive part of our economy. I can’t imagine that this kind of priming never works. As John Wanamaker said, I know I’m wasting half of my advertising budget, I just don’t know which half.

        • @Sailer:

          I think what’s happening here is that you are using an overly broad, personal definition of “priming”.

          Psychologists, OTOH (I suspect), use the word to refer to a very specific effect, in a way grossly different from how Sailer is trying to use it.

          E.g., I doubt Chipotle’s appealing to a certain demographic is “priming” of the type that is of interest to a psychologist. The stimulus is neither brief nor subtle, and the subjects seem quite aware of it.

  5. Pingback: Briefly | Stats Chat

  6. Pingback: Friday links: highly significant increase in marginal significance, hidden female authors, the evolution of Groot, and more | Dynamic Ecology

  7. Pingback: "Surely our first response to the disproof of a shocking-but-surprising claim should be to be un-shocked and un-surprised, not to try to explain away the refutation" - Statistical Modeling, Causal Inference, and Social Science Statistical Modeli
