Failure of failure to replicate

Dan Kahan tells this story:

Too much here to digest probably, but the common theme is—what if people start saying their work “replicates” or “fails to replicate” when the studies in question are massively underpowered &/or have significantly different design (& sample) from the target study?

1. Kahan after discovering that authors claim my study “failed to replicate”:
 
On Thu, Aug 10, 2017 at 6:37 PM, Dan Kahan <[email protected]> wrote:
Hi, Steve & Cristina.
So predictably, people are picking up on your line that “[you] failed to replicate Kahan et al.’s ‘motivated numeracy effect’.”
As we have discussed, your study differed from ours in various critical respects, including N & variance of sample in numeracy & ideology. I think it is misleading to say one found no “replication” when study designs differ.  All the guidelines on replication make this point.

2. Them–acknowledging this point

co-author 2

Hi Dan,

If we didn’t, we should have said “conceptual replication.” I certainly agree we didn’t fail to replicate any specific study of yours. And we could have had a bigger N and more conservatives. That’s why we haven’t tried to publish the work in a journal, just a conference proceedings. But, as appealing as the hypothesis is, Cristina’s work does leave me with less faith in the general rule that more numerate people engage in more motivated reasoning using the contingency table task.

best, s

lead author:
Hi Dan,
I agree– we should have used a phrase other than “replication” in describing those parts of the results. 
To add, I tried to make it clear in our poster presentation, as well as in our paper, that the effect of reasons we found was not predicated on the existence of the motivated numeracy effect. And I explicitly noted that this null result was likely attributable to the differences between the two studies– in fact, many people I talked to pointed out the difference in N and the differences in variance on their own.
Cristina

3. Kahan—trying to figure out how they can acknowledge somewhere that their studies aren’t commensurable w/ ours & it was a mistake to assert otherwise

Hi, Steve & Cristina.

Thanks for reflecting on my email & responding so quickly.
I am left, however, with the feeling that your willingness to acknowledge my points in private correspondence doesn’t address my objection to the fairness of what you have done in an open scholarly forum.
You have “published” your paper in the proceedings collection. The abstract of your paper states expressly “we failed to replicate Kahan et al.’s ‘motivated numeracy effect.’ ” In the text you state that “you attempted to replicate” our study and “failed to find a significant effect of motivated numeracy.”
The perfectly foreseeable result is that readers are now treating your study as a “failed replication” attempt, notwithstanding your acknowledgement to me that such a conclusion “clearly,” “definitely” isn’t warranted. Expecting them to “figure this out” for themselves isn’t realistic given the limited time & attention span of casual readers, and the lure of the abstract.
I think the fair thing to do would be to remove the references to “failed replication” and to acknowledge in the paper that your design — because of the N and because of the lack of variance in ideology & numeracy in the study subjects — was not suited for testing the replicability of ours.
 
Anything short of this puts me in the position of bearing the burden of explaining away your expressly stated conclusion that our study results “didn’t replicate.” Because my advocacy would be discounted as self-serving, I would suffer an obvious rhetorical handicap. What’s more, I’d be forced to spend considerable time on this at the expense of other projects I am working on.
Avoiding such unfairness informs the protocols for replication that various scholars and groups of scholars have compiled and that guided the Science article. I’m sure you agree that this is good practice & hope you will accommodate me & my co-authors on this.
–Dan
 
4. Co-author tells me I should feel “honored” that they examined my work & otherwise shut up; also, “replication” carries no special meaning that justifies my focus on it…

Dear Dan,
I will speak for myself, not Cristina.
You seem to have misunderstood my email. I am not taking back our claim that we failed to replicate. What I said is that I admitted that we could have characterized it as a failure of a “conceptual replication.” This is still a type of replication. We were testing an hypothesis we derived from your paper, we used a similar experimental procedure though a wildly smaller N, which we tried to counterbalance by giving each subject more tasks to do. So we had more data than you per subject. We also only tested half your hypothesis in the sense that we didn’t have many conservatives. Nevertheless, we fully expected to see the same pattern of results that you found. But we didn’t; we found the opposite. We were surprised and disappointed but nevertheless decided to report the data in a public forum. I stand by our report even if you don’t like one of our verbs.
Even if we wanted to, we couldn’t deliver on your request. The proceedings have been published. It’s too late to change them. But the fact is that I wouldn’t want to change them anyway. Yes, we could have added the word “conceptual” in a couple of places. But that wouldn’t change the gist of the story. There are failures to replicate all the time. Ours is a minor study, reported in a minor venue. If people challenge you because of it, I’m sure you’re smart enough and have enough data to meet the challenge. I think you should consider it an honor that we took the time and made the effort to look at one boundary of your effect. If you feel strongly about it, then feel free to go out and explain why our data look the way they do. Simply saying our N was too small and our population too narrow explains nothing. We found some very systematic effects, not just random noise.
all the best, steve

Just to interrupt here, I agree with Dan that this seems wrong.  Earlier, Steve had written, “I certainly agree we didn’t fail to replicate any specific study of yours,” and Cristina had written, “we should have used a phrase other than ‘replication’ in describing those parts of the results.”  But now Steve is saying:

I stand by our report even if you don’t like one of our verbs.

I guess the verb here is “replicate”—but at the very least it’s not just Dan who doesn’t think that word is appropriate.  It’s also Cristina, the first author of the paper!

The point here is not to do some sort of gotcha or to penalize Cristina in any way for being open in an email. Rather, it’s the opposite: the point is that Kahan is offering to help Steve and Cristina out by giving them a chance to fix a mistake they’d made—just as, earlier, Steve and Cristina were helping Dan out by doing conceptual replications of his work. It seems that those conceptual replications may have been too noisy to tell us much—but that’s fine too, we have to start somewhere.

OK, back to Dan:

5.  So Dan writes attached paper w/ co-author:
Futile gesture, no doubt.
Kahan concludes:
This is a case study in how replication can easily go off the rails. The same types of errors people make in non-replicated papers will now be used in replications.

19 thoughts on “Failure of failure to replicate”

  1. I would have thought that a tenured prof who studies ‘probability judgment’ (Sloman) would do better than this. To be clear, when we refer to small samples here, Sloman’s paper is N = 55 (!!!!). Having ‘more measures per subject’ isn’t going to resolve this. Yikes.

    • I haven’t read the Sloman paper, but the claim “having more measures per subject isn’t going to [help]” is incorrect. More observations per subject almost always help to increase power. (The degree to which repeated measures increase power depends on the within-individual correlation across observations; a quick numerical sketch of this trade-off appears after this exchange.)

      • Yes, I’m well aware. But it doesn’t help for the specific context of this analysis and claimed replication, which is what I was referring to (i.e., the specific contents of the Sloman piece). I certainly didn’t mean to imply problems for the within subjects argument in principle.
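
        To make the trade-off in this exchange concrete, here's a minimal back-of-the-envelope sketch in R, assuming a simple random-intercept model and an invented within-subject correlation (neither value comes from the papers under discussion):

        n   <- 55                  # number of subjects (the N mentioned above)
        rho <- 0.5                 # assumed within-subject correlation (ICC); illustrative only
        m   <- c(1, 2, 5, 10, 50)  # measures per subject

        # SE of the group mean under a random-intercept model, total variance scaled to 1:
        se <- sqrt((rho + (1 - rho) / m) / n)
        round(data.frame(m, se), 3)

        # More measures help, but the SE never drops below the between-subject floor:
        round(sqrt(rho / n), 3)

        More measures per subject do shrink the standard error, but only toward a floor set by the number of subjects, so both commenters have a point.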

  2. Looking at figure 2, I don’t think they replicated the results for the “high numeracy” case. They need to figure out why the estimates were different for the “skin rash” and “identity affirmed gun” conditions (I have no idea what is being measured, only glanced at the figures):
    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3026941

    Anyway, I think the word “replication” has become a meaningless buzzword in some corners of research so I wouldn’t worry too much about its misuse anymore. Even as part of that psych replication project there were some people that changed the methods for whatever reason…

    For now, “direct replication” still has meaning. However, someday I expect you will need: “Real actual direct replication wherein we attempted to follow the previous methods as faithfully as possible”.

    • Something I was thinking was that mayhaps the experimental treatment doesn’t affect the numeracy directly, but rather the participant’s criterion for choosing the correct answer. Now, this sort of analysis can’t be performed based on the data supplied in the article (NB: based on eyeballing it rather quickly with my slimy eyeballs and ctrl+F-ing with my sticky fingers), but to demonstrate here’s some R code–since who wouldn’t LOVE some R code!

      # Two signal-detection-style panels: in each, the left-hand curve is the
      # distribution for incorrect answers, the right-hand curve for correct
      # answers, and the dashed line is the decision criterion.
      par(mfrow = c(1, 2))
      curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
            main = "High numeracy")
      curve(dnorm(x, qnorm(0.8) * sqrt(2)), -3, 6, add = T)  # separation giving ~80% correct under the difference model below
      abline(v = 0, lty = 2)

      curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
            main = "Low numeracy")
      curve(dnorm(x, qnorm(0.5) * sqrt(2)), -3, 6, add = T)  # qnorm(0.5) = 0, so the curves coincide (50% correct)
      abline(v = 0, lty = 2)

      (The means are based on estimating the average number of correct answers from one of the figures, I forgot which one)

      Here the distributions on the left represent incorrect answers and the distributions on the right-hand side represent the correct answers. From a statistical viewpoint, the subject gets a “sample” from both of the distributions, observes the difference and then responds based on an internal criterion–the dashed vertical lines.

      The “decisional scale” represents the subject’s, uh, internal feel about the correctness of the answer: higher numeracy skills will result in a larger difference between the modes of the distributions, indicating clearer distinction between correct and incorrect answers.

      In the figure presented here, the distributions in the low numeracy group overlap, indicating that they have no feel whatsoever about the correct answer.

      Anyway, without going into further details it should be clear that the proportion of correct answers could depend on two things: the “internal feel” (glah, why can’t I come up with a better term, damn flu) about the correct answer, i.e. what would be theoretically meaningful to call “numeracy skill”, and the decisional criterion. Based on quick skimming of the paper I don’t see this possibility ruled out.

      • Also, something that popped into my bored and feverish brain is that any variability between subjects regarding the decisional criterion would in this sort of analysis lead to (a seeming) reduction of numeracy. The magnitude of this reduction is non-identifiable. So if for whatever reason the task structure or some internal processes would lead to more varied criteria between subjects, this would drag the numeracy scores in that group down.
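
        A small sketch of that point, using the same decision model as the R code above; the between-subject spread of the criterion (tau) is invented for illustration:

        # Same setup as the figures above: P(correct | criterion c) = pnorm((d - c) / sqrt(2))
        d   <- qnorm(0.8) * sqrt(2)   # the "high numeracy" separation from the earlier code
        tau <- c(0, 0.5, 1, 2)        # assumed sd of the criterion across subjects (invented)

        # Averaging over criteria c ~ N(0, tau^2) gives group accuracy pnorm(d / sqrt(2 + tau^2))
        acc <- pnorm(d / sqrt(2 + tau^2))
        round(data.frame(tau, acc), 3)
        # tau = 0 reproduces the 80%-correct group; larger tau looks just like a smaller d

        Averaging over subjects whose criteria vary is indistinguishable, at the level of proportion correct, from giving every subject a smaller separation d.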

      • Basically you are saying there could be the “numerate” and “heuristic” methods described in the paper, but you also propose an “I have no idea what I’m looking at” method. I don’t see why that can’t happen too, but I suspect an important part of the result is that the “low numeracy” people are doing worse than 50/50.

        They show a table like (testing out a new formatting strategy here):


                             Rash got worse   Rash got better
        Did use cream              223               75
        Didn't use cream           107               21

        Most participants use a heuristic form of analysis. First, they compare the number of “successes” to the number of “failures” in the treatment group. They then compare the number of successes in the treatment group to the number of successes in the control group. If the number of successes in the treatment group exceeds both the number of failures in the treatment group and the number of successes in the control, people tend to classify the experiment as proof of the efficacy of the treatment. If not, they characterize the evidence as supporting the inference that the treatment was ineffective.

        To put this information in a less confusing format, I did (using a computer, not sure if this was available to the participants):
        a = 223/(223+75) ~ .75
        b = 107/(107+21) ~ .84

        Then I compared a > b, which is false. The heuristic is apparently to do 223 > 75 & 223 > 107, which is true. I suppose the former is the correct method and the latter the incorrect method. (A quick check of these proportions in R follows this comment.)

        Either way, I can't conclude whether the cream is helping or not from this info. To start with: Were the researchers blinded? How was "rash got better/worse" determined? What does it mean to "use" the cream? Did the cream make the rash start going away but cause a breakout of zits instead, so people stopped using it?

        So if they phrased the question like "does this prove the cream makes the rash get worse/better?" I would answer "no" regardless of the numbers. What exactly did they ask the subjects? I'm sure the answers to these questions would lead to more questions...

        It could be that critical thinking is triggered more often when the data appears to be "identity threatening". If there is nothing threatening about the conclusion people may fall back on the "numerate heuristic" of saying "a > b, the treatment works, the end". It is interesting that the supposed "correct answer" here seems to amount to statistical significance thinking.
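
        For what it's worth, here is the arithmetic above as a quick R check (the exact wording the study put to participants may of course differ):

        rash <- matrix(c(223,  75,
                         107,  21),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("Used cream", "Didn't use cream"),
                                       c("Got worse", "Got better")))

        round(prop.table(rash, margin = 1), 2)
        # "Got better" rate: about 25% with the cream vs 16% without, so the proportion
        # comparison favors the cream, while the heuristic (223 > 75 and 223 > 107)
        # points the other way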

        • I think you’ve misunderstood me a bit–and I do not blame you: looking back to what I wrote, it was quite messy and misleading.

          Now, let us first forget about heuristics for a moment; it doesn’t really matter what sort of strategy the participants are concretely using. Whatever heuristic they’re using leads them to having some degree of certainty about which option is the correct one: if they are using the correct heuristic, they’ll end up having some positive amount of certainty, and they’ll give—on average—more than 50 percent correct answers.

          This can be quantified with the following formula:

          curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
                xlab = "Certainty that the correct answer is correct")

          Indeed, when the “certainty” is zero, the participant doesn’t really have any idea what to do, they are as certain about both of the options, and will respond randomly. If the certainty is _negative_, then they’ll be more certain about the _incorrect_ option being correct and end up answering correctly less than 50 percent of the time. Conversely for positive values.

          This holds when we assume that the participants are unbiased. Let us instead assume that the participants aren’t unbiased. This means that the participants can be biased towards selecting one of the options, even against their “internal certainty”. The next figure will plot the behaviour of biased participants:

          curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
                xlab = "Certainty that the correct answer is correct", ylim = c(0, 1))
          curve(pnorm((x - 2) / sqrt(2)), -1, 1, add = T,
                col = "red")
          curve(pnorm((x + 2) / sqrt(2)), -1, 1, add = T,
                col = "blue")
          abline(v = 0.7, lty = 2)

          In this figure the black line plots the behaviour of an unbiased participant, as before, but the red and blue lines plot the behaviour of biased participants. It is important to note that the “internal certainty” for the participants represented by the red and blue lines is the same as for the participant represented by the black line: their probabilities of responding correctly differ only because of their decisional bias.

          The dashed vertical line marks a fixed point on the “certainty” scale: even if the certainty stays the same (here 0.7), a participant who is biased towards one of the options will have a higher or lower probability of selecting the correct answer, depending on the sign of the bias.

          In this way it is not necessarily the numeracy that is affected–which would be causally linked to the level of internal certainty–but the decisional processes, the bias of the participant.

          Now, I’m not suggesting this is the case; it’s just something that popped into my mind.

        • I really don’t want to look deeper into this paper but didn’t they assess numeracy in some other way? So you would be proposing that people with this or that numeracy score also have this or that bias?

  3. One cause of this problem, in Geoff Cumming’s terms, is that people need to learn to think meta-analytically. I am reminded of a classic problem posed by Rosenthal and Rosnow that goes something like this: Researcher Jones does a study and rejects the null hypothesis, t(58) = 2.21, p = .03. Researcher Smith sets out to replicate this result and conducts another study. She fails to reject the null hypothesis, t(18) = 1.19, p = .25, and concludes that she “failed to replicate” Jones’s study. Which researcher is more likely to have reached the correct statistical conclusion? My students tend to have a hard time with this at first, but when I get them to think about the underlying effect sizes in these studies (they are essentially identical) it becomes easier (and always generates lots of interesting discussion). In part, this is why I think Cumming’s emphasis on estimation and confidence intervals is a much better approach than the way that statistics have traditionally been taught.
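
     One way to see the point of this example is to convert each t statistic to an effect-size correlation, r = sqrt(t^2 / (t^2 + df)); a minimal sketch in R:

     t_to_r <- function(t, df) sqrt(t^2 / (t^2 + df))

     r_jones <- t_to_r(2.21, 58)   # the "significant" original study
     r_smith <- t_to_r(1.19, 18)   # the "failed replication"
     round(c(jones = r_jones, smith = r_smith), 2)
     # Both come out around r = .27-.28: nearly identical effect sizes,
     # very different sample sizes and p-values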

      • Thanks for the link, which I will read more carefully as it seems very thoughtful and well researched, but to shoot from the hip as a blog comment here:

        With regard to apparent replication between two studies one can

        1: Compare intervals of parameter values that are compatible with the observations in each (Sander Greenland argues such intervals should be called compatibility intervals as they are actually overconfidence intervals).

        2: Compare intervals of parameter values that are most supported by the observations in each, using a specific data generating model appropriate to each study, that is, possibly differing data generating models or likelihoods for each, perhaps averaged over the same prior (as I believe for assessing apparent replication the prior should be the same – that is, background information should be taken to be common).

        3: Do both 1 and 2 and worry a lot about all the assumptions involved, especially those about what was assumed common versus different between the two studies.

      • Unfortunate for all of us as a study in isolation is not science*

        This might be due to most of us learning statistics with reference to a single study – unlike Fisher, who in his early writing often discussed issues in the context of multiple studies (even if just hypothetically).

        Also, I was very lucky in that in my first stats course, headed by Don Fraser, the sections in his text on combining systems/studies/estimates by multiplying likelihoods versus combining unbiased estimates caught my interest, mostly by confusing me beyond measure at first. (A toy illustration of that contrast follows this comment.)

        * quote from Peirce might suffice: “I [Peirce] do not call the solitary studies of a single man a science. It is only when a group of men, more or less in intercommunication, are aiding and stimulating one another by their understanding of a particular group of studies as outsiders cannot understand them, that I call their life a science.” Understanding of a particular group of studies is key – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf
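
        A toy illustration of the multiplying-likelihoods-versus-combining-estimates contrast mentioned above, in the simplest normal case and with made-up numbers:

        y  <- c(0.20, 0.15)   # hypothetical estimates from two studies
        se <- c(0.08, 0.15)   # hypothetical standard errors

        # Combining unbiased estimates by inverse-variance weighting:
        w <- 1 / se^2
        c(estimate = sum(w * y) / sum(w), se = 1 / sqrt(sum(w)))
        # Multiplying the two normal likelihoods gives a normal with this same mean
        # and sd, so here the two routes agree

        In this simple case the two routes coincide; the interesting differences, and the assumptions one has to worry about, show up once the likelihoods are not normal or the data-generating models differ across studies.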

  4. I’ve long been impressed with Kahan but the original paper he’s defending isn’t his most compelling work.

    To publish as a study an online survey of people making ~$40,000-$49,000 that, once analyzed, shows a 20% increase in numeracy but only among those in the top 90% of numerates (numeracy comes with a number from 1-10, who knew?) when the correct answer (~1/5 > ~1/6) correlates with their presumed (but unmeasured) political biases, is to invite a replication attempt. And even if N in the replication attempt was just a fraction of Kahan’s 1,111, I think it’s pretty fair evidence that the “motivated numeracy” effect is at best small-ish and variable – which, in fact, would be entirely consistent with the original findings. Recall that for most people, even when their presumably cherished beliefs were at risk, their numeracy score was a better predictor of their accuracy than their biases.
