Comments on: Failure of failure to replicate

By: Anoneuoid

Anoneuoid — Fri, 13 Apr 2018 23:55:03 +0000

In reply to Naglob the Eggplant. I really don't want to look deeper into this paper but didn't they assess numeracy in some other way? So you would be proposing that people with this or that numeracy score also have this or that bias?

By: ZC

ZC — Fri, 13 Apr 2018 20:37:51 +0000

In reply to Pointeroutguy.

Yes. Actually this is a big worry in replication: replication doesn’t ensure science when the original study AND its replication are both weak, but the word “replication” may fool people into thinking otherwise.

By: Dan C

Dan C — Thu, 12 Apr 2018 23:08:47 +0000

In reply to David. Yes, I'm well aware. But it doesn't help for the specific context of this analysis and claimed replication, which is what I was referring to (i.e., the specific contents of the Sloman piece). I certainly didn't mean to imply problems for the within subjects argument in principle.

By: dmk38

dmk38 — Thu, 12 Apr 2018 22:25:07 +0000

An informative & fair-minded assessment here.

By: David

David — Thu, 12 Apr 2018 17:24:56 +0000

In reply to Dan C. I haven't read the Sloman paper, but the claim "having more measures per subject isn't going to [help]" is incorrect. More observations per subject almost always help to increase power. (The degree to which repeated measures increase power depends on the within-individual correlation across observations.)
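
To make the dependence on within-individual correlation concrete, here is a minimal sketch in R. It assumes a simple two-group comparison of per-subject means, the same correlation rho between any two measures from the same subject, and made-up values for group size and effect; none of these numbers come from the Sloman paper, and sd_of_mean and power_for are hypothetical helper names.

```r
# Hypothetical illustration: power of a two-sample t-test on per-subject means
# when each subject contributes k equally correlated measures.
# All numbers below are assumptions, not values from any of the papers.

# SD of a subject's mean over k measures with pairwise correlation rho
sd_of_mean <- function(k, rho, sigma = 1) {
  sigma * sqrt((1 + (k - 1) * rho) / k)
}

n_per_group <- 28            # assumed subjects per group
delta <- 0.5                 # assumed true group difference (single-measure SD units)
k_values <- c(1, 2, 5, 10)   # measures per subject

# Power for each k at a given within-subject correlation
power_for <- function(rho) {
  sapply(k_values, function(k) {
    power.t.test(n = n_per_group, delta = delta,
                 sd = sd_of_mean(k, rho))$power
  })
}

power_low_rho  <- power_for(0.2)   # weakly correlated measures
power_high_rho <- power_for(0.8)   # strongly correlated measures

print(round(rbind(rho_0.2 = power_low_rho, rho_0.8 = power_high_rho), 2))
```

Under these assumptions power rises with k in both cases, but much more slowly when rho is high, because the SD of the subject mean cannot fall below sigma * sqrt(rho).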

By: Pointeroutguy

Pointeroutguy — Thu, 12 Apr 2018 16:36:26 +0000

In reply to Thanatos Savehn. +1. The "replication" attempt was weak, but so was the original study.

By: Keith O'Rourke

Keith O'Rourke — Thu, 12 Apr 2018 13:53:27 +0000

In reply to Sameera Daniels.

Unfortunate for all of us, as a study in isolation is not science.*

This might be due to most people learning statistics with reference to a single study, unlike Fisher, who in his early writing often discussed issues in the context of multiple studies (even if just hypothetically).

Also, I was very lucky in that in my first stats course headed by Don Fraser, the sections in his text on combining systems/studies/estimates by multiplying likelihoods versus combining unbiased estimates caught my interest (mostly by confusing me beyond measure at first).

* A quote from Peirce might suffice: “I [Peirce] do not call the solitary studies of a single man a science. It is only when a group of men, more or less in intercommunication, are aiding and stimulating one another by their understanding of a particular group of studies as outsiders cannot understand them, that I call their life a science.” Understanding of a particular group of studies is key: https://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

By: Naglob the Eggplant

Naglob the Eggplant — Thu, 12 Apr 2018 13:30:21 +0000

In reply to Anoneuoid.

I think you’ve misunderstood me a bit, and I do not blame you: looking back at what I wrote, it was quite messy and misleading.

Now, let us first forget about heuristics for a moment; it doesn’t really matter what sort of strategy the participants are concretely using. Whatever heuristic they’re using leads them to have some degree of certainty about which option is the correct one: if they are using the correct heuristic, they’ll end up having some positive amount of certainty, and they’ll give, on average, more than 50 percent correct answers.

This can be quantified with the following formula:

curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct")

Indeed, when the “certainty” is zero, the participant doesn’t really have any idea what to do: they are equally certain about both options and will respond randomly. If the certainty is _negative_, then they’ll be more certain about the _incorrect_ option being correct and end up answering correctly less than 50 percent of the time. The converse holds for positive values.

This holds when we assume that the participants are unbiased. Let us instead assume that the participants aren’t unbiased. This means that the participants can be biased towards selecting one of the options, even against their “internal certainty”. The next figure will plot the behaviour of biased participants:

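# Unbiased participant (black)
curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct", ylim = c(0, 1))
# Participant biased against the correct option (red)
curve(pnorm((x - 2) / sqrt(2)), -1, 1, add = TRUE,
      col = "red")
# Participant biased towards the correct option (blue)
curve(pnorm((x + 2) / sqrt(2)), -1, 1, add = TRUE,
      col = "blue")
# A fixed level of certainty, 0.7
abline(v = 0.7, lty = 2)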

In this figure the black line plots the behaviour of an unbiased participant, as before, but the red and blue lines plot the behaviour of biased participants. It is important to note that the “internal certainty” for the participants represented by the red and blue lines is the same as for the participant represented by the black line: their probabilities of responding correctly are different only due to their decisional bias.

The dashed vertical line plots a certain point on the “certainty” scale: indeed, even if the certainty stays the same, here 0.7, if the participant is biased towards one of the options, their probability of selecting the correct answer may be increased or decreased depending on the sign of the bias.
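
For concreteness, the three curves can be evaluated at the dashed line. This is a small sketch reusing the same pnorm formula and the same bias values (0, -2, +2) as the plot; the helper name p_correct is made up for illustration.

```r
# P(correct) for a given internal certainty and decisional bias,
# using the same formula as the curves above.
# p_correct is a hypothetical helper name, not from the original comment.
p_correct <- function(certainty, bias = 0) {
  pnorm((certainty + bias) / sqrt(2))
}

certainty <- 0.7
probs <- c(unbiased    = p_correct(certainty),
           bias_minus2 = p_correct(certainty, bias = -2),  # red curve
           bias_plus2  = p_correct(certainty, bias = 2))   # blue curve
print(round(probs, 3))
```

At the same certainty of 0.7, the probability of a correct answer comes out at about 0.69 without bias, about 0.18 with the negative bias and about 0.97 with the positive bias.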

In this way it is not necessarily the numeracy that is affected (which would be causally linked to the level of internal certainty) but the decisional processes, the bias of the participant.

Now, I’m not suggesting this is the case; this is just something that popped into my mind.

By: Sameera Daniels

Sameera Daniels — Thu, 12 Apr 2018 12:50:02 +0000

In reply to Jeff Valentine. Yes, learning to think meta-analytically is not easy for most.

By: Keith O'Rourke

Keith O'Rourke — Thu, 12 Apr 2018 12:00:35 +0000

In reply to Dan C.

Thanks for the link, which I will read more carefully as it seems very thoughtful and well researched, but to shoot from the hip as a blog comment here:

With regard to apparent replication between two studies one can:

1: Compare intervals of parameter values that are compatible with the observations in each (Sander Greenland argues such intervals should be called compatibility intervals, as they are actually overconfidence intervals).

2: Compare intervals of parameter values that are most supported by the observations in each, using a specific data generating model appropriate to each study (that is, possibly differing data generating models or likelihoods for each), perhaps averaged over the same prior (as I believe that for assessing apparent replication the prior should be the same; that is, background information should be taken to be common).

3: Do both 1 and 2 and worry a lot about all the assumptions involved, especially those about what was assumed common versus different between the two studies.

By: Anoneuoid

Anoneuoid — Thu, 12 Apr 2018 08:16:04 +0000

In reply to Naglob the Eggplant. Basically you are saying there could be the "numerate" and "heuristic" methods described in the paper, but you also propose an "I have no idea what I'm looking at" method. I don't see why that can't happen too, but I suspect an important part of the result is that the "low numeracy" people are doing worse than 50/50. They show a table like (testing out a new formatting strategy here):


                    Rash got worse    Rash got better
Did use cream            223                75
Didn't use cream         107                21

Most participants use a heuristic form of analysis. First, they compare the number of “successes” to the number of “failures” in the treatment group. They then compare the number of successes in the treatment group to the number of successes in the control group. If the number of successes in the treatment group exceeds both the number of failures in the treatment group and the number of successes in the control, people tend to classify the experiment as proof of the efficacy of the treatment. If not, they characterize the evidence as supporting the inference that the treatment was ineffective.


To put this information in a less confusing format, I did (using a computer, not sure if this was available to the participants):
a = 223/(223+75) ~ .75
b = 107/(107+21) ~ .84

Then I compared a > b, which is false. The heuristic is apparently to do 223 > 75 & 223 > 107, which is true. I suppose the former is the correct method and latter the incorrect method. 
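
The two comparisons can be written out in a few lines of R. This is only a sketch of the arithmetic described above, using the counts from the table; the variable names are made up.

```r
# Counts from the table: rows = cream use, columns = rash outcome
rash <- matrix(c(223, 75,
                 107, 21),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Did use cream", "Didn't use cream"),
                               c("Rash got worse", "Rash got better")))

# Proportion whose rash got worse within each row
prop_worse <- rash[, "Rash got worse"] / rowSums(rash)
a <- prop_worse[["Did use cream"]]      # 223 / 298
b <- prop_worse[["Didn't use cream"]]   # 107 / 128

# Comparison of proportions
ratio_says_worse <- a > b

# Heuristic comparison of raw counts, as described in the quoted passage
heuristic_says_worse <- (223 > 75) & (223 > 107)

print(round(c(a = a, b = b), 2))
print(c(ratio = ratio_says_worse, heuristic = heuristic_says_worse))
```

The proportion comparison comes out FALSE and the count heuristic comes out TRUE, which is the disagreement described above.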

Either way, I can't conclude whether the cream is helping or not from this info. To start with: Were the researchers blinded, how was rash got better/worse determined, what does it mean to "use" the cream, did the cream make the rash start going away but cause a breakout of zits instead so people stopped using it? 

So if they phrased the question like "does this prove the cream makes the rash get worse/better?" I would answer "no" regardless of the numbers. What exactly did they ask the subjects? I'm sure the answers to these questions would lead to more questions...

It could be that critical thinking is triggered more often when the data appears to be "identity threatening". If there is nothing threatening about the conclusion people may fall back on the "numerate heuristic" of saying "a > b, the treatment works, the end". It is interesting that the supposed "correct answer" here seems to amount to statistical significance thinking.



By: Dan C
Dan C — Wed, 11 Apr 2018 21:24:13 +0000
In reply to Jeff Valentine.
I agree in principle (i.e., thinking meta-analytically), but unfortunately traditional confidence intervals based on long run sampling are intimately connected to p-values, and don’t really mean what most researchers think they mean. See this paper by Jeff Rouder and colleagues for example: https://www.ncbi.nlm.nih.gov/pubmed/26450628



By: Naglob the Eggplant
Naglob the Eggplant — Wed, 11 Apr 2018 21:20:15 +0000
In reply to Naglob the Eggplant.

Also, something that popped into my bored and feverish brain is that any variability between subjects regarding the decisional criterion would, in this sort of analysis, lead to a (seeming) reduction of numeracy. The magnitude of this reduction is non-identifiable. So if for whatever reason the task structure or some internal processes led to more varied criteria between subjects, this would drag the numeracy scores in that group down.
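
A quick simulation can illustrate this point. It is only a sketch under assumptions that go beyond the comment: each subject's probability of a correct answer is taken to be pnorm((d + shift) / sqrt(2)), where d is the true separation ("numeracy") and shift is a subject-specific criterion shift drawn from a normal distribution with mean zero. The names d, s and shift and the chosen values are illustrative.

```r
set.seed(1)

d <- qnorm(0.8) * sqrt(2)   # true separation giving 80% correct when unbiased
s <- 1                      # assumed SD of criterion shifts between subjects
n_subjects <- 1e5

# Subject-specific criterion shifts, centred on zero (assumption)
shift <- rnorm(n_subjects, mean = 0, sd = s)

# Group-level proportion correct, averaged over subjects
p_group <- mean(pnorm((d + shift) / sqrt(2)))

# Separation one would infer if the shifts were ignored
d_apparent <- qnorm(p_group) * sqrt(2)

# Closed form for normally distributed shifts: pnorm(d / sqrt(2 + s^2))
p_closed <- pnorm(d / sqrt(2 + s^2))

print(round(c(p_unbiased = pnorm(d / sqrt(2)), p_group = p_group,
              p_closed = p_closed, d_true = d, d_apparent = d_apparent), 3))
```

With these made-up values the group proportion correct drops below 80 percent even though every subject has the same d, so the apparent separation is smaller than the true one. Without knowing s, the size of the drop cannot be recovered from the proportions alone.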


By: Naglob the Eggplant
Naglob the Eggplant — Wed, 11 Apr 2018 20:25:17 +0000
In reply to Anoneuoid.
Something I was thinking was that mayhaps the experimental treatment doesn’t affect the numeracy directly, but rather the participant’s criterion for choosing the correct answer. Now, this sort of analysis can’t be performed based on the data supplied in the article (NB: based on eyeballing it rather quickly with my slimy eyeballs and Ctrl+F-ing with my sticky fingers), but to demonstrate, here’s some R code, since who wouldn’t LOVE some R code!
par(mfrow = c(1, 2))
curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "High numeracy")
curve(dnorm(x, qnorm(0.8) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)
curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "Low numeracy")
curve(dnorm(x, qnorm(0.5) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)
(The means are based on estimating the average number of correct answers from one of the figures, I forgot which one)
Here the distributions on the left represent incorrect answers and the distributions on the right hand side represent distributions for the correct answers. From a statistical viewpoint, the subject gets a “sample” from both of the distributions, observes the difference and then responds based on an internal criterion–the dashed vertical lines.
The “decisional scale” represents the subject’s, uh, internal feel about the correctness of the answer: higher numeracy skills will result in a larger difference between the modes of the distributions, indicating clearer distinction between correct and incorrect answers.
In the figure presented here, the distributions in the low numeracy group overlap, indicating that they have no feel whatsoever about the correct answer. 
Anyway, without going into further details it should be clear that the proportion of correct answers could depend on two things: the “internal feel” (glah, why can’t I come up with a better term, damn flu) about the correct answer, i.e. what would be theoretically meaningful to call “numeracy skill” and the decisional criterion. Based on quick skimming of the paper I don’t see this possibility ruled out.



By: Thanatos Savehn
Thanatos Savehn — Wed, 11 Apr 2018 19:34:00 +0000
I’ve long been impressed with Kahan but the original paper he’s defending isn’t his most compelling work.
To publish as a study an online survey of people making ~$40,000-$49,000 that, once analyzed, shows a 20% increase in numeracy but only among those in the top 90% of numerates (numeracy comes with a number from 1-10, who knew?) when the correct answer (~1/5 > ~1/6) correlates with their presumed (but unmeasured) political biases, is to invite a replication attempt. And even if N in the replication attempt was just a fraction of Kahan’s 1,111, I think it’s pretty fair evidence that the “motivated numeracy” effect is at best small-ish and variable, which, in fact, would be entirely consistent with the original findings. Recall that for most people, even when their presumably cherished beliefs were at risk, their numeracy score was a better predictor of their accuracy than their biases.



By: Jeff Valentine
Jeff Valentine — Wed, 11 Apr 2018 19:10:09 +0000
One cause of this problem, in Geoff Cumming’s terms, is that people need to learn to think meta-analytically. I am reminded of a classic problem posed by Rosenthal and Rosnow that goes something like this: Researcher Jones does a study and rejects the null hypothesis, t(58) = 2.21, p = .03. Researcher Smith sets out to replicate this result and conducts another study. She fails to reject the null hypothesis, t(18) = 1.19, p = .25, and concludes that she “failed to replicate” Jones’s study. Which researcher is more likely to have reached the correct statistical conclusion? My students tend to have a hard time with this at first, but when I get them to think about the underlying effect sizes in these studies (they are essentially identical) it becomes easier (and always generates lots of interesting discussion). In part, this is why I think Cumming’s emphasis on estimation and confidence intervals is a much better approach than the way that statistics have traditionally been taught.
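
The effect sizes in this example can be recovered from the reported t statistics. The sketch below assumes two independent groups of equal size in each study, so that Cohen's d is approximately 2t / sqrt(df) and r = sqrt(t^2 / (t^2 + df)); the function names are made up for illustration.

```r
# Effect sizes implied by a two-sample t statistic (equal group sizes assumed)
cohens_d    <- function(t, df) 2 * t / sqrt(df)
effect_r    <- function(t, df) sqrt(t^2 / (t^2 + df))
p_two_sided <- function(t, df) 2 * pt(-abs(t), df)

# Reported test statistics from the Jones and Smith example
studies <- data.frame(study = c("Jones", "Smith"),
                      t  = c(2.21, 1.19),
                      df = c(58, 18))

studies$p <- p_two_sided(studies$t, studies$df)
studies$d <- cohens_d(studies$t, studies$df)
studies$r <- effect_r(studies$t, studies$df)

print(studies, digits = 2)
```

Under these assumptions both studies give a d of roughly 0.56 to 0.58; the p-values differ mainly because the sample sizes differ.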



By: Anoneuoid
Anoneuoid — Wed, 11 Apr 2018 15:07:26 +0000
Looking at figure 2, I don’t think they replicated the results for the “high numeracy” case. They need to figure out why the estimates were different for “skin rash” and “identity affirmed gun” conditions (I have no idea what is being measured, only glanced at the figures):

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3026941
Anyway, I think the word “replication” has become a meaningless buzzword in some corners of research so I wouldn’t worry too much about its misuse anymore. Even as part of that psych replication project there were some people that changed the methods for whatever reason…
For now, “direct replication” still has meaning. However, someday I expect you will need: “Real actual direct replication wherein we attempted to follow the previous methods as faithfully as possible”.



By: Dan C
Dan C — Wed, 11 Apr 2018 13:52:23 +0000
I would have thought that a tenured prof who studies ‘probability judgment’ (Sloman) would do better than this. To be clear, when we refer to small samples here, Sloman’s paper is N = 55 (!!!!). Having ‘more measures per subject’ isn’t going to resolve this. Yikes.



By: Sameera Daniels
Sameera Daniels — Wed, 11 Apr 2018 13:41:58 +0000
Wow, one of my favorite Twitter connections: Dan Kahan. I can’t wait to read through this subject.