I really don’t want to look deeper into this paper but didn’t they assess numeracy in some other way? So you would be proposing that people with this or that numeracy score also have this or that bias?

Yes. Actually this is a big worry in replication: replication doesn’t ensure science when the original study AND its replication are both weak, but the word “replication” may fool people otherwise.

Yes, I’m well aware. But it doesn’t help for the specific context of this analysis and claimed replication, which is what I was referring to (i.e., the specific contents of the Sloman piece). I certainly didn’t mean to imply problems for the within subjects argument in principle.

I haven’t read the Sloman paper, but the claim “having more measures per subject isn’t going to [help]” is incorrect. More observations per subject almost always help to increase power. (The degree to which repeated measures increase power depends on the within-individual correlation across observations.)
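As a rough sketch of that last point (my own illustration, not from the Sloman paper): under an equicorrelated model with unit variance per observation, the variance of a subject’s mean over k trials is (1 + (k − 1) · rho) / k, where rho is the within-subject correlation, so the gain from repeated measures shrinks as rho grows:

```r
# Variance of a subject's mean of k equicorrelated observations,
# each with unit variance and pairwise correlation rho
var_subject_mean <- function(k, rho) (1 + (k - 1) * rho) / k

var_subject_mean(1, 0.5)   # 1.00: a single trial, no gain
var_subject_mean(10, 0.0)  # 0.10: independent trials, the full 1/k gain
var_subject_mean(10, 0.5)  # 0.55: correlated trials, only a partial gain
var_subject_mean(10, 1.0)  # 1.00: perfectly correlated trials, no gain at all
```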

+1. The “replication” attempt was weak, but so was the original study.

Unfortunate for all of us, as a study in isolation is not science*

This might be due to most people learning statistics with reference to a single study – unlike Fisher, who in his early writing often discussed issues in the context of multiple studies (even if just hypothetically).

Also, I was very lucky in that in my first stats course headed by Don Fraser, the sections in his text on combining systems/studies/estimates by multiplying likelihoods versus combining unbiased estimates caught my interest (mostly by confusing me beyond measure at first).

* A quote from Peirce might suffice: “I [Peirce] do not call the solitary studies of a single man a science. It is only when a group of men, more or less in intercommunication, are aiding and stimulating one another by their understanding of a particular group of studies as outsiders cannot understand them, that I call their life a science.” Understanding of a particular group of studies is key – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

I think you’ve misunderstood me a bit – and I do not blame you: looking back at what I wrote, it was quite messy and misleading.

Now, let us forget about heuristics for a moment: it doesn’t really matter what sort of strategy the participants are concretely using. Whatever heuristic they use leads them to some degree of certainty about which option is the correct one: if they are using the correct heuristic, they’ll end up with some positive amount of certainty and give – on average – more than 50 percent correct answers.

This can be illustrated with the following R code:

curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct")

Indeed, when the “certainty” is zero, the participant doesn’t really have any idea what to do: they are equally certain about both options and will respond randomly. If the certainty is _negative_, they’ll be more certain about the _incorrect_ option being correct and will end up answering correctly less than 50 percent of the time. Conversely for positive values.

This holds when we assume that the participants are unbiased. Let us instead assume that they aren’t: a participant may be biased towards selecting one of the options, even against their “internal certainty”. The next figure plots the behaviour of biased participants:

curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct", ylim = c(0, 1))
curve(pnorm((x - 2) / sqrt(2)), -1, 1, add = TRUE, col = "red")
curve(pnorm((x + 2) / sqrt(2)), -1, 1, add = TRUE, col = "blue")

abline(v = 0.7, lty = 2)

In this figure the black line plots the behaviour of an unbiased participant, as before, while the red and blue lines plot the behaviour of biased participants. It is important to note that the “internal certainty” of the participants represented by the red and blue lines is the same as that of the participant represented by the black line: their probabilities of responding correctly differ only because of their decisional bias.

The dashed vertical line marks a point on the “certainty” scale, here 0.7: even when the certainty stays the same, a participant biased towards one of the options will have a higher or lower probability of selecting the correct answer, depending on the sign of the bias.
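Plugging in the numbers (the certainty of 0.7 from the dashed line and the ±2 bias used in the curves above):

```r
certainty <- 0.7
pnorm(certainty / sqrt(2))        # unbiased: ~0.69
pnorm((certainty - 2) / sqrt(2))  # biased against the correct option: ~0.18
pnorm((certainty + 2) / sqrt(2))  # biased toward the correct option: ~0.97
```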

In this way it is not necessarily the numeracy that is affected–which would be causally linked to the level of internal certainty–but the decisional processes, the bias of the participant.

Now, I’m not suggesting this being the case; this is just something that popped into my mind.

Yes, learning to think meta-analytically is not easy for most.

Thanks for the link, which I will read more carefully as it seems very thoughtful and well researched, but to shoot from the hip as a blog comment here:

With regard to apparent replication between two studies, one can:

1: Compare intervals of parameter values that are compatible with the observations in each (Sander Greenland argues such intervals should be called compatibility intervals, as they are actually overconfidence intervals).

2: Compare intervals of parameter values that are most supported by the observations in each, using a specific data generating model appropriate to each study – that is, possibly differing data generating models or likelihoods for each – perhaps averaged over the same prior (as I believe that for assessing apparent replication the prior should be the same; that is, background information should be taken to be common).

3: Do both 1 and 2 and worry a lot about all the assumptions involved, especially those about what was assumed common versus different between the two studies.
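As a minimal sketch of option 1 (with made-up estimates and standard errors, not from any actual study):

```r
# 95% compatibility (confidence) interval from an estimate and standard error
ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  est + c(-1, 1) * z * se
}

study1 <- ci(0.40, 0.15)  # hypothetical original study:  ( 0.11, 0.69)
study2 <- ci(0.10, 0.20)  # hypothetical replication:     (-0.29, 0.49)

# Do the two compatibility intervals overlap at all?
max(study1[1], study2[1]) <= min(study1[2], study2[2])
```

Whether such overlap counts as “replication” is of course exactly the kind of assumption option 3 says to worry about.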

Basically you are saying there could be the “numerate” and “heuristic” methods described in the paper, but you also propose an “I have no idea what I’m looking at” method. I don’t see why that can’t happen too, but I suspect an important part of the result is that the “low numeracy” people are doing worse than 50/50.

They show a table like (testing out a new formatting strategy here):

                   Rash got worse   Rash got better
Did use cream           223               75
Didn't use cream        107               21

Most participants use a heuristic form of analysis. First, they compare the number of “successes” to the number of “failures” in the treatment group. They then compare the number of successes in the treatment group to the number of successes in the control group. If the number of successes in the treatment group exceeds both the number of failures in the treatment group and the number of successes in the control, people tend to classify the experiment as proof of the efficacy of the treatment. If not, they characterize the evidence as supporting the inference that the treatment was ineffective.

To put this information in a less confusing format, I did (using a computer, not sure if this was available to the participants):

a = 223/(223+75) ~ .75

b = 107/(107+21) ~ .84

Then I compared a > b, which is false. The heuristic is apparently to do 223 > 75 & 223 > 107, which is true. I suppose the former is the correct method and the latter the incorrect one.
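The two methods, written out (variable names are mine; the counts come from the table above):

```r
# Counts from the rash/cream table
worse_cream <- 223; better_cream <- 75
worse_none  <- 107; better_none  <- 21

# "Correct" method: compare the proportions that got worse in each group
a <- worse_cream / (worse_cream + better_cream)  # ~0.75
b <- worse_none / (worse_none + better_none)     # ~0.84
a > b  # FALSE: a smaller share of cream users got worse

# Heuristic: compare raw counts, ignoring group sizes
worse_cream > better_cream & worse_cream > worse_none  # TRUE
```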

Either way, I can't conclude whether the cream is helping or not from this info. To start with: Were the researchers blinded, how was rash got better/worse determined, what does it mean to "use" the cream, did the cream make the rash start going away but cause a breakout of zits instead so people stopped using it?

So if they phrased the question like "does this prove the cream makes the rash get worse/better?" I would answer "no" regardless of the numbers. What exactly did they ask the subjects? I'm sure the answers to these questions would lead to more questions...

It could be that critical thinking is triggered more often when the data appears to be "identity threatening". If there is nothing threatening about the conclusion people may fall back on the "numerate heuristic" of saying "a > b, the treatment works, the end". It is interesting that the supposed "correct answer" here seems to amount to statistical significance thinking.

I agree in principle (i.e., thinking meta-analytically), but unfortunately traditional confidence intervals based on long-run sampling are intimately connected to p-values, and don’t really mean what most researchers think they mean. See this paper by Jeff Rouder and colleagues for example: https://www.ncbi.nlm.nih.gov/pubmed/26450628

Also, something that popped into my bored and feverish brain: any variability between subjects in the decisional criterion would, in this sort of analysis, lead to a (seeming) reduction of numeracy. The magnitude of this reduction is non-identifiable. So if, for whatever reason, the task structure or some internal process led to more varied criteria between subjects, this would drag the numeracy scores in that group down.
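A quick simulation of that point (my own sketch, using the same pnorm((certainty + bias) / sqrt(2)) response model as above): give every subject the same internal certainty but a noisy, zero-mean criterion, and the group’s average accuracy drops below what the certainty alone would produce.

```r
set.seed(1)
certainty <- 1                        # identical "internal certainty" for all
bias <- rnorm(1e5, mean = 0, sd = 2)  # subject-level criterion noise, mean zero

mean(pnorm((certainty + bias) / sqrt(2)))  # ~0.66: with criterion variability
pnorm(certainty / sqrt(2))                 # ~0.76: without it
```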

Something I was thinking was that mayhaps the experimental treatment doesn’t affect the numeracy directly, but rather the participant’s criterion for choosing the correct answer. Now, this sort of analysis can’t be performed based on the data supplied in the article (NB: based on eyeballing it rather quickly with my slimy eyeballs and ctrl+F:ing with my sticky fingers), but to demonstrate, here’s some R code – since who wouldn’t LOVE some R code!

par(mfrow = c(1, 2))

curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "High numeracy")
curve(dnorm(x, qnorm(0.8) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)

curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "Low numeracy")
curve(dnorm(x, qnorm(0.5) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)

(The means are based on estimating the average proportion of correct answers from one of the figures; I forgot which one.)

Here the distributions on the left represent incorrect answers and the distributions on the right hand side represent distributions for the correct answers. From a statistical viewpoint, the subject gets a “sample” from both of the distributions, observes the difference and then responds based on an internal criterion–the dashed vertical lines.

The “decisional scale” represents the subject’s, uh, internal feel about the correctness of the answer: higher numeracy skills will result in a larger difference between the modes of the distributions, indicating clearer distinction between correct and incorrect answers.

In the figure presented here, the distributions in the low numeracy group overlap completely, indicating that this group has no feel whatsoever for which answer is correct.

Anyway, without going into further details, it should be clear that the proportion of correct answers could depend on two things: the “internal feel” (glah, why can’t I come up with a better term, damn flu) about the correct answer – what would be theoretically meaningful to call “numeracy skill” – and the decisional criterion. Based on a quick skim of the paper, I don’t see this possibility ruled out.
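For concreteness, the means in the plotting code above come from inverting the unbiased response model: if a group answers correctly with probability p, the implied separation between the incorrect and correct distributions is qnorm(p) * sqrt(2). (The 0.8 and 0.5 are my reading of the figures, as noted.)

```r
# Separation ("numeracy") implied by a proportion correct, assuming an
# unbiased criterion in the equal-variance model
dprime <- function(p_correct) qnorm(p_correct) * sqrt(2)

dprime(0.8)  # ~1.19: the "high numeracy" mean used above
dprime(0.5)  #  0.00: the "low numeracy" group, fully overlapping distributions
```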

To publish as a study an online survey of people making ~$40,000–$49,000 that, once analyzed, shows a 20% increase in numeracy but only among those in the top 90% of numerates (numeracy comes with a number from 1–10, who knew?) when the correct answer (~1/5 > ~1/6) correlates with their presumed (but unmeasured) political biases, is to invite a replication attempt. And even if N in the replication attempt was just a fraction of Kahan’s 1,111, I think it’s pretty fair evidence that the “motivated numeracy” effect is at best small-ish and variable – which, in fact, would be entirely consistent with the original findings. Recall that for most people, even when their presumably cherished beliefs were at risk, their numeracy score was a better predictor of their accuracy than their biases.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3026941

Anyway, I think the word “replication” has become a meaningless buzzword in some corners of research so I wouldn’t worry too much about its misuse anymore. Even as part of that psych *replication project* there were some people that changed the methods for whatever reason…

For now, “direct replication” still has meaning. However, someday I expect you will need: “Real actual direct replication wherein we attempted to follow the previous methods as faithfully as possible”.
