The blogger known as Neuroskeptic writes:

Can the thought of money make people more conservative?

The idea that mere reminders of money can influence people’s attitudes and behaviors is a major claim within the field of social priming – the study of how our behavior is unconsciously influenced by seemingly innocuous stimuli. However, social priming has been controversial lately with many high profile failures to replicate the reported effects.

Now, psychologists Doug Rohrer, Hal Pashler, and Christine Harris have joined the skeptical fray, in a paper soon to be published in the Journal of Experimental Psychology: General (JEPG).

Rohrer et al. report zero evidence for money-priming effects across four large experiments. They conclude that “Although replication failures should be interpreted with caution, the sheer number of so many high-powered replication failures cast doubt on the money priming effects reported . . .”

Neuroskeptic continues:

Each of the four experiments was a replication of one of the experiments in a previous study, Caruso et al. (2013). . . . However, Rohrer et al. report that they couldn’t replicate any of the four effects they looked for.

The above graph summarizes the original study and the unsuccessful replication:

In the original studies (red bars), the ‘money’ condition (bright) produced increases in the various behaviors compared to the control condition (dark). In Rohrer et al.’s replications (blue), there were no differences by condition. The error bars are smaller for the blue bars too, reflecting the replications’ higher sample sizes.

And there were several other failed replications. (See the linked post for details.)

So it all seems pretty clear. I have no reason to believe in this effect. And, to the extent it is happening, the effect could vary: it could be positive in some scenarios, negative in others, large in some places, small in others, etc. No evidence for any sort of universal effect; the explanations all devolve into contextual stories, which tell us nothing more than what we already knew, which is that lots of factors influence individual behavior and attitudes.

**Just one thing, though . . .**

Neuroskeptic concludes:

Rohrer et al. say that they don’t have any explanation for the positive findings in Caruso et al. The results are unlikely to be due to publication bias, they say. The effects are too strong, and highly unlikely to occur by chance, even taking into account that there were unpublished null results too.

Rohrer et al. also reject the idea that methodological differences could account for the failures to replicate. But this leaves us with the question of what is going on here. Hmm.

Can’t it just be the garden of forking paths? Lots of choices in data processing and coding, multiple outcomes, various ways that an analysis can lead to a “p less than 0.05” comparison.

As I’ve said before, I worry that concerns about the “file-drawer effect” (unpublished studies) and “fishing” or “p-hacking” (intentional searching for statistical significance) miss the elephant in the room, which is the garden of forking paths—that is, data-processing and analysis choices that are contingent on data, hence making the statement “p less than 0.05” essentially meaningless.

This is not at all to say that all or even most studies reporting significant effects are mistaken. Rather, what I’m saying is that if a study reports statistical significance, and if that study contradicts theory and the rest of the literature (in this case, unsuccessful replications), then it’s not such a mystery that statistical significance was attained.

No need to think of this as a loose end that needs to be followed up.

Rather, it’s standard operating procedure: if there’s nothing going on (or, more precisely, a highly variable and context-dependent effect) that’s being studied by researchers using standard statistical methods with a strong motivation to find statistically significant p-values, then it’s no surprise at all that such p-values were found. It tells us pretty much nothing at all, especially in the context of a bunch of unsuccessful replications. So no puzzle, no “Hmm” required, I think.
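To make the point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not anything taken from Caruso et al. or Rohrer et al.: there is no true effect, each study measures four independent outcomes, the noise is Gaussian with known standard deviation, and the analyst reports whichever outcome looks most striking in the data at hand. The function names `two_sided_p` and `simulate` are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means, with known sigma."""
    n_a, n_b = len(group_a), len(group_b)
    se = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_sims=4000, n_per_group=20, n_outcomes=4, alpha=0.05, seed=1):
    """Return (rate_fixed, rate_forking) of 'significant' results under the null.

    rate_fixed:   the outcome to test is chosen before seeing the data.
    rate_forking: the outcome is chosen after seeing the data (illustrative
                  assumption: the analyst picks the most striking one, which
                  here is the one with the smallest p-value).
    """
    rng = random.Random(seed)
    hits_fixed = 0
    hits_forking = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # No true effect: both groups come from the same distribution.
            treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            p_values.append(two_sided_p(treated, control))
        if p_values[0] < alpha:      # pre-specified outcome
            hits_fixed += 1
        if min(p_values) < alpha:    # data-dependent choice of outcome
            hits_forking += 1
    return hits_fixed / n_sims, hits_forking / n_sims


if __name__ == "__main__":
    fixed, forking = simulate()
    print(f"pre-specified outcome:  {fixed:.3f}")
    print(f"data-dependent outcome: {forking:.3f}")
```

Under these assumptions the pre-specified analysis should come out "significant" about 5% of the time, while the data-dependent one should do so at a rate near 1 - 0.95^4, or about 19%, even though only one comparison is ever reported. Real forking paths involve many more kinds of choices (exclusions, covariates, coding, subgroups), so this sketch understates the problem if anything.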

Perhaps a simple way to regard these kinds of results is to reflect that, given a garden of forking paths, the results do not truly follow a Gaussian distribution. In fact, we can’t really know what the distribution is. But we do know one thing, which is that Chebyshev’s inequality will hold. Knowing only this, and for largish sample sizes (the limits become even broader for small samples), the 95% confidence interval widens to about 4.5 (sample) standard deviations on either side of the estimate, compared with about 2 under a Gaussian.
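The 4.5 figure can be checked with a few lines of arithmetic. Chebyshev's inequality says that at most 1/k^2 of any distribution with finite variance lies more than k standard deviations from its mean, so setting 1/k^2 = 0.05 gives k = sqrt(20). The sketch below uses the population form of the inequality, which is an assumption on my part: it matches the "largish sample" case only, and the finite-sample version gives wider limits. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chebyshev_multiplier(coverage=0.95):
    """Smallest k such that Chebyshev guarantees `coverage` within k SDs.

    Chebyshev: P(|X - mu| >= k * sigma) <= 1 / k**2, so solve 1 / k**2 = alpha.
    """
    alpha = 1.0 - coverage
    return 1.0 / math.sqrt(alpha)


def gaussian_multiplier(coverage=0.95):
    """Two-sided multiplier if the estimate really is normally distributed."""
    alpha = 1.0 - coverage
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    print(f"Chebyshev: +/- {chebyshev_multiplier():.2f} SD")
    print(f"Gaussian:  +/- {gaussian_multiplier():.2f} SD")
```

The Chebyshev multiplier comes out near 4.47 and the Gaussian one near 1.96, so giving up the normality assumption makes the interval a bit more than twice as wide.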

If this criterion were to be applied, for example, to the results shown at the top of this post, there would be no support for claiming an effect. If the details of the experiment/processing/forking paths gave one confidence that the results were distributed normally, then one could narrow the confidence band. One way to get such confidence would be for someone else to repeat the experiment and process the data only in the same way as in the published result.

Cherry picking, data snooping, peeking, tuning on the signal, use constructing, looking for the pony, data dredging, post-data subgroups, look elsewhere effects, optional stopping, monster barring, ad hoc saves, multiple testing, Texas sharp-shooting, non-novel data, significance seeking, verification bias, post-specification, outcome switching and many other terms besides, are all variants of a cluster of different gambits that alter the error probing capacities of tests. An umbrella term might be biasing selection effects (or, if you prefer, garden of forking paths). So it isn’t that any one term is missing what really matters. These gambits have been discussed by scientists and philosophers for centuries. I don’t object to new terms (like p-hacking), but it’s a very ancient (and extremely important) issue in inductive-statistical inference. The most important (and difficult) question, one I’ve tried to answer for quite a while, is when data- (or hypothesis-) dependent selections vitiate inferences, and when they do not. And why.

Why is the prior tweaking aspect of Bayesian modeling not a variant of data dependent selection? Or is it?

I may be wrong, but over the past few years I’ve been getting the impression that the garden of forking paths, fishing, p-hacking, and the file drawer effect are all being rolled together under the label “publication bias”. I agree that this is imprecise, but it definitely appears to be happening. Certainly, some attempts to assess “publication bias” do not really address the file-drawer effect of non-significant/non-exciting findings but are almost certainly skewed by all of these practices.

Publication bias only refers to a practice of not publishing non-significant results, shelving them in file drawers. I wouldn’t call it a biasing selection effect: It doesn’t alter the warranted inference, only whether it sees the light of day.

It might or might not be, depending on how the tweaking is done. What is not often said is that any probabilities obtained by applying a prior to a result make the final reported probability a joint probability that *both* the prior *and* the test result would (or would not) happen by chance. But how could we possibly know the “probability” of the prior? Maybe we can’t even know what that would mean.

And if the choice of prior has been affected by looking at the results, then the two are correlated. And we know that correlations reduce apparent variance. After all, that is why convolution smoothing techniques exist – to smooth by reducing variance through correlation.