To understand the replication crisis, imagine a world in which everything was published.

John Snow points me to this post by psychology researcher Lisa Feldman Barrett who reacted to the recent news on the non-replication of many psychology studies with a contrarian, upbeat take, entitled “Psychology Is Not in Crisis.”

Here’s Barrett:

An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. . . .

But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works. . . . Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.

All this is fine. Indeed, I’ve often spoken of the fractal nature of science: at any time scale, whether it be minutes or days or years, we see a mix of forward progress and sudden shocks, realizations that much of what we’ve thought was true, isn’t. Scientific discovery is indeed both wonderful and unpredictable.

But Barrett’s article disturbs me too, for two reasons. First, yes, failure to replicate is a feature, not a bug—but only if you respect that feature, if you take the failure to replicate to reassess your beliefs. But if you just complacently say it’s no big deal, then you’re not taking the opportunity to learn.

Here’s an example. The recent replication paper by Nosek et al. had many examples of published studies that did not replicate. One example was described in Benedict Carey’s recent New York Times article as follows:

Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

Carey got a quote from the author of that original study. To my disappointment, the author did not say something like, “Hey, it looks like we might’ve gone overboard on that original study, that’s fascinating to see that the replication did not come out as we would’ve thought.” Instead, here’s what we got:

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

“Theory-required adjustments,” huh? Unfortunately, just about anything can be interpreted as theory-required. Just ask Daryl Bem.

We can actually see what the theory says. Philosopher Deborah Mayo went to the trouble to look up Bressan’s original paper, which said the following:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men.

Nothing at all about Italians there! Apparently this bit of theory requirement wasn’t apparent until after the replication didn’t work.

What if the replication had resulted in statistically significant results in the same direction as expected from the earlier, published paper? Would Bressan have called up the Reproducibility Project and said, “Hey, if the results replicate under these different conditions, something must be wrong. My theory requires that the model won’t work with American college students!” I really really don’t think so. Rather, I think Bressan would call it a win.

And that’s my first problem with Barrett’s article. I feel like she’s taking a heads-I-win, tails-you-lose position. A successful replication is welcomed as a confirmation, an unsuccessful replication indicates new conditions required for the theory to hold. Nowhere does she consider the third option: that the original study was capitalizing on chance and in fact never represented any general pattern in any population. Or, to put it another way, that any true underlying effect is too small and too variable to be measured by the noisy instruments being used in some of those studies.

As the saying goes, when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

My second problem with Barrett’s article is at the technical level. She writes:

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. . . . Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true only under certain conditions [emphasis in the original].

At one level, there is nothing to disagree with here. I don’t really like the presentation of phenomena as “true” or “false”—pretty much everything we’re studying in psychology has some effect—but, in any case, all effects vary. The magnitude and even the direction of any effect will vary across people and across scenarios. So if we interpret the phrase “the phenomenon is true” in a reasonable way, then, yes, it will only be true under certain conditions—or, at the very least, vary in importance across conditions.

The problem comes when you look at specifics. Daryl Bem found some comparisons in his data which, when looked at in isolation, were statistically significant. These patterns did not show up in replication. Satoshi Kanazawa found a correlation between beauty and sex ratio in a certain dataset. When he chose a particular comparison, he found p less than .05. What do we learn from this? Do we learn that, in the general population, beautiful parents are more likely to have girls? No. The most we can learn is that the Journal of Theoretical Biology can be fooled into publishing patterns that come from noise. (His particular analysis was based on a survey of 3000 people. A quick calculation using prior information on sex ratios shows that you would need data on hundreds of thousands of people to estimate any effect of the sort that he was looking for.) And then there was the himmicanes and hurricanes study which, ridiculous as it was, falls well within the borders of much of the theorizing done in psychology research nowadays. And so on, and so on, and so on.
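
The parenthetical calculation can be sketched roughly as follows. This is only an illustration under stated assumptions: a simple comparison of the proportion of girls between two equal-sized groups of parents, and a hypothetical true difference of 0.3 percentage points. That effect size and the function names are assumptions chosen for the example, not numbers taken from Kanazawa's paper.

```python
import math

def se_diff_proportions(n_total, p=0.5):
    """Standard error of a difference in proportions between two
    equal-sized groups that together contain n_total people."""
    n_per_group = n_total / 2
    return math.sqrt(2 * p * (1 - p) / n_per_group)

def n_needed(effect, p=0.5, z_alpha=1.96, z_power=0.84):
    """Total sample size at which `effect` sits (z_alpha + z_power)
    standard errors from zero: roughly 80% power at the two-sided 5% level."""
    target_se = effect / (z_alpha + z_power)
    # Invert se = sqrt(2 * p * (1 - p) / (n / 2)) to solve for n.
    return math.ceil(4 * p * (1 - p) / target_se ** 2)

ASSUMED_EFFECT = 0.003  # hypothetical: 0.3 percentage points

se_3000 = se_diff_proportions(3000)
n_required = n_needed(ASSUMED_EFFECT)

print(f"Standard error with 3000 people: {se_3000:.4f}")
print(f"People needed for the assumed effect: {n_required:,}")
```

With 3000 people the standard error is several times larger than the assumed effect, and the required sample size comes out in the high hundreds of thousands, which is the sense in which such a study cannot estimate the effect it is looking for.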

We could let Barrett off the hook on the last quote above because she does qualify her statement with, “If the studies were well designed and executed . . .” But there’s the rub. How do we know if a study was well designed and executed? Publication in Psychological Science or PPNAS is not enough: lots and lots of poorly designed and executed studies appear in these journals. It’s almost as if the standards for publication are not just about how well designed and executed a study is, but also about how flashy are the claims, and whether there is a “p less than .05” somewhere in the paper. It’s almost as if reviewers often can’t tell whether a study is well designed and executed. Hence the demand for replication, hence the concern about unreplicated studies, or studies that for mathematical reasons are essentially dead on arrival because the noise is so much greater than the signal.
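
To see what “dead on arrival” can mean in practice, here is a minimal simulation sketch. The numbers are made up for illustration (a true effect one-fifth the size of the standard error), and the function name is hypothetical. It simulates many studies of the same small effect and then looks only at the ones that reach “p less than .05.”

```python
import random
import statistics

def significance_filter(true_effect, se, n_sims=100_000, seed=1):
    """Simulate many noisy studies of one true effect and keep only
    those with |estimate| > 1.96 * se, i.e. 'p less than .05'."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    # Share of all studies that reach significance.
    power = len(significant) / n_sims
    # Among significant results, share pointing in the wrong direction.
    wrong_sign = sum(e * true_effect < 0 for e in significant) / len(significant)
    # Among significant results, average overestimate of the effect's size.
    exaggeration = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, wrong_sign, exaggeration

# Hypothetical setting: noise five times larger than the signal.
power, wrong_sign, exaggeration = significance_filter(true_effect=0.2, se=1.0)

print(f"Share of studies reaching significance: {power:.3f}")
print(f"Share of significant results with the wrong sign: {wrong_sign:.2f}")
print(f"Average exaggeration among significant results: {exaggeration:.1f}x")
```

Under these assumptions, only a few percent of studies come out significant, a substantial fraction of those have the wrong sign, and the ones that get through overstate the effect many times over. A publication threshold applied to such studies selects for noise, not for good design.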

Imagine a world in which everything was published

A close reading of Barrett’s article reveals the centrality of the condition that studies be “well designed and executed,” and lots of work by statisticians and psychology researchers in recent years (Simonsohn, Button, Nosek, Wagenmakers, etc etc) has made it clear that current practice, centered on publication thresholds (whether it be p-value or Bayes factor or whatever), won’t do so well at filtering out the poorly designed and executed studies.

To discourage or disparage or explain away failed replications is to give a sort of “incumbency advantage” to published claims, which puts a burden on the publication process that it cannot really handle.

To better understand what’s going on here, imagine a thought experiment where everything is published, where there’s no such thing as Science or Nature or Psychological Science or JPSP or PPNAS; instead, everything’s published on Arxiv. Every experiment everyone does. And with no statistical significance threshold. In this world, nobody has ever heard of inferential statistics. All we see are data summaries, regressions, etc., but no standard errors, no posterior probabilities, no p-values.

What would we do then? Would Barrett reassure us that we shouldn’t be discouraged by failed replications, that everything already published (except, perhaps, for “a few bad eggs”) be taken as likely to be true? I assume (hope) not. The only way this sort of reasoning can work is if you believe the existing system screens out the bad papers. But the point of various high-profile failed replications (for example, in the field of embodied cognition) is that, no, the system does not work so well. This is one reason the replication movement is so valuable, and this is one reason I’m so frustrated by people who dismiss replications or who claim that replications show that “the system works.” It only works if you take the information from the failed replications (and the accompanying statistical theory, which is the sort of thing that I work on) and do something about it!

As I wrote in an earlier discussion on this topic:

Suppose we accept this principle [that published results are to be taken as true, even if they fail to be replicated in independent studies by outsiders]. How, then, do we treat an unpublished paper? Suppose someone with a Ph.D. in biology posts a paper on Arxiv (or whatever is the biology equivalent), and it can’t be replicated? Is it ok to question the original paper, to treat it as only provisional, to label it as unreplicated? That’s ok, right? I mean, you can’t just post something on the web and automatically get the benefit of the doubt that you didn’t make any mistakes. Ph.D.’s make errors all the time (just like everyone else). . . .

Now we can engage in some salami slicing. According to Bissell (as I interpret here), if you publish an article in Cell or some top journal like that, you get the benefit of the doubt and your claims get treated as correct until there are multiple costly, failed replications. But if you post a paper on your website, all you’ve done is make a claim. Now suppose you publish in a middling journal, say, the Journal of Theoretical Biology. Does that give you the benefit of the doubt? What about Nature Neuroscience? PNAS? Plos-One? I think you get my point. A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time. Sure, approval by 3 referees or 6 referees or whatever is something, but all they did is read some words and look at some pictures.

It’s a strange view of science in which a few referee reports is enough to put something into a default-believe-it mode, but a failed replication doesn’t count for anything.

I’m a statistician so I’ll conclude with a baseball analogy

Bill James once wrote with frustration about humanist-style sportswriters, the sort of guys who’d disparage his work and say they didn’t care about the numbers, that they cared about how the athlete actually played. James’s response was that if these sportswriters really wanted to talk baseball, that would be fine—but oftentimes their arguments ended up having the form: So-and-so hit .300 in Fenway Park one year, or so-and-so won 20 games once, or whatever. His point was that these humanists were actually making their arguments using statistics. They were just using statistics in an uninformed way. Hence his dictum that the alternative to good statistics is not “no statistics,” it’s “bad statistics.”

That’s how I feel about the people who deny the value of replications. They talk about science and they don’t always want to hear my statistical arguments, but then if you ask them why we “have no choice but to accept” claims about embodied cognition or whatever, it turns out that their evidence is nothing but some theory and a bunch of p-values. Theory can be valuable but it won’t convince anybody on its own; rather, theory is often a way to interpret data. So it comes down to the p-values.

Believing a theory is correct because someone reported p less than .05 in a Psychological Science paper is like believing that a player belongs in the Hall of Fame because he hit .300 once in Fenway Park.

This is not a perfect analogy. Hitting .300 anywhere is a great accomplishment, whereas “p less than .05” can easily represent nothing more than an impressive talent for self-delusion. But I’m just trying to get at the point that ultimately it is statistical summaries and statistical models that are being used to make strong (and statistically ridiculous) claims about reality, hence statistical criticisms, and external data such as come from replications, are relevant.

If, like Barrett, you want to dismiss replications and say there’s no crisis in science: Fine. But then publish everything and accept that all data are telling you something. Don’t privilege something that happens to have been published once and declare it true. If you do that, and you follow up by denying the uncertainty that is revealed by failed replications (and was earlier revealed, on the theoretical level, by this sort of statistical analysis), well, then you’re offering nothing more than complacent happy talk.

P.S. Fred Hasselman writes:

I helped analyze the replication data of the Bressan & Stranieri study.

There were two replication samples:

Original effect is a level comparison after a 2x2x2 ANOVA:
F(1, 194) = 7.16, p = .008, f = 0.19
t(49) = 2.45, p = .02, Cohen’s d = 0.37

Replication 1 in-lab with N=263, Power > 99%, Cohen’s d = .06
Replication 2 on-line with N=317, Power > 99%, Cohen’s d = .09

Initially I did not have the time to read the entire article. I recently did, because I wanted to use the study as an example in a lecture.

I completely agree with the comparisons to Bem-logic.
What I ended up doing is showing the original materials and elaborating on the theory behind the hypothesis during the lecture.

After seeing the stimuli, learning about the hypothesis, but before learning about the replication studies, there was a consensus among students (99% female) that claims like the first sentence of the abstract should disqualify the study as a serious work of science:

ABSTRACT—Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extrapair mating with the former.

Think about it.
Men of higher genetic quality are poorer partners and parents.
That’s a fact you know.
And this genetic quality of men (yes, they mean attractiveness) is why women want their babies, more so than babies from their current partner (the ugly variety of men, but very sweet and good with kids).

My brain hurts.

Thankfully the conclusion is very modest:
In humans’ evolutionary past, the switch in preference from less to more sexually accessible men associated with each ovulatory episode would have been highly adaptive. Our data are consistent with the idea that, although the length of a woman’s reproductive lifetime and the extent of the potential mating network have expanded considerably over the past 50,000 years, this unconscious strategy guides women’s mating choices still.

Erratum: We meant ‘this unconscious strategy guides Italian women’s mating choices still’.


64 thoughts on “To understand the replication crisis, imagine a world in which everything was published.”

  1. It’s not enough for all the papers to be published on arXiv. We also need to somehow break the symbiosis between authors, institutions, journals, and the media, all of whom do very nicely out of the “Gladwellization” of psychology.

    Incidentally, the German Psychological Association has done some numbers and spun the Replication Project’s numbers from 36% to 68% success, a figure that they seem quite happy with. So apparently there’s no problem anyway.

    Meanwhile, it didn’t take long for normal service to be resumed. Psychological Science has a study out today suggesting that feeling “blue” affects how you see the colour “blue”. And guess what? The main result (Study 2) depends on the difference between “significant” and “non-significant” being significant…
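
The point about the difference between “significant” and “non-significant” can be made concrete with a small sketch. The two estimates below are hypothetical and are assumed to be independent; they are not taken from the study mentioned in this comment.

```python
import math

def z_and_p(estimate, se):
    """z-score and two-sided normal p-value for an estimate."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical, independent estimates from two conditions.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a, p_a = z_and_p(est_a, se_a)
z_b, p_b = z_and_p(est_b, se_b)

# The comparison that actually matters: the difference between the two.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
z_diff, p_diff = z_and_p(diff, se_diff)

print(f"A: z = {z_a:.2f}, p = {p_a:.3f}")
print(f"B: z = {z_b:.2f}, p = {p_b:.3f}")
print(f"A - B: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

Here A clears the .05 threshold and B does not, yet the difference between them is nowhere near significant, so “A worked and B didn’t” is not evidence that A and B differ.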

  2. I’m pretty sure Barrett also drew a weird analogy between a failure to replicate and a failure to uphold theoretical predictions in empirical work. I think there’s a big difference between a theoretical result not panning out in an empirical test and an empirical result not holding up to a peer replication.

        • The key is to be very open about your expectations; this quote is also from Psychological Science’s press release:

          “We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,” says Thorstenson. “We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.”

        • That’s good, but there could be a lot of things different about the people in the two groups or watching lion king vs something else that may explain an average difference of 1/24 color identification errors… That effect size isn’t really large enough to rule out anything.

          Also, I plotted BY_ACC (blue-yellow accuracy) vs SAD_ESRI (sadness score) from Study2Data.xlsx and didn’t see a relationship I could make sense of, so I think that is even weaker than the group difference shown in the paper. Another thing is that the scores after watching lion king differed by ~20% from study 1 to study 2, so something else is going on that is ~5x stronger than the group difference.

        • I think SAD_ESRI is an independent test of how manipulable the subject is. You have to multiply by the actual manipulation. see my post below. I could be wrong of course. In any case I agree with you, there is nothing to write home about here.

        • They describe it like this in the paper:
          “the target emotion focused on in this experiment was sadness. The response scale ranged from 0 (not even the slightest bit of this emotion) to 8 (the most you have ever felt in your life).”

          Three people reported a score of 8 after the lion king scene in the study 2 data I looked at…

        • I should say that sharing the data so I could look at it myself does already make this paper superior to 99% of what gets published in medicine/biomed, imo.

        • I started thinking I should also mention that there is absolutely no problem with finding a small or no effect between groups, at all.

          There is a problem though, with concluding that sadness was the cause of the difference in color distinguishing between groups. This was done, as shown by these quotes from the paper:
          “That sadness influenced chromatic judgments about colors on the blue-yellow axis, but not those on the red-green axis, is important for two reasons.
          Our research is the first to show that sadness, a commonly experienced core emotion, has a direct negative influence on higher-order color perception.”

  3. Great (though depressing) post. You note “Nowhere does she consider the third option: that the original study was capitalizing on chance and in fact never represented any general pattern in *any* population.” I’m very often struck by this when reading terrible papers. (Equally accurate: I’m very often struck by this.) Don’t people realize that noise exists? After asking myself this a lot, I’ve concluded that the answer is no, at least at the intuitive level that is necessary to do meaningful science. This points to a failure in how we train students in the sciences. (Or at least, the not-very-quantitative sciences, which actually are quantitative, though students don’t want to hear that.)

    If I measured the angle that ten twigs on the sidewalk make with North, plotted this versus the length of the twigs, and fit a line to it, I wouldn’t get a slope of zero. This is obvious, but I increasingly suspect that it isn’t obvious to many people. What’s worse, if I have some “theory” of twig orientation versus length, and some freedom to pick how many twigs I examine, and some more freedom to prune (sorry) outliers, I’m pretty sure I can show that this slope is “significantly different” from zero. I suspect that most of the people we rail against in this blog have never done an exercise like this, and have also never done the sort of quantitative lab exercises that one does repeatedly in the “hard” sciences, and hence they never absorb an intuition for noise, sample sizes, etc. (Feel free to correct me if you disagree.) This “sense” should be a pre-requisite for adopting any statistical toolkit. If it isn’t, delusion and nonsense are the result.

    I intended to write something like this in the post on the interesting high school statistics class, since it would be great for the kids to actually go out and fit “noise,” but I didn’t get around to it.
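
The twig exercise can also be tried on a computer before going outside. The sketch below is one possible version under simplifying assumptions: twig lengths and angles are drawn independently at random (so the true slope is zero), and the only analyst freedom modeled is choosing how many twigs to examine, anywhere from 8 to 15. Outlier pruning is left out, and all names and ranges are illustrative.

```python
import math
import random

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
T_CRIT = {6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
          10: 2.228, 11: 2.201, 12: 2.179, 13: 2.160}

def slope_and_t(x, y):
    """Least-squares slope of y on x and its t statistic (df = n - 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return sxy / sxx, r * math.sqrt((n - 2) / (1 - r * r))

def simulate(n_sims=5000, seed=0):
    """Rate of 'significant' slopes in pure noise: fixed n = 10 versus
    an analyst free to stop at any n from 8 to 15."""
    rng = random.Random(seed)
    honest = flexible = 0
    for _ in range(n_sims):
        lengths = [rng.uniform(2, 30) for _ in range(15)]   # cm
        angles = [rng.uniform(0, 180) for _ in range(15)]   # degrees, unrelated to length
        _, t10 = slope_and_t(lengths[:10], angles[:10])
        honest += abs(t10) > T_CRIT[8]
        flexible += any(
            abs(slope_and_t(lengths[:n], angles[:n])[1]) > T_CRIT[n - 2]
            for n in range(8, 16)
        )
    return honest / n_sims, flexible / n_sims

honest_rate, flexible_rate = simulate()
print(f"Fixed n = 10: {honest_rate:.3f} of noise datasets look significant")
print(f"Free choice of n: {flexible_rate:.3f}")
```

The fitted slope is never exactly zero, and even this single degree of freedom pushes the rate of “significant” slopes above the nominal 5%. Adding outlier pruning or a choice of measurement conventions would be expected to push it higher still.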

    • I love the idea of the twig exercise. It would be really cool to have students do this in small groups independently of one another on different days or in different locations. There would be all kinds of measurement choices for the groups to make in addition to sampling ones. For example, the angle between which part of the twig and north? One of the ends (sometimes there are more than two) or somewhere in the middle? And similar issues come up in the length measurements. It would be interesting to see how varied the slope estimates are. And then have them try to “explain” those differences by doing something like mining weather data (e.g., wind speed and direction) for the sample sites for the days preceding sampling. They could probably come up with all sorts of plausible sounding theories to explain the differences in slope estimates…. After that would be a good time for discussion of the different choices made in measurement, sampling, and initial analysis by each group and whether any of their explanations hold up once those differences are taken into account. I’d think one could almost base a semester course on this….

      • Good idea; in addition, taking a calculus and linear algebra class followed by a graduate level course or three in a statistics department would definitely help in clearing up their muddy understanding of basic concepts in the dark art of YHST.

  4. Andrew: I responded to an anonymous comment that echoes the NYT article you cite wherein failed replications are attributed to context-dependencies of the effect.
    I hope your post indicates you’re coming around to the view that at least some of these experiments are demonstrably illicit. They are certainly not licensed by Fisherian significance tests, which require much more for causal inference. The types of experimental “treatments” could valuably be studied, some of them flat out refuted, rather than keeping to the superficial statistical level. I don’t think psychologists are prepared to question fundamental assumptions of their field, so don’t expect them to entertain your apt titles; but maybe outsiders could.

  5. I can’t tell if social-psych research is just an egregious case of ‘most people are below average at what they do’*, or whether there are structural flaws, or whether people who feel compelled to actually get a degree in social psych represent a heavy case of selection bias. Are the bulk of the people currently doing psychological research paragons of understanding human nature? Not in my experience.

    *assuming that ability is exponentially distributed

    Barrett writes:
    “Much of science still assumes that phenomena can be explained with universal laws and therefore context should not matter. But this is not how the world works. Even a simple statement like “the sky is blue” is true only at particular times of day, depending on the mix of molecules in the air as they reflect and scatter light, and on the viewer’s experience of color.”

    This makes me want to laugh tragically – it’s the least scientific thing I’ve heard all day! “The sky is blue” is a naive, categorical statement. Both the word “sky” and the word “blue” are abstractions that hide real variation in the world. The fact that this is “not how the world works” is a reflection of our minds and how we think, not some castigation of universal laws. This is an example of what I call the Mind Projection Fallacy.

    Also, when dealing with people, who cares if A is ‘true’ under ‘certain conditions’? You can fool yourself into thinking anything is “true” in the lab. Generalizable phenomena, as universal as possible, are what we should be after. Everyone is focusing on ‘testing their hypotheses’. It’s the romantic notion of science. But genius type people that inspire that romantic notion of science never have one good idea. They almost always have ways of navigating conceptual space that are more efficient. Perhaps that should be the definition of ‘genius’. And whether you take a confirmatory or dis-confirmatory approach to investigating hypotheses (a la Popper), this still represents a testing strategy confined to a small area in conceptual / hypothesis space. The whole point is to distinguish between *multiple* hypotheses.

    So, yes. The fact that this piece appeared in the NY Times signals that psychology, if not in trouble, is a very young field. Pre-Newton, if I had to put my finger on it.

    • Remember, the silly articles discussed on blogs/NYT and the like are also the product of selection bias. Careful, detailed work isn’t as exciting to “take down” or gush about to the public.

    • EW: granted that striving for universal applicability may be what we do, and accepting that results can be locally but not universally valid opens troublesome researcher degrees of freedom (Bressan certainly looks for such an opening), is it not a real problem in the social sciences that people routinely make claims of universality that their data will not support?

      For a discussion of this in psychology, see Henrich, Heine & Norenzayan, “The Weirdest people in the world”, Behavioral & Brain Sciences 2010 doi: 10.1017/S0140525X0999152X (WEIRD is their acronym for Western, Educated, Industrialized, Rich, and Democratic, characteristics of the societies from which first year psych undergrads tend to come).

      It is also a central problem anywhere randomized controlled trials are used – as in economics, public health, and many areas of policy analysis: RCTs can produce excellent, clean experimental results, but do so at the expense of generalizability. See, for instance, Nancy Cartwright, “What are randomised controlled trials good for?”, Philosophical Studies, 2010, doi 10.1007/s11098-009-9450-2; Cartwright & Munro, “The limitations of randomized controlled trials in predicting effectiveness”, J. Evaluation in Clinical Practice, 2010, doi: 10.1111/j.1365-2753.2010.01382.x.

  6. First, Lisa Feldman Barrett was trained as a clinical psychologist, not a social psychologist, and so criticisms of her training and culture might be more accurately aimed.

    Further, I think that you will find that social psychologists found some value in what she said, and cringed at other points.

    It is unassailable that virtually every effect in psychology or any human science is moderated by a large number of factors (e.g., language, culture, brain functioning, mood, threat, age, gender, and the like). But there is absolutely no excuse for a cavalier assumption that different results are due to unmeasured moderators. You better have a good theoretical account for why (and evolution DOES happen to Italians) there is a moderator, and should have some evidence that the moderator matters.

  7. With every such article of Andy’s, my belief is strengthened that an extremely useful heuristic for assessing the validity of psychological papers is “does the hypothesis sound stupid?”

      • My main problem in my own research has been that usually the hypothesis doesn’t sound stupid (given what we already know and given theory) but rather the problem is that it is hugely underspecified. Detailed commitments to what exactly is claimed are usually left open, allowing all possible counterexamples to be accommodated. People don’t say: OK, I have a vague theory, let me do a computational implementation to express it as a process model. Instead, decades of research can be based on nothing more than a sophisticated sounding hand-wave, nice words strung together to give the illusion of making sense.

        • I’d add that they leave out important side conditions, keeping them as free variables in the theory, allowing for too wide a range of possibilities. Even if they identified the side conditions and discussed predictions conditional on specific values for these, I think it would be OK.

  8. Great post. I like your style – fellow Orwell fan.

    I like this sentence : “Theory can be valuable but it won’t convince anybody on its own; rather, theory is often a way to interpret data.”
    But it should be changed to :
    “Theory can be valuable but it won’t convince anybody on its own; rather, theory is the only way to interpret data.”
    Everyone has a theory (even if often hidden or not articulated) when they interpret data. There is just no way around that.

  9. Andrew you write, “It’s almost as if the standards for publication are not just about how well designed and executed a study is, but also about how flashy are the claims, and whether there is a ‘p less than .05’ somewhere in the paper.”

    I would eliminate “almost as if” from the above sentence. And I would argue that this lousy standard isn’t a problem in just psychology, but in all the social sciences. At top-tier journals, it is common to be desk rejected because your theory isn’t flashy and your findings aren’t counterintuitive or somehow surprising. Top journals want to publish “groundbreaking” work–whatever that is–and research that presents merely incremental but solid advances doesn’t cut it. So scholars, in search of tenure and promotion, respond by doing increasingly questionable work to satisfy the absurd, and mostly impossible, standards set at the big journals. This problem isn’t going away until we completely alter the norms of publishing. I’m not going to hold my breath.

  10. I couldn’t agree more when you say
    ” ‘p less than .05’ can easily represent nothing more than an impressive talent for self-delusion”.

    I do think, though, that statisticians have to bear some of the responsibility for this sad state of affairs. The fact that the P value doesn’t tell you what you want to know barely ever features in the sort of statistics classes that have been given to non-statisticians for many decades. And the myth of P=0.05 is propagated in many papers which have professional statisticians as co-authors.

    One problem for experimenters is that if you ask statisticians about the false positive rate, they tend to immediately engage in the eternal internecine warfare between frequentists and Bayesians.

    I believe that the argument doesn’t need to be difficult or contentious. I tried to set it forth in “An investigation of the false discovery rate and the misinterpretation of p-values”
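To make the false-positive-rate question concrete, here is a minimal sketch of the standard screening-test arithmetic. It is not taken from the paper mentioned above. The function name `false_discovery_rate` and the input values (10% of tested hypotheses being real effects, 80% power, a 0.05 threshold) are illustrative assumptions, and "significant" here means any p-value at or below the threshold.

```python
def false_discovery_rate(prior_real, power, alpha):
    """Fraction of 'significant' results that are false positives.

    prior_real: assumed fraction of tested hypotheses with a real effect
    power:      assumed probability of detecting a real effect
    alpha:      significance threshold (false positive rate under the null)
    """
    false_positives = alpha * (1.0 - prior_real)  # true nulls that pass the threshold
    true_positives = power * prior_real           # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


# Illustrative inputs only; none of these numbers come from the thread.
fdr = false_discovery_rate(prior_real=0.1, power=0.8, alpha=0.05)
print(f"{fdr:.2f}")  # prints 0.36
```

Under these assumed inputs, roughly a third of the results that clear the 0.05 threshold would be false positives, which is far more than the 5% that the threshold is often taken to imply. Different assumptions about the prior or the power change the number substantially.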

    • David:

      Yes, I’ve talked about this:

      When we as statisticians see researchers making strong conclusions based on analyses affected by selection bias, multiple comparisons, and other well-known threats to statistical validity, our first inclination might be to throw up our hands and feel we have not been good teachers, that we have not done a good enough job conveying our key principles to the scientific community.

      But maybe we should consider another, less comforting possibility, which is that our fundamental values have been conveyed all too well and the message we have been sending—all too successfully—is that statistics is a form of modern alchemy, transforming the uncertainty and variation of the laboratory and field measurements into clean scientific conclusions that can be taken as truth.

      That quote is from this article with Eric Loken.

      • Thanks very much indeed.

        I’d be enormously grateful if you could give me your opinion on my approach to the problem. The outcome is much the same as that of James Berger and of Valen Johnson, but the argument is much simpler.

        As far as I can tell, the only important assumption is that it’s appropriate to use a point null. Certainly all elementary teaching refers to a point null. Furthermore, wearing my experimental hat, it is what I want to test. If I can convince myself that an effect is unlikely to be exactly zero, then I go on to estimate the effect size and judge whether it is big enough to matter in practice.

        • >”If I can convince myself that an effect is unlikely to be exactly zero, then I go on to estimate the effect size and judge whether it is big enough to matter in practice.”

          You need to estimate the effect size to get the p-value though. You have already done that and do not need to “go on” to do it.

        • “If I can convince myself that an effect is unlikely to be exactly zero,”

          Do you know of any physical action you can do on human beings that has exactly zero effect?

        • What disaster would befall if you skipped the “convince myself that an effect is unlikely to be exactly zero” and went straight to estimating its size with a range of values consistent with the evidence, one of which might be zero?

        • I’m also a bit curious to know how a p-value tells you whether “an effect is unlikely to be exactly zero”.

          Let’s say the experimental z value for iid normals is 1.9 instead of 2 (you could use other values, but these are common and the point is the same no matter what). Are you saying that if 1.9 occurs, then for a typical test this implies “the effect is likely to be exactly zero”, but if 2 occurs then “the effect is unlikely to be zero”?
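For reference, the two z values in this comment sit on opposite sides of the conventional 0.05 line but have nearly the same two-sided p-value. A minimal sketch, assuming a two-sided test against a standard normal reference distribution (the helper name `two_sided_p` is hypothetical):

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2.0))


for z in (1.9, 2.0):
    print(f"z = {z}: p = {two_sided_p(z):.4f}")
# z = 1.9: p = 0.0574
# z = 2.0: p = 0.0455
```

The two results differ by about 0.01 in p, yet one would be labeled "not significant" and the other "significant" under a strict 0.05 cutoff.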

        • Oh yes, very easily. You give both groups the same pill.
          Or, equivalently, you give one group a dummy pill and the other group a homeopathic pill.
          The argument that the effect size can never be exactly zero has always struck me as an irrelevant quibble (and, in the examples that I gave, not true).

        • You have to look into the actual homeopathic methods, not their ridiculous theories or the lazy debunking that is usual. At each step the container is shaken vigorously, and they take either the very top layer or what remains sticking to the sides to “dilute” in the next step. This is not a random sample of the solution, so the dilution calculations used to debunk it do not apply. Instead the concentration of solute appears to asymptote and excess components of the container may end up in there as well (possibly acting as a stabilizer):

          This will not be the same as placebo (which usually means sugar pill).

          Disclaimer: I am in no way a fan or user of homeopathy, but that is what evidence is actually available.

        • Huh? You’re the one who said “If I can convince myself that an effect is unlikely to be exactly zero,” so if you already know they’re not exactly zero, then why can’t you just skip that step and move on to estimating the size?

          It’s not an irrelevant quibble, it’s the entire issue.

        • If I ignore the testing you say people wearing “experimentalist hats” want to do, and jump to estimating ranges of plausible values consistent with the evidence, will I suffer any consequences other than saving all the time and hassle of learning, developing, teaching and using Null Hypothesis tests?
