I respond to E. J.’s response to our response to his comment on our paper responding to his paper

In response to my response and X’s response to his comment on our paper responding to his paper, E. J. writes:

Empirical claims often concern the presence of a phenomenon. In such situations, any reasonable skeptic will remain unconvinced when the data fail to discredit the point-null. . . . When your goal is to convince a skeptic, you cannot ignore the point-null, as the point-null is a statistical representation of the skeptic’s opinion. Refusing to discredit the point-null means refusing to take seriously the opinion of a skeptic. In academia, this will not fly.

I don’t know why E. J. is so sure about what will or will not fly in academia, given that I’ve published a few zillion applied papers in academic journals while only very occasionally doing significance tests.

But, setting aside claims about things not flying, I agree with the general point that hypothesis tests can be valuable at times. See, for example, page 70 of this paper. Indeed, in our paper, we wrote, “We have no desire to ‘ban’ p-values. . . . in practice, the p-value can be demoted from its threshold screening role and instead be considered as just one among many pieces of evidence.” I think E. J.’s principle about respecting skeptics is consistent with what we wrote, that p-values can be part of a statistical analysis.

P.S. Also E. J. promises to blog on chess. Cool. We need more statistician chessbloggers. Maybe Chrissy will start a blog too. After all, there’s lots of great material he could copy. There’d be no need to plagiarize: Chrissy could just read the relevant material, not check it for accuracy, and then rewrite it in his own words, slap his name on it, and be careful not to give credit to the people who went to the trouble to compile the material themselves.

P.P.S. I happened to have just come across this relevant passage from Regression and Other Stories:

We have essentially no interest in using hypothesis tests for regression because we almost never encounter problems where it would make sense to think of coefficients as being exactly zero. Thus, rejection of null hypotheses is irrelevant, since this just amounts to rejecting something we never took seriously in the first place. In the real world, with enough data, any hypothesis can be rejected.

That said, uncertainty in estimation is real, and we do respect the deeper issue being addressed by hypothesis testing, which is assessing when an estimate is overwhelmed by noise, so that some particular coefficient or set of coefficients could just as well be zero, as far as the data are concerned. We recommend addressing such issues by looking at standard errors as well as parameter estimates, and by using Bayesian inference when estimates are noisy, as the use of prior information should stabilize estimates and predictions.
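
A toy numerical sketch of that last recommendation (my own made-up numbers, not from the book): a noisy estimate combined with a weakly informative prior, using the usual normal approximation.

    # Toy sketch, not from the book: how a prior can stabilize a noisy coefficient
    # estimate, using the usual normal approximation. All numbers are made up.
    import numpy as np

    beta_hat, se = 0.8, 1.0          # hypothetical noisy estimate and its standard error
    prior_mean, prior_sd = 0.0, 0.5  # hypothetical weakly informative prior on the coefficient

    # Precision-weighted combination: normal likelihood x normal prior -> normal posterior
    post_var = 1.0 / (1.0 / se**2 + 1.0 / prior_sd**2)
    post_mean = post_var * (beta_hat / se**2 + prior_mean / prior_sd**2)
    post_sd = np.sqrt(post_var)

    print(f"raw estimate +/- se:   {beta_hat:.2f} +/- {se:.2f}")
    print(f"posterior mean +/- sd: {post_mean:.2f} +/- {post_sd:.2f}")
    # The raw estimate is "overwhelmed by noise" (|beta_hat| < 2*se); the posterior
    # is pulled toward zero and its sd is smaller than the raw standard error.

With a standard error that large, the raw estimate could just as well be zero; the prior keeps the reported estimate from chasing the noise.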

27 thoughts on “I respond to E. J.’s response to our response to his comment on our paper responding to his paper”

  1. What I disagree with is the idea that the “skeptic’s” stance is that the effect is zero. Why is that necessarily true? Why couldn’t a skeptic just say “I doubt the effect is consistent, with it sometimes being positive or negative, and generally not large enough to really substantiate the theory”?

    • Stephen:

      My take on this is not that the skeptic necessarily believes there is zero effect, but rather that the skeptic says that, if your experiment is so noisy that you can’t rule out a zero effect, then it would be pretty hopeless to try to learn anything conclusive.

      One point that Blake and the others and I made in our article is that there’s no reason why science has to be conclusive: It should be fine to publish interesting results that are noisy, especially when decisions have to be made. But I agree with E. J. that the p-value, or statements like the p-value, have some relevance in understanding what can be learned from data.

      • Ok, I can understand that. That’s how I describe the nil-null p-value anyway to undergraduates (“If p < somethingTiny, then at least your estimate is discernible from zero to some degree”). But then again, I’m still not sure whether a p-value is necessary for that. A posterior would likewise indicate whether the plausible values are discernible from a nil-null zero-point value, so to speak.

        That's not to say the p-value is useless though. In one paper, we've used p-values, BFs (blech), EAPs, credible intervals, p(effect is positive | y), etc. A real hodge-podge, but all in the aim of giving different inferential information.

        My problem is that, ironically, I think for the point you make, the p-value is probably more useful to me than a BF or something similar. Saying “the effect is indiscernible from the null” is arguably more useful to me than trying to accumulate evidence for a hypothesis I already know is false. Even if H0 provides better predictions than the full expanse of H1, I can already say H0 is not probable at all, and therefore a BF01 in favor of H0 is misleading to me [it would account for .000000000000000….00001 effects better than wide prior statements, but would still be wrong]. (A toy comparison of these summaries is sketched below.)
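
        A toy sketch of that comparison (made-up numbers, normal approximations throughout, and a flat prior for the posterior summary; none of this is from the papers mentioned above):

          # Toy sketch: the same hypothetical estimate summarized three ways.
          from scipy import stats

          est, se = 0.30, 0.18  # made-up estimate and its standard error

          z = est / se
          p_two_sided = 2 * (1 - stats.norm.cdf(abs(z)))  # "discernible from zero?"
          ci = (est - 1.96 * se, est + 1.96 * se)          # 95% interval
          p_positive = stats.norm.cdf(est / se)            # Pr(effect > 0 | y) under a flat prior

          print(f"p = {p_two_sided:.3f}")
          print(f"95% interval = ({ci[0]:.2f}, {ci[1]:.2f})")
          print(f"Pr(effect > 0 | y) = {p_positive:.2f}")
          # All three summaries answer roughly the same "discernible from zero?"
          # question; none of them says the point null is literally true.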

      • “there’s no reason why science has to be conclusive: It should be fine to publish interesting results that are noisy”

        This point of view doesn’t translate to practically implementable advice. You write a paper, openly talking about its noisiness and the tentativeness of the claim, and you get rejected. I don’t know the solution; I’m only pointing out that your statement above is not useful in practice. Maybe a high-profile MIT or Stanford or Columbia prof can get away with tentative conclusions (though of course they don’t; they go for full-metal-jacket media attention), but the average scientist, or the grad student new to the field, can’t. These people have to deliver big news or disappear. So they just blow up their tentative findings out of proportion.

        • Not really. I am second author on a paper where we do more or less say, at least for a few effects, “We can’t really discern from this data whether X has an effect on Y; the direction is uncertain, and the plausible estimates vary from essentially useless to extremely large”.

          Didn’t need a p-value to say that either, although the p-value was also fairly large. I think more people are opening up to transparent, honest descriptions of the estimates. One reviewer even said they admired the openness and honesty about the effects, stating that most other papers seem to try to hack their way to a value, or fabricate a story to account for it. We didn’t do either; we fit the model we wanted, and we said exactly what we should: “Can’t tell; sorry folks. But here are the estimates for the other things we were interested in”.

        • Well, sure. I have also had those positive experiences. I am talking about the most frequent response. Try submitting to the top ranked journal in your field, and report an unclear result. See if the reviewers praise your honesty.

          Incidentally, I looked at your cv and was initially amazed at your publication record. But then I realized most of the papers are not published, and some don’t even exist (in preparation). I suggest distinguishing between journal articles that are published or accepted, and drafts. :)

    • Respectfully, if this is how you review papers, it sounds like you are part of the problem. I have published papers with all non-significant effects, others with directional posterior probabilities in which all but one interval included zero, and others where sensitivity analyses showed much uncertainty. It might be how you talk about these things in the paper, that is, presenting the research without relying upon a threshold to say something meaningful.

      • I don’t know what you mean by “this” in your first sentence. Maybe you mean that I demand strong claims as a reviewer? No, I don’t. If the claim is too strong given the data, I would suggest weakening it. Researchers routinely talk about an effect being present if p is 0.049 but absent when p is 0.051. Sometimes they argue the effect is present if p is 0.051, if that’s what they want, and sometimes they argue the effect is absent if p is 0.059, if that’s what they want. The evidence, even under frequentist thinking, is just not convincing, but they will publish top journal articles (which I am not a reviewer for) making their claims sound like big news.

        I’m talking about what I’ve seen happen to graduate students, often my own graduate students. The last time my grad student sent his paper with a tentative claim to a top journal, the editor desk-rejected it because it didn’t provide closure. This is what I mean when I say Andrew’s comment (although right) is practically useless. What is a grad student going to do? Do the right thing that Andrew suggests and be open about the problems in the claim they make based on the data, and they will never get a paper in. Meanwhile these grad students are competing for jobs with other grad students who have no qualms about blowing up their claims all out of proportion.

        I’m guessing your field is not cognitive science or cognitive psychology. What field do you work in? Because I want to work in an area that does sensitivity analyses and shows directional posterior probability :). At least point me to a paper so I can enjoy the existence of authors I can relate to statistically!

  2. The discussion could be much easier if we gave up talking about “phenomena” and the metaphysics of their realness, which supposedly can be determined by one statistical test. Science is not about generalizing phenomena but about testing theories or hypotheses. If researcher A (given s/he was not cheating) has found results that are supportive while researcher B’s findings lend no support, this divergence should trigger a critical debate that may result in new studies. Again, there is no direct route from data to truth. Data are persuasive arguments, and even if the statistics (p-level, power, effect size, etc.) contribute to this debate, they are not more cogent than substantive arguments, e.g. about systematic and causally relevant differences between the studies.

      • “> Science is not about generalizing phenomena but about testing theories or hypotheses.
        Is this true?”

        I came across multiple descriptions/definitions/views of terms like “science”, “generalizing”, “phenomena”, “theories”, and “hypotheses”, which makes it hard for me to try and check whether this statement is correct.

        Most importantly, however, I have trouble understanding what Strack is actually saying or intending to convey with this sentence. What does this sentence even mean? Is Strack trying to say that a) the goal of science is not to maximize the generalizability of phenomena (which I think nobody disagrees with), or b) generalizability of phenomena is not important (or less important than hypotheses, for instance) in science (which I kind of assume he roughly means to say and I disagree with; see my question that follows), or c) …

        That leaves me with a question: How can you propose a scientific hypothesis without assuming (some) generalizability of phenomena?

        For instance, if I understood Strack et al. (1988)’s “pen-in-mouth-paper” hypothesis correctly, it reads as follows:

        https://pdfs.semanticscholar.org/b2cf/930730bae01ced73ae10d4ecac244d4d9022.pdf

        “Specifically, potentially humorous stimuli should be rated least funny when the muscles associated with smiling are inhibited (lips condition), but should be rated most funny when this muscular activity is facilitated (teeth condition). With no manipulation of relevant facial muscles (nondominant hand condition), humor ratings should not be affected.” (p. 770)

        Prior to stating this hypothesis, several findings of other studies are listed, as is usually the case with scientific papers. Now it seems to me that these previous findings, and conclusions, are in a way “generalized” to provide evidence and arguments in order to come up with the specific hypothesis of his own new study. If this makes any sense, I would argue that a very important part of science actually is about “generalizing phenomena” (whereby “generalizing phenomena” is interpreted to mean roughly something like what I wrote above under b).

      • “> Science is not about generalizing phenomena but about testing theories or hypotheses.

        Is this true?”

        The scientific method, by definition, is about coming up with hypotheses and testing them.

        • I ask not _just_ to troll but because I also wonder how much of the whole reproducibility issue is related to some sciences/scientists adopting a pretty naive view of ‘the scientific method’.

        • You are absolutely right. The current replication debate is entirely based on the implicit epistemological assumption that science consists of the generalization of experimental phenomena. If they are not “real” or not “robust”, they cannot be generalized to new contexts.

          I agree that when it comes to applied science, generalization is important and robustness plays an important role (see clinical trials). However, basic science is not about generalizing experimental phenomena. Instead, basic science is about finding the causes of phenomena on a more abstract theoretical level. The concrete derivations (aka experimental operationalisations) of more abstract theoretical concepts may be fragile and without any applied value. However, they may be diagnostic in telling apart different underlying mechanisms.

          And with all due respect, that’s where I see the pencil procedure in relation to Darwin’s theory of facial feedback.

        • ojm said, “I also wonder how much of the whole reproducibility issue is related to some sciences/scientists adopting a pretty naive view of ‘the scientific method’”

          I agree. Something I read a number of years ago has stuck in my mind ever since; it said something like, “There are as many versions of the scientific method as there are scientists.” (I haven’t been able to locate the exact reference, but I remember that it was in a short publication put out by the Howard Hughes Institute.) It may be a slight exaggeration, but it does point to the need for caution in talking about “the scientific method.”

  3. I think the statement “won’t fly” is only true in psychology, where statistics has taken the role of theory most of the time.
    In biology you can certainly publish a paper in highly respected journals that does not include any statistical tests whatsoever.

  4. E.J. is quoted as writing “When your goal is to convince a skeptic…” Perhaps one of the reasons that converting a skeptic may be impossible, whether it be about point nulls, climate change, gun control, or evolution, is that the term “theory,” at least in the English language, is what is known as a contronym. “Peruse” and “sanction” are well-known examples of English contronyms; there are many others.
    To the lay public, “theory” is a vague assertion, a belief, a speculation, a woolly conjecture; as such, one theory is no better or worse than any other theory. However, in the world of science, the highest accolade possible is to label something a theory, because among scientists a theory is something which explains diverse phenomena which have occurred and, furthermore, predicts the results of as yet unknown experiments; for example, quantum theory, the theory of relativity, and of course the contentious theory of evolution. Darwin is contentious because each side is using the same word, “theory,” but with different meanings.
    An example of the frustration of misunderstanding, even when data are forthcoming, is the notion of a “missing link” or a gap in the theory of evolution. The skeptic points to the gap in the “theory,” but if the missing link is indeed found, the skeptic replies, “aha, now there are two gaps.” We currently live in a Trumpian world where science of any kind, point nulls included, is under siege and lunatic theories are on equal footing with rational ones. And the former often have a bigger megaphone.

    https://www.mediamatters.org/blog/2017/10/06/alex-jones-week-irresponsible-las-vegas-shooting-conspiracy-theories/218163

    • Maybe, but I think it’s more likely that people take a look at the cost/benefit of conceding the truth of something… and decide it makes no sense to concede. Why would they? How is it in their interest?

      Let’s say you’re a member of an industry that benefits from releasing carbon dioxide. If you concede that releasing carbon dioxide causes global warming… you lose money, it harms you. If you don’t concede you have a better chance of maintaining your cash cow. Even if you personally are convinced by the argument, you won’t concede the science.

      Too much is made of the “anti-science” attitudes. This is pure game theory. By continuing to frame the issue in terms of whether the science is right or not, instead of who benefits and who loses… the debate never even gets close to reality, and it tends to corrupt the truth as well.

      But, people don’t like to talk about values and cost/benefit.

    • I’m pretty skeptical of a lot of things, but I believe what I ask for is pretty simple and basic.

      If you want me to believe some phenomenon exists or has a certain property, have multiple people/groups make measurements of it that are consistent with each other. The more independent these groups are, the better; best of all is if I myself can also make the observations.

      If you want me to believe your explanation for a phenomenon is correct, make precise predictions and compare them to observations collected later. It is very important that the predictions be precise enough that we can distinguish between multiple explanations, not just that x is positively/negatively correlated with y.

      I really don’t think this is anything more than being scientific. I also think that many of the “skeptics” on the topics you mention would be quite willing to believe astronomers if they claimed an asteroid was going to impact the Earth. This is simply because astronomy has such a great track record of predicting nearly exactly where stuff will be in the future, and this is easily verified by anyone.

  5. I went and read EJ’s blog entry (actually coauthored – EJ and Quentin), and noticed a nuance that’s perhaps worth noting. In the section “Choice 3: Is it useful to show a skeptic that the data discredit the point-null hypothesis?”, one of the people in the mock dialogue responds this way:

    “Rene looks at Joe and says, “Dude. Don’t you know? The point-null has gone to meet its maker. Nobody considers the point-null even approximately true anymore. We are only interested in estimating effect sizes and reporting confidence intervals, obviously taking great care never to mention whether zero is inside or outside of the interval. This is the era of the New Statistics.””

    And EJ and Quentin’s skeptic Joe argues back that he needs to see evidence that the point null is inaccurate etc. EJ and Q conclude, “Refusing to discredit the point-null means refusing to take seriously the opinion of a skeptic. In academia, this will not fly.”

    The nuance is that Rene says “obviously taking great care never to mention whether zero [the point null] is inside or outside of the interval”.

    Ignoring the point null in this way is silly (of course, it’s just a mock dialogue). And doing it this way ignores not just the point null but anything in the neighbourhood of the point null. So I agree with Joe, but only because I object to ignoring the skeptics, not because I think the skeptics need a p-value for the point null of their choice.

    One can address the skeptics (as EJ and Quentin say we should) by “estimating effect sizes and reporting confidence intervals” (Rene’s preferred methodology) … and simply also discussing what these imply for the skeptic’s favoured point null and its neighbourhood. (Of course, this is just another way of following some of the advice quoted by Andrew, namely that one should “[look] at standard errors as well as parameter estimates”.)
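
    One way to make “the point null and its neighbourhood” concrete (a toy sketch with made-up numbers; the half-width delta is arbitrary and would depend on the application, and the posterior summary assumes a flat prior with a normal approximation):

        # Toy sketch: report the interval AND what it implies for a neighbourhood of zero.
        from scipy import stats

        est, se, delta = 0.25, 0.10, 0.10  # made-up estimate, standard error, and null-neighbourhood half-width

        lo, hi = est - 1.96 * se, est + 1.96 * se
        # Posterior (flat prior, normal approx.) probability that the effect lies within delta of zero
        pr_near_zero = stats.norm.cdf(delta, est, se) - stats.norm.cdf(-delta, est, se)

        print(f"95% interval: ({lo:.2f}, {hi:.2f}); contains zero: {lo <= 0 <= hi}")
        print(f"approx. Pr(|effect| < {delta}) = {pr_near_zero:.2f}")
        # This addresses the skeptic's point null (and values near it) directly,
        # without organizing the whole analysis around rejecting it.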

  6. A likelihood-only based test (as is typical for NHST) is implicitly saying that the skeptic’s belief isn’t any more likely a priori than any other possibility — e.g., that an effect size of 0 and of 2.3 googolplexes are equally plausible a priori (spelled out in the sketch after this comment).

    If I were a skeptic, I would not be happy with my beliefs being put on the same footing as a 2.3 googolplex effect size.
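
    One way to spell that out (my own framing, treating the “likelihood-only” stance as a flat prior on a huge interval [-M, M]):

        % Under a flat prior on [-M, M], with M astronomically large, every effect
        % size has the same prior density, so "no effect" is treated as no more
        % plausible a priori than an absurdly large effect:
        \[
        p(\theta) = \frac{1}{2M} \quad \text{for } \theta \in [-M, M],
        \qquad
        p(\theta = 0) = p\!\left(\theta = 2.3 \times 10^{10^{100}}\right) = \frac{1}{2M}.
        \]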
