I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.
In practice, I think we use confidence intervals and hypothesis tests as a way to avoid acknowledging uncertainty. We set up some rules and then act as if we know what is real and what is not. Even in my own applied work, I’ve often enough presented 95% intervals and gone on from there. But maybe that’s just not right.
I was thinking about this after receiving the following email from a psychology student:
I [the student] am trying to conceptualize the lessons in your paper with Stern about comparing treatment effects across studies. When trying to understand whether a certain intervention works, we must look at what the literature says. However, this can be complicated if the literature has divergent results. There are four situations I am thinking of. For each of these situations, assume the studies are randomized controlled designs with the same treatment and outcome measures, and each situation refers to a different treatment. It is easiest for me to put it into a table. In each of these situations, only 1 of 2 published studies is found to be statistically significant.
Situation   Sig study   Sig in diff?   Conclusion
1           Study A     No             Treatment is effective
2           Study C     Yes            Unclear, needs more replications
3           Study E     —              Unclear, needs more replications
4           Study G     No             Null/needs more replications
Here, Situation 1 refers to 2 studies that have similar effects in magnitude, though the larger of the 2 studies (smaller se) is the only sig one. Since the difference between the two effects is itself not statistically significant, we should conclude the treatment in situation 1 is effective (this seems to be in line with your paper).
In situation 2, there are 2 equally sized experiments that differ in treatment effect and significance. Since the difference between the estimates is statistically significant, one concludes that the paradigm needs more replications.
In situation 3, the 2 studies have 2 effects: one is statistically significant while the other is not. However, in this situation study F is neither statistically nor substantively significant. Unlike situation 1, it would seem unwise to conclude that the treatment in situation 3 is effective, and we need more replications.
Situation 4 is just some result I came across in a research synthesis, where a smaller study (larger se) had a statistically sig effect, but a larger one did not. It would seem in this situation that the true effect is null and the stat sig effect is a type 1 error. However, the difference between studies is not stat sig; would this matter?
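To make the student's four scenarios concrete, here is a quick sketch in Python. All the estimates and standard errors are made up to match the descriptions above; the point is just the arithmetic of comparing each study's z-statistic with the z-statistic for the difference between studies:

```
import math

# Made-up numbers for illustration only; nothing here comes from the
# student's actual literature search.

def z_stat(est, se):
    """z-statistic for a single estimate; |z| > 1.96 is 'significant'."""
    return est / se

def z_diff(est1, se1, est2, se2):
    """z-statistic for the difference between two independent estimates."""
    return (est1 - est2) / math.sqrt(se1**2 + se2**2)

# (description, (estimate, se) for the sig study, (estimate, se) for the other)
situations = {
    1: ("similar effects, only the larger study is sig", (5.0, 2.0), (5.0, 3.5)),
    2: ("equally sized studies, effects differ",         (6.0, 2.0), (0.0, 2.0)),
    3: ("study F neither stat nor substantively sig",    (5.0, 2.0), (0.3, 3.0)),
    4: ("smaller study sig, larger study not",           (7.0, 3.0), (1.5, 1.5)),
}

for k, (label, (e1, s1), (e2, s2)) in situations.items():
    print(f"Situation {k} ({label}): "
          f"z1 = {z_stat(e1, s1):.2f}, z2 = {z_stat(e2, s2):.2f}, "
          f"z for difference = {z_diff(e1, s1, e2, s2):.2f}")
```

Run it and you see the point of our paper with Stern directly: in situations 1 and 4 the two studies "disagree" about significance, yet the difference between them is not itself statistically significant, so the studies are not actually in conflict.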
I replied that my quick reaction is that it would be better if there were data from more studies. With only two studies, your inference will necessarily depend on your prior information about effectiveness and variation of the treatments.
The student then wrote:
That is my reaction as well. Unfortunately, sometimes the only data we have come from a small number of studies, not necessarily enough to run a meta-analysis on. In addition, the hypothetical situations I sent you are sometimes all we know about the effectiveness and variation in treatments, because they are all the evidence we have. What I am trying to better understand is whether your paper is addressing situation 1 ONLY, or whether it is making inferences or statements about the evidence in the other situations I presented.
To which I replied that I don’t know that our paper gives any real recommendations. In a decision problem, I think ultimately it’s necessary to bite the bullet and decide what prior information you have on effectiveness rather than relying on statistical significance.
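To show what that bullet-biting can look like mechanically, here is a minimal sketch, again in Python and again with invented numbers: conjugate normal updating, assuming the two studies estimate a common effect with known standard errors, and a normal prior whose scale stands in for real substantive knowledge about plausible effect sizes.

```
import math

def posterior(prior_mean, prior_sd, estimates, ses):
    """Conjugate normal updating: precisions add, means are precision-weighted."""
    precision = 1 / prior_sd**2
    weighted_sum = prior_mean / prior_sd**2
    for y, s in zip(estimates, ses):
        precision += 1 / s**2
        weighted_sum += y / s**2
    return weighted_sum / precision, math.sqrt(1 / precision)

# The prior N(0, 2^2) is a placeholder for genuine prior information,
# not a recommendation; the study numbers are the made-up situation-1
# values from the sketch above.
mean, sd = posterior(0.0, 2.0, estimates=[5.0, 5.0], ses=[2.0, 3.5])
print(f"posterior for the effect: {mean:.2f} +/- {sd:.2f}")
```

The skeptical prior pulls the combined estimate toward zero, and how far it gets pulled depends entirely on the prior scale. That dependence is exactly the judgment that significance-based rules let researchers avoid making explicit.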
This is a problem under classical or Bayesian methods. Either way, it’s standard practice to summarize uncertainty in a way that encourages deterministic thinking.