John Schwenkler organized a discussion on this hot topic, featuring posts by

– Dan Benjamin, Jim Berger, Magnus Johannesson, Valen Johnson, Brian Nosek, and E. J. Wagenmakers

– Felipe De Brigard

– Kenny Easwaran

– Andrew Gelman and Blake McShane

– Kiley Hamlin

– Edouard Machery

– Deborah Mayo

– “Neuroskeptic”

– Michael Strevens

– Kevin Zollman.

Many of the commenters have interesting things to say, and I recommend you read the entire discussion.

The one point that I think many of the discussants are missing, though, is the importance of design and measurement. For example, Benjamin et al. write, “Compared to using the old 0.05 threshold, maintaining the same level of statistical power requires increasing sample sizes by about 70%.” I’m not disputing the math, but I think that sort of statement paints much too optimistic a picture. Existing junk science such as himmicanes and air rage, or ovulation and voting and clothing, or the various fMRI and gay-gene studies that appear regularly in the news, will not be saved by increasing sample size by 70% or 700%. Larger sample size might enable researchers to more easily reach those otherwise elusive low p-values, but I don’t see this increasing our reproducible scientific knowledge. Along those lines, Kiley Hamlin recommends going straight to full replications, which would have the advantage of giving researchers a predictive target to aim at. I like the idea of replication, rather than p-values, being a goal. On the other hand, again, p-values are noisy, and none of this is worth anything if measurements are no good.
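For readers who want to check the "about 70%" arithmetic, here is a minimal sketch. It assumes a two-sided, two-sample z-test on a standardized mean difference and 80% power; those choices, and the helper name `n_per_group`, are illustrative assumptions, not something taken from Benjamin et al.

```python
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (hypothetical input).
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_power = z.inv_cdf(power)          # quantile for the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2


# The ratio does not depend on the effect size in this approximation.
ratio = n_per_group(0.5, alpha=0.005) / n_per_group(0.5, alpha=0.05)
print(f"sample-size ratio, 0.005 vs 0.05 threshold: {ratio:.2f}")

# Required n scales as 1 / effect_size**2, so a tenfold smaller
# standardized effect needs a hundredfold larger sample.
print(n_per_group(0.05) / n_per_group(0.5))
```

Under these assumptions the ratio comes out at roughly 1.7, consistent with the quoted figure. The second line is the relevant one for the argument here: if noisy measurement shrinks the standardized effect, the required sample grows with the inverse square, which a 70% increase does not come close to covering.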

So one thing I wish more of the discussants had talked about is that, when applied to junk science (and all of this discussion is in large part the result of the cancerous growth of junk science within the scientific enterprise), the effect of new rules on p-values etc. will be *indirect*. Requiring p less than 0.005, or requiring Bayes factors, abandoning statistical significance entirely, or anything in between: none of these policies will turn work such as power pose or beauty-and-sex-ratio or the work of the Cornell University Food and Brand Lab into reproducible science. All they will do is possibly (a) make such work harder to publish as is, and (b) as a consequence of that first point, motivate researchers to do better science, to design more targeted studies with better measurements so as to be able to succeed in the future.

It’s a good goal to aim for (a) and (b), so I’m glad of all this discussion. But I think it’s important to emphasize that all the statistical analysis and statistical rule-giving in the world can’t transform bad data into good science. So I’m a bit concerned about messages implying that, with a mere increase of sample size by a factor of 1.7 or 2, reproducibility problems will be solved. At some point, good science requires good design and measurement.

There’s an analogy to approaches to education reform that push toward high standards, toward not letting students graduate unless their test scores reach some high threshold. Ending “social promotion” from grade to grade in school might be a good idea in itself, and in the right environment it might motivate students to try harder at learning and schools to try harder at teaching—but, by themselves, standards are just an indirect tool. At some point the learning has to happen. This analogy is not perfect—for one thing, a p-value is not a measure of effect size, and null hypothesis significance testing addresses an uninteresting model of zero effect and zero systematic error, kind of like if an educational test did not even attempt to measure mastery, instead merely trying to demonstrate that the amount learned was not exactly zero—but my point in the present post is to emphasize the essentially indirect nature of any procedural solutions to research problems.

Again we can consider that hypothetical study attempting to measure the speed of light by using a kitchen scale to weigh an object before and after it is burned: it doesn’t matter what p-value is required, this experiment will never allow us to measure the speed of light. The best we can do with rules is to make it more difficult and awkward to claim that such a study can give definitive results, and thus dis-incentivize people from trying to perform, publish, and promote such work. Substitute ESP or power pose or fat arms and voting or himmicanes etc. in the above sentences and you’ll get the picture.

As Blake and I wrote in the conclusion of our contribution to the above-linked discussion:

Looking forward, we think more work is needed in designing experiments and taking measurements that are more precise and more closely tied to theory/constructs, doing within-person comparisons as much as possible, and using models that harness prior information, that feature varying treatment effects, and that are multilevel or meta-analytic in nature, and—of course—tying this to realism in experimental conditions.

See here and here for more on this topic which we are blogging to death. I appreciate the comments we’ve had here from people who disagree with me on these issues: a blog comment thread is a great place to have a discussion back and forth involving multiple viewpoints.

“But I think it’s important to emphasize that all the statistical analysis and statistical rule-giving in the world can’t transform bad data into junk science.”

“Junk science” should be “Good science”?

Fixed; thanks.

Yes, I think this is the key point. Let me run past you the way I have tried to make it in teaching:

I point out that p-values address one specific type of potential error, the error of falsely extrapolating to a population a result obtained in a sample. By comparing the magnitude of test statistics/parameter values to the amount of variation around them, we get a sense of how plausible this extrapolation is. Even as a measure of this one sort of error, however, p-values are imperfect (for reasons that have been discussed in detail on this site).

But there are other possible sources of error, often much more salient. The data could be mismeasured (or proxy measurements might not be performing their proxy function very well), or the sample could be biased, or the model could be wrong. A meaningful assessment of the plausibility of a given result has to take into account all these types of potential error, guided of course by what has been learned beyond the bounds of this one study.

The message to downgrade the role of p-values is partly an attempt to draw attention to the limitations of this one statistic in sizing up sample-to-population error, but above all the goal is to increase attention to all the other dimensions of assessment.

I realize the error framework has some issues, but it’s familiar to students and it works to get the message across, I think.

“…but above all the goal is to increase attention to all the other dimensions of assessment.”

+1

You are right, the goal is to engender attention to all the other dimensions. But I think that the extent of preparation required to posit solutions is beyond any one individual. Some of these consortia efforts look promising.

Null hypothesis testing, p values, goodness of fit, partitioning variance, etc is easy to teach, especially with GUI software; also helpful are decision trees in some texts. Experimental design, deriving model design from theory, deriving measurement from theory are very hard to teach. To my knowledge (I’m now retired) there is no easy to use software that will do this in the background. So we do what is easy and the numbers are reflected in journal reviews and publication providing the incentive to keep up the good work.

We need more and better resources to help us teach (and learn) measurement, design, etc.

Larry:

Yes, also there’s the following reasoning which I’ve not seen explicitly stated but is I think how many people think. It goes like this:

– Researcher does a study which he or she thinks is well designed.

– Researcher obtains statistical significance. (Forking paths are involved, but the researcher is not aware of this.)

– Therefore, the researcher thinks that the sample size and measurement quality was sufficient. After all, the purpose of a high sample size and good measurements is to get your standard error down. If you achieved statistical significance, the standard error was by definition low enough. Thus in retrospect the study was just fine.

So part of this is self-interest: It takes less work to do a sloppy study and it can still get published. But part of it is, I think, genuine misunderstanding, an attitude that statistical significance retroactively solves all potential problems of design and data collection.
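The reasoning above can be probed with a small simulation. This is a rough sketch under stated assumptions: the estimate is normally distributed around a small true effect with a large standard error, and the specific numbers (true effect 0.1, standard error 0.5) and the function name `significance_filter` are made up for illustration. It ignores forking paths entirely, which would only make things worse.

```python
import random
from statistics import NormalDist, mean


def significance_filter(true_effect, std_error, alpha=0.05,
                        n_sims=100_000, seed=1):
    """Simulate replications of a noisy study and look only at the
    ones that reach statistical significance.

    Returns (power, exaggeration, wrong_sign):
      power        - share of simulated studies with p < alpha
      exaggeration - mean |estimate| among significant studies,
                     divided by |true_effect|
      wrong_sign   - share of significant estimates with the wrong sign
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)
    power = len(significant) / n_sims
    exaggeration = mean(abs(e) for e in significant) / abs(true_effect)
    wrong_sign = mean(1.0 if e * true_effect < 0 else 0.0
                      for e in significant)
    return power, exaggeration, wrong_sign


power, exaggeration, wrong_sign = significance_filter(0.1, 0.5)
print(f"power: {power:.3f}")
print(f"exaggeration ratio among significant results: {exaggeration:.1f}")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

In this setup any significant estimate must exceed about 0.98 in absolute value, so every "successful" study overstates a true effect of 0.1 by nearly a factor of ten or more, and a noticeable share point in the wrong direction. Reaching significance says nothing here about whether the standard error was low enough.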

I think that a lot of it is indeed genuine misunderstanding. The concepts are subtle and complicated, with lots of ifs, ands, and buts. People naively try to simplify, and end up oversimplifying to the point of missing a lot. But then the oversimplifications *seem* understandable to others, and get passed on, till someone else oversimplifies even further, losing more of what’s really going on, and so on.

Perhaps it’s time to try to draft a concise list of common shared or yet to be shared insights.

1. A p_value is just one view of what an experiment suggests/supports – there are many others that may well be better in various situations.

2. Consider p_values as continuous assessments and be wary of any thresholds they may or may not be under (or targeted alpha error levels).

3. Keep in mind that p_value assessments are based on the possibly questionable assumption of zero effect and zero systematic error as well as additional ancillary assumptions.

4. Realize that the real or penultimate inference considers the ensemble of studies (completed, ongoing and future); individual studies are just pieces in that, and only jointly allow the assessment of real uncertainty.

5. Be aware that informative prior (beyond the ensemble of studies) information, even if informally brought in as categorical qualifications (e.g. in large well done RCTs with large effects the assumption of zero systematic error is not problematic) may be unavoidable – learning how to peer review priors so that they are not just seen as personal opinion may also be unavoidable.

6. The above considerations must be highly motivated towards discerning what experiments suggest/support as well as quantifying the uncertainties in that, as all of them can be gamed for publication and career advantage.

7. All of this simply cannot be entrusted to single individuals or groups no matter how well meaning they attempt to be – bias and error are unavoidable and random audits may be the only way to overcome these.

8. ???

9. ???

Good to develop lists, important to be careful about wording given the penchant of list users to misread the meaning of items to fit what they want them to mean or what they “already know”.

Take your #1 start: “A p_value is just one view of what an experiment suggests/supports”

– I would keep P-values as a refutational tool only (which is how I think Fisher viewed them) and that means they suggest and support nothing. They only measure something: A P-value takes a “standardized” measure of distance D between the data and a model (e.g., a model which includes “no effect” among its assumptions, along with “no uncontrolled bias” and so on) and maps that distance into the unit interval (0,1] using the inverse of the model’s implied sampling distribution for D. This probability transform supposedly makes the observed distance D more intelligible, although experience shows it doesn’t really do so in any practical sense. Hence I’ve been trying to resurrect Good’s 1957 suggestion to take one more step and take surprisal S = -log_2(P) as the bits of information in D (or P) against the model. This measure never supports the model, it just transforms the distance D to an information scale instead of a probability scale.
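To make the surprisal transform concrete, here is a minimal sketch of S = -log_2(P). The function name `surprisal_bits` and the example P-values are illustrative choices, not part of the comment above.

```python
import math


def surprisal_bits(p_value):
    """Return S = -log2(P), the bits of information against the model.

    p_value must lie in (0, 1]; P = 1 gives S = 0 (no information
    against the model), and smaller P gives larger S.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log2(p_value)


for p in (0.5, 0.05, 0.005):
    print(f"P = {p}: S = {surprisal_bits(p):.2f} bits")
```

One common way to read the result is in coin tosses: P = 0.05 gives a little over 4 bits, roughly as surprising as seeing four heads in a row from a coin assumed fair, and P = 0.005 gives between 7 and 8 bits. The scale is unbounded above and bottoms out at zero, so it can count evidence against the model but never registers support for it.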

Then you say “there are many others that may well be better in various situations”: There are always other measures that capture aspects of the data that D and hence P and S don’t. Estimates are the chief example, and are always needed in addition to these model-discrepancy measures if one is seriously trying to extract all the useful, relevant information in the data about a scientific question. The severe model dependence of estimates however leads us back to checking the estimation models and hence to P or S, to make sure we aren’t basing estimates on models that our data scream are wrong. I think this was a core message in Box (1976-1983) and Cox as well. So your #1 should become

1) A p_value is just one of many tools for checking the compatibility between our data and a model or hypothesis of interest. Such checks are important to avoid estimation based on misleading models, but this task should not distract us from estimation as an essential step in answering scientific questions.

The rest I found more congenial – I especially liked 4,6,7 and think the warning “all of them can be gamed for publication and career advantage” needs special emphasis in basic teaching as well as in specialty articles and blogs.

As time goes on, I see measurement more and more as the key thing being ignored and in need of improvement. It’s especially striking given that psychology is one of the very small number of fields that has a sub-discipline dedicated to measurement (i.e., psychometrics; education is also very invested in this).

And the problem in research using humans is that in many realistic cases, increasing N often decreases measurement quality. If I want a survey of 5000 people, every survey item really counts with regard to cost. When I have a student sample, I can make them take my 40-question battery. When I first started grad school, I proposed a 30-item scale for a population survey and I thought they were going to laugh me out of the room.

Better measurement helps us get more out of our small samples and might help us devise methods to more economically use larger samples without too great a loss of measurement quality. But if you think people hate trying to publish shaky statistical results, note that, with some exceptions, it’s a huge pain to publish measurement studies.

Concerning replication: If you abandon significance testing, what counts as a successful replication?