https://www.nytimes.com/2017/10/23/upshot/the-cookie-crumbles-a-retracted-study-points-to-a-larger-truth.html?_r=0

http://www.sciencedirect.com/science/article/pii/S0950329317301362

This is the audience for the article that gave the sloppy but easy-to-understand explanation of what a p-value is.

The p-value is sometimes described as “an index of surprise”: How surprising would these results be if you assumed your hypothesis were false?

However, the p-value is calculated using various assumptions (called model assumptions) that are difficult (or often impossible) to verify in any given case. Thus the p-value is usually a very iffy thing to use to draw any convincing conclusion.

“The p-value is sometimes described as “an index of surprise”: How surprising would these results be if you assumed your hypothesis were false? However, the p-value is calculated using various assumptions (called model assumptions) that are difficult (or often impossible) to verify in any given case. Thus the p-value is usually a very iffy thing to use to draw any convincing conclusion.”

I’m not sure if that is a question, because I think you already know the answer. The thing distinguishing the p-value, the reason it has been used (and misused) for more than a century, is its relation to the tail area of the probability distribution / likelihood function mentioned before.

Which also makes the p-value’s sampling distribution uniform on [0, 1] under the null hypothesis (at least in simple cases, leaving aside discrete distributions, composite hypotheses, etc.).

And while this is not unique to p-values, it also has the continuity and monotonicity properties that you seem to appreciate; not every sufficient statistic has those (notice that the magic number I proposed above is a sufficient statistic).
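That uniformity claim is easy to check by simulation. A minimal sketch of mine (not from the thread): it uses a one-sample z-test with known variance so only the Python standard library is needed, and the seed, sample size, and number of replications are arbitrary choices.

```python
import math
import random

random.seed(1)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, with known sd = 1."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)  # standardized sample mean
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Draw many datasets from the null and collect their p-values.
pvals = [z_test_p([random.gauss(0, 1) for _ in range(30)])
         for _ in range(20000)]

# Under the null, the p-values should be roughly uniform on [0, 1]:
# about 10% of them should land in each decile.
deciles = [sum(1 for p in pvals if i / 10 <= p < (i + 1) / 10) / len(pvals)
           for i in range(10)]
print([round(d, 3) for d in deciles])
```

Each decile comes out near 0.10, which is the “uniform under the null” property in action.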

I meant as something to point the layperson to that explains why scientists care about it. This list doesn’t seem to be a solution, since even though I know what you are talking about, I still don’t see it. Maybe if the “null hypothesis” == “research hypothesis”, but even then you will always have simplifications that render the tested hypothesis false…

Which also makes the p-value’s sampling distribution uniform on [0, 1] under the null hypothesis (at least in simple cases, leaving aside discrete distributions, composite hypotheses, etc.).

And while this is not unique to p-values, it also has the continuity and monotonicity properties that you seem to appreciate; not every sufficient statistic has those (notice that the magic number I proposed above is a sufficient statistic).

For a given p-value, a larger sample size will narrow the likelihood *and* move it closer to zero. If one remembers the relation between tail area and p-value, it’s easy to see why; it’s unfortunate that you don’t want to include that in the definition.

Yes, sorry. I was thinking for a given effect size a larger sample will move the p-value closer to zero.

I think we agree that the p-value is not just an index; it has some interesting properties of its own.

So then perhaps there should be a list that distinguishes a p-value from other sufficient statistics?

But also you like to pick the low-hanging fruit (i.e. social psychology).

Actually I most like to “pick on” medical research (both preclinical and clinical), because one day I hope to be able to work in that field again (when it becomes standard to take your job seriously), and I couldn’t, for the most part, care less about social psych. E.g., from the various replication-project results, preclinical cancer research looks much worse than social psych. I also “picked on” the LIGO analysis, since I do not see why there was so much focus on the null model while the various alternative explanations got a couple of sentences in the main paper. Basically, anywhere you find NHST I will see the same issues.

There are many people using NHST who aren’t “accepting the alternative” when they get low p-values.

In the case where the null hypothesis is the default “no difference between groups” not predicted by any theory what do you learn from the p-value?

For a lot of people in the social sciences the point estimate, standard error (and p-value) are just standard ways we describe our parameter estimates. In the same way Bayesians report the posterior mean and sd, etc. of parameter estimates.

There is nothing wrong with parameter estimation, but what does the p-value add? For the usual t-test, at least, it is just a non-linear transform of the information contained in the point estimate and standard error. It contains less information and is more abstracted from the data.
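For the large-sample case this is easy to make concrete. A sketch of mine (`p_from_estimate` is a made-up helper name, and the normal approximation stands in for the exact t distribution):

```python
import math

def p_from_estimate(est, se):
    """Two-sided p-value as a deterministic (non-linear) transform of the
    point estimate and its standard error, using the large-sample normal
    approximation to the t distribution."""
    z = abs(est / se)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# The transform only sees the ratio est/se, so it discards information:
# these two very different estimates yield the same p-value.
print(p_from_estimate(1.0, 0.5))   # same as the line below
print(p_from_estimate(10.0, 5.0))  # same as the line above
```

Both calls print the same p-value even though the effect sizes differ by a factor of ten, which is the sense in which the p-value is more abstracted from the data than the (estimate, SE) pair.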

I think we agree that the p-value is not just an index; it has some interesting properties of its own. But it is of course true that any sufficient statistic can be interpreted as an index into likelihood functions if one wishes to do so.

You don’t need just the p-value and the sample size; you also need the model.

Yes, but for the common use case being covered the model will be the t-test, so it is “always” the same model (ignoring the equal-sample-size, etc., variations).

Do you really think that a p-value is not a bit more informative than this index of mine?

I agree it is more informative. For a given p-value, a larger sample size will narrow the likelihood, while a smaller one will widen it (we can also put this as decreasing/increasing uncertainty). For a given sample size, a larger p-value will move it closer to zero, while a smaller one will move it farther away. What else is there beyond that?

Love this article’s logic ( https://www.hsph.harvard.edu/news/press-releases/recent-presidential-election-could-have-negative-impact-on-health/ ):

1. Trump elected;

2. Creates distress amongst some groups that *could* lead to “increased risk for disease, babies born too early, and premature death”;

3. Hence, clinicians should suggest “psychotherapy or medication”.

Is all of the above possible? Yes.

Likely? Not in my opinion.

Is there scientific evidence? I’d say published “evidence” consists mostly of uncontrolled studies serving as rhetorical devices for motivated reasoning.

You could also index the likelihood functions using a number constructed from the data by some ridiculous mechanism like intercalating digits. Let’s say you have three measurements x1=1.234, x2=56.7 and x3=89, and you produce the magic number 58169.2703004. You can produce such a magic number for any dataset you’ve got, and these magic numbers are an index for a unique (really unique, in this case) likelihood function (the one corresponding to x1, x2, and x3). The values of x1, x2 and x3 can be trivially recovered from the “index” and the number of measurements used to create it; you don’t even need a lookup table!

Do you really think that a p-value is not a bit more informative than this index of mine?
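For concreteness, a rough reconstruction of the intercalation mechanism described above (my own sketch, not from the original comment): pad each number to a common integer and fractional width, then read the digits off position by position across the values.

```python
def intercalate(values):
    """Build the 'magic number' by interleaving decimal digits.

    Each value is padded to the widest integer / fractional width among
    the inputs; then digits are taken position by position across the
    values, so the original measurements remain recoverable."""
    ints, fracs = [], []
    for v in values:
        s = f"{v:.10f}".rstrip("0")
        if s.endswith("."):
            s += "0"
        i, f = s.split(".")
        ints.append(i)
        fracs.append(f)
    wi = max(len(i) for i in ints)
    wf = max(len(f) for f in fracs)
    int_digits = "".join("".join(col) for col in zip(*(i.zfill(wi) for i in ints)))
    frac_digits = "".join("".join(col) for col in zip(*(f.ljust(wf, "0") for f in fracs)))
    return float(int_digits + "." + frac_digits)

print(intercalate([1.234, 56.7, 89]))  # 58169.2703004, matching the example above
```

The point of the construction survives the details: this number is a perfectly good one-to-one “index” into likelihood functions, yet nobody would mistake it for a useful summary.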

I would call “then if we reject the null hypothesis it means our hypothesis must be true” invalid logic, rather than “convoluted”.

Sure, but the convoluted part is that we are dealing with the “null hypothesis” and somehow must make it seem that we are evaluating the “research hypothesis”. If one is true the other must be false, but for some reason we need to check the former to draw a conclusion about the latter, etc.

I put in the stuff like confusing “rejecting” with “disproving” as the usual red herrings that obscure the real problem with the logic (that simply rejecting the null hypothesis in no way indicates the research hypothesis is correct). It is convoluted to begin with, and then becomes even more so upon disentangling, due to the comedy of errors stacked on top of each other. I really don’t think that parody of an explanation I offered is a straw man.

“For the most common use-case (comparing two groups), the p-value and sample size are like the house number and street name for a curve (similar to the “bell curve”) that shows how relatively well various effect sizes would fit the data.”

The p-value only corresponds to a unique likelihood function when there is only one parameter and we’re doing a one-sided test. In general, multiple likelihood functions (i.e. multiple values of the parameter vector) can correspond to the same p-value.

I am not sure how well this works for other situations, but I would agree that adding “for the most common use-case of comparing two groups” would be better at this time. I also think that use-case is sufficient for a quick lay explanation. Explaining the likelihood function doesn’t seem so hard for this case either. It tells you the most likely parameter value (e.g., effect size) as well as how uncertain you are about it. It shows which values are more or less likely than others.

there is a meaning in the ordering of p-values and there is a meaning in the magnitude of p-values

[…]

(Apart from that, I find the inclusion of sample size in that definition somewhat artificial. The sample size is part of the model. The likelihood functions for models with different numbers of observations live in different spaces. Whatever indexing capability into likelihood functions is provided by p-values, it’s unrelated to sample size.)

The indexing requires sample size. If you know the sample size and p-value, you can then use a lookup table to get the pre-computed likelihood function; imagine this as a bunch of charts at the end of a textbook. If you know only the p-value, you don’t know whether the location is far from zero and the likelihood is wide, or it is close to zero and the likelihood is narrow (this is the classic “statistical significance does not mean practical significance” issue that people discovered empirically).

Note I am really just trying to explain what a p-value actually is here, not how people are trying (incorrectly) to use it. How about:

“For the most common use-case (comparing two groups), the p-value and sample size index (are like the street name and house number for) a unique likelihood function. These likelihood functions are a way of seeing how relatively well various effect sizes would fit the data.”
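The lookup-table idea can be sketched in code. This is my own construction, using a two-group z-test with known unit variances as a stand-in for the t-test (`likelihood_from_index` is a made-up name; `NormalDist` is in the Python standard library):

```python
from statistics import NormalDist

def likelihood_from_index(p, n):
    """Recover the (normal) likelihood for the effect size from the
    'index' (two-sided p-value, per-group sample size n), for a
    two-group z-test with known unit variances.

    Returns (location, width) = (|mean difference|, its standard error)."""
    z = NormalDist().inv_cdf(1 - p / 2)  # |observed difference| / SE
    se = (2 / n) ** 0.5                  # SE of the difference in means
    return z * se, se

# Same p-value, different sample sizes: the likelihood is not the same.
print(likelihood_from_index(0.05, 10))    # located far from zero, wide
print(likelihood_from_index(0.05, 1000))  # located close to zero, narrow
```

Same p-value, different n: both the location and the width of the likelihood change, which is exactly why the p-value alone cannot serve as the index.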

“People intuitively know that scientists are just supposed to be evaluating their hypothesis, not using some kind of convoluted logic like:

“We set a null hypothesis, the opposite of our hypothesis (also called the alternative hypothesis), and try to disprove that; then if we reject the null hypothesis it means our hypothesis must be true… “”

I would call “then if we reject the null hypothesis it means our hypothesis must be true” invalid logic, rather than “convoluted”.

I realize that many people use this invalid logic, so when teaching frequentist statistics, I emphasize that rejecting the null hypothesis does not mean the null hypothesis is false — that rejecting the null hypothesis on the basis of a p-value is an *individual choice*, not a logical consequence, so must be accompanied by the acceptance that our choice might be incorrect.

The p-value only corresponds to a unique likelihood function when there is only one parameter and we’re doing a one-sided test. In general, multiple likelihood functions (i.e. multiple values of the parameter vector) can correspond to the same p-value.

Even in the case where the bijective relationship holds, there are infinitely many alternative ways to define a one-to-one correspondence that could also act as a summary of the likelihood function (in the indexing sense) but lack the characteristics that make the p-value interesting: there is a meaning in the ordering of p-values and there is a meaning in the magnitude of p-values.

(Apart from that, I find the inclusion of sample size in that definition somewhat artificial. The sample size is part of the model. The likelihood functions for models with different numbers of observations live in different spaces. Whatever indexing capability into likelihood functions is provided by p-values, it’s unrelated to sample size.)

The p-value is just an intermediate calculation in this process that is compared to an arbitrary cutoff point. NHST does not require the p-value; any summary statistic can be used to perform this procedure.

If you were writing an article for a popular audience in which you were explaining to them why relying on p < 0.05 (or p-values at all) is a problem, how would you explain it? Assume that the readers are generally educated and curious and possibly involved in research of some kind, but don't understand what a p-value is. I know you said above that you can't explain the logic of NHST to laypeople on account of it not making sense, but there are lots of people out there who think it makes sense and who explain it all the time (typically poorly). Assuming that there is some value in communicating the problem of relying on p < 0.05 to a general audience, this logic needs to be explained somehow, even if only for the purpose of eventually attacking it.

I run into this when talking to friends of mine who are using p-values in their own work. I'm not going to dissuade them from relying on p < 0.05 simply by asserting that they shouldn't do it; after all they live in a world where statistical significance is rewarded. I've got to explain the logic behind it in as clear a manner as possible first, and that's going to mean skipping over some considerations like the ones you listed above.

The P-value and sample size together correspond to a unique likelihood function, and thus act as a summary of that function and the evidence quantified by that function.

https://arxiv.org/pdf/1311.0081

We’d need to explain that a likelihood function is a way of showing the value of, and uncertainty about, a model parameter being estimated. I think that is it, though.

However, it doesn’t define the null. The word “conditional” is also a bit technical for the layman (or maybe I’m supposed to say layperson?). But I agree that’s also a good definition.

> the “index of surprise” presupposes that the falsity of the null hypothesis would be surprising

I don’t think so. It works whether you think the null hypothesis is likely to be true (how surprising would it be for someone to toss five heads in a row, assuming the coin is fair?) or false (how surprising would it be for Nadal to beat you five sets in a row, assuming both of you played equally well?).
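The fair-coin half of that example is a one-line calculation (a trivial sketch of mine, assuming exactly five tosses and treating “all heads” as the one-sided tail event):

```python
# Probability of five heads in a row under the fair-coin null hypothesis:
# this is the tail probability a p-value would report for that outcome.
p = 0.5 ** 5
print(p)  # 0.03125
```

So the “surprise” here is just the small tail probability computed under the null.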

for a specifically chosen random number generator ~ if you assumed the null hypothesis is true

Doesn’t look much better to me.

I’m not sure what makes the quote about the p-value as an index of surprise so bad.

This is the quote in question:

“Instead, you can think of the p-value as an index of surprise. How surprising would these results be if you assumed your hypothesis was false?”

https://fivethirtyeight.com/features/science-isnt-broken/

I think everything else there is a red herring that distracts from the reasoning behind this quote.

1) Calling the p-value an “index” of surprise (I have also seen “measure”). It is never quite defined how this index/measure works. E.g., what type of measurement is it (categorical, ordinal, interval, ratio)? How exactly does it map to the amount of evidence for “your hypothesis”? Why not just look at the amount of evidence for/against your hypothesis instead of this index?

2) What is “surprise”? How is it defined? Isn’t the amount of surprise going to depend on the person, i.e. isn’t this “subjective”?

3) Then there is “if you assumed your hypothesis was false”. Well, the p-value is calculated using equations that assume the *null hypothesis* is true, so “your hypothesis” must amount to “not the null hypothesis”.

— a. First, we are leaving out the possibility of correctly using a p-value to assess what “your hypothesis” predicted.

— b. Second, is “your hypothesis” really amounting to “anything possible except the null hypothesis I tested”? Isn’t this a kind of unfair competition between the null hypothesis and “your hypothesis” since the former is a single value and the latter is every other possible outcome?

> Last year, for example, a study of more than 19,000 people showed that those who meet their spouses online are less likely to divorce (p < 0.002) […] That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%

If a 22% reduction in divorce rates is "tiny", I wonder what kind of effect would she find worth reporting.
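For reference, the arithmetic behind that 22% figure, using the rates quoted above:

```python
# Divorce rates (%) from the quoted study: offline baseline vs. meeting online.
baseline, online = 7.67, 5.96
relative_reduction = (baseline - online) / baseline  # relative, not absolute
print(f"{100 * relative_reduction:.1f}%")  # 22.3%
```

The absolute reduction is 1.71 percentage points, while the relative reduction is about 22%; the “tiny” vs. “not tiny” disagreement is really about which of those two numbers you look at.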

Anoneuoid: You’re right, that particular point is not good — but most of the rest of it is (for the intended audience).

My position is that it’s impossible to explain the logic of NHST to the layperson because it makes no sense. People intuitively know that scientists are just supposed to be evaluating their hypothesis, not using some kind of convoluted logic like:

“We set a null hypothesis, the opposite of our hypothesis (also called the alternative hypothesis), and try to disprove that; then if we reject the null hypothesis it means our hypothesis must be true… To decide whether to reject the null hypothesis we calculate a p-value, which assumes the null hypothesis is true, therefore assuming our hypothesis must be false. Thus, the p-value is a “measure”, or “index”, of how surprised we would be to get our results if our hypothesis was false.

If we reject the null hypothesis, then we must accept our hypothesis, since the results would be unsurprising if our hypothesis were true. If that is difficult to understand, think about it just like a court, where the null hypothesis is assumed innocent until proven guilty. In other words, the burden of evidence is on the scientist to prove the null hypothesis false, which would mean that their hypothesis is true.”

The only reason we still have this going on is that people who realize how much has been claimed based on NHST (ie not laypeople) are scared of the scope of damage that may have been done.

GS: Just out of curiosity… what is “food science”? It doesn’t sound like a “field” to me. And it would be misnamed if it were, since “it” surely is concerned with behavior in some way (as in rates of consumption of food, etc.). Now, if that is the case, then the field is something like “regulation of food intake” and would thus be a part of the natural science of behavior, as well as the part of physiology specifically concerned with the physiological mediation of behavior. In that case, I don’t know exactly why you would trust it any less than a great many other fields.

The 538 article you linked to above does seem to explain the issues unusually well for a non-statistician audience.

I disagree:

Instead, you can think of the p-value as an index of surprise. How surprising would these results be if you assumed your hypothesis was false?

Yes!

Also: The 538 article you linked to above does seem to explain the issues unusually well for a non-statistician audience. It ought to be required reading for Research Methods courses in lots of fields. And such courses also ought to include reading papers, and then PPPRs of them, of the sort you recommend.

I agree with your assessment of the problem, and I think post-publication peer review is a good idea for this. Apropos of this, one direction I think PPPR needs to go is not just searching for fraud (à la PubPeer) but genuine commenting on papers: discussing strengths and weaknesses, alternative approaches, etc. This type of discussion allows an assessment of who is reliable and who, while not committing fraud, is pushing a story that their data might not fully justify. I have right in front of me a paper by an eminent physicist that is widely known in the tiny subfield to not be their best work, and is probably an artifact of the experimental system used. This is not a fraudulent paper, but (like BW’s, I suspect) it is a reflection of a non-ideal experimental system and sloppy work, with resulting interpretations not fully justified. I am not even really against its having been published, but a formal post-publication review would allow readers to understand that, in the end, it is a crummy paper.

One frustrating thing for me about sites like PubPeer is that they are now mostly used to search for fraud and not for discussion, but that is a topic for a different time…

I agree that retraction/correction isn’t the best solution here. The problem is that retraction/correction aren’t scalable: each retraction or correction is taken as such a big deal. Even in extreme cases it’s hard to get a retraction or a meaningful correction if the author doesn’t want to do it. Remember that “gremlins” paper by Richard Tol that had almost as many errors as data points? Even that didn’t ever get retracted or fully corrected, I think.

It’s similar to the problem of using the death penalty to control crime. It’s just too much of a big deal. Each death sentence gets litigated forever, and very few people get executed because of (legitimate) concerns of executing an innocent person.

Scientific journals’ procedures for addressing published errors are, like capital punishment in the U.S., a broken system.

I don’t have any great solutions to crime and punishment. Flogging, maybe.

But for the journal system, I recommend post-publication review. In a post-publication comment you can point out problems with forking paths and other statistical problems, and you can also point out problems such as in Wansink’s work, where a research team has, in the words of the NIH, been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

It isn’t just the media and journalists, of course. Journals behave the same way – errors of the nature that Jordan and Tim and Nick and James have found can lead to corrections and retractions. When was the last time a paper was corrected or retracted after it came out that the results were p-hacked?

You are right, of course, about getting the media fired up, but I am a little wary about this type of media-attention seeking, since 1. the general public focuses on the wrong issues, and this will lead to 2. the scientific community ignoring the outcry, since it is focused on the wrong issue. If the public mis-learns the issues, it just makes reform harder. Most of the posts in the mainstream media about BW have focused on the errors in his papers, and have given little or no attention to the p-hacking.

Some of this might be because it is easy to explain to people that someone made errors, but why p-hacking is bad is hard to explain to a non-statistician. Journalists will then default to the easy, but less fundamental, issue.

I think a better approach is to explain the issues as clearly as possible (for example, the fivethirtyeight.com blog does a nice job of explaining p-hacking: https://fivethirtyeight.com/features/science-isnt-broken/). A lot of press has already come out of the p-hacking issue, and even the BW issue started out of p-hacking.

We’ll wait and see, I still bet that Wansink comes out of this relatively unscathed within his field.

I agree that p-hacking is more serious than misreported sample sizes, typos, or whatever the case may be with granularity problems. However, consider this: if we didn’t point out that the numbers in Wansink’s papers didn’t add up, would the general public (i.e. media) have cared about Wansink’s blog post? I think it’s difficult to get people riled up about flexible analyses, small sample sizes, the file-drawer effect, etc., but it’s pretty easy to get people to understand that numbers in papers should add up.

This is simply a hypothesis, but I believe that people engaging in rampant p-hacking and poor study design are much more likely to inaccurately report their results, and be exposed by granularity testing. As a result, I do think granularity testing can be used as a sort of proxy for catching p-hacking.

Also, granularity testing has the ability to catch people who are just completely fabricating results, which is obviously far less prevalent than p-hacking, but is still worth trying to detect and expose.

You guys have done fantastic work exposing researchers like Wansink, but you are overstating your case that his reputation is beyond salvaging. I suspect that the problems you found with his papers are rampant in the literature in his field, and this will lead to your criticisms of him being ignored overall, since the problems are prevalent in the field and his peers will perceive them as no big deal. I think a lot more work needs to be done, not about individual researchers but about whole fields (obviously no one person can do all of this). Personally, I suspect that the food science field is full of poorly trained scientists and statisticians, that most papers display these errors, and that they will just circle the wagons.

We can test this hypothesis by simply waiting a few years and seeing if Wansink’s h-index has decreased. My bet is no.

On a different note, I think all the emphasis on granularity errors and other types of errors masks a larger problem. Based on Wansink’s original post, a huge swath of his work is clearly p-hacked. In my mind, once you are p-hacking, I don’t care if all of your statistics are error-free; the whole approach is just totally non-scientific and the work should be dismissed, even if each individual calculation adds up. This was the original criticism of Wansink, and it, in my mind, is the most egregious.

To put this another way: imagine the Pizza papers lacked any of the errors that you detected in your analysis; would Wansink’s work be any more correct than it is now? My answer is an emphatic no.

The real breakthrough of yours, Andrew’s, and others’ work on the Wansink case (in my opinion) is that it opens up the opportunity for more general inquiry into whole sub-fields that may be suspect, such as food science. Wansink is one example; how many others are out there who are engaging in p-hacking?
