Baruch Eitam writes:

So I have been convinced of the futility of NHT for my scientific goals and of the futility of significance testing (in the sense of using p-values as a measure of the strength of evidence against the null). So convinced that I have been teaching this for the last 2 years. Yesterday I bumped into this paper [“To P or not to P: on the evidential nature of P-values and their place in scientific inference,” by Michael Lew], which I thought makes a very strong argument for the validity of using significance testing for the above purpose. Furthermore, by his 1:1 mapping of p-values to likelihood functions he kind of obliterates the difference between the Bayesian and frequentist perspectives. My questions are: 1. Is his argument sound? 2. What does this mean regarding the use of p-values as measures of strength of evidence?

I replied that it all seems a bit nuts to me. If you’re not going to use p-values for hypothesis testing (and I agree with the author that this is not a good idea), why bother with p-values at all? It seems weird to use p-values to summarize the likelihood; why not just use the likelihood and do Bayesian inference directly? Regarding that latter point, see this paper of mine on p-values.

Eitam followed up:

But aren’t you surprised that the p-values *do* summarize the likelihood?

I replied that I did not read the paper in detail, but for any given model and sample size, I guess it makes sense that any two measures of evidence can be mapped to each other.

It doesn’t obliterate the perspective because there is no integration over the prior in the likelihood-based perspective. The big problem for the likelihood-based perspective is nuisance parameters, which Bayesians can just integrate out.

For a two-sample t-test, lots of statistics summarize the very same information in a data set (essentially the signal to noise ratio). If you know the sample sizes and any one of the statistics, you can calculate all the others (e.g., t-value, Cohen’s d, CI of Cohen’s d, likelihood ratio, dAIC, dBIC). An on-line app is at

http://psych.purdue.edu/~gfrancis/EquivalentStatistics/
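The equivalences the app demonstrates can be sketched in a few lines. Below is a minimal Python sketch (the function names are mine, not taken from the app): given the two sample sizes, the equal-variance two-sample t statistic, Cohen’s d, and the maximized normal-model likelihood ratio each determine the others.

```python
import math

def d_from_t(t, n1, n2):
    # Cohen's d from the equal-variance two-sample t statistic:
    # d = (m1 - m2) / s_pooled = t * sqrt(1/n1 + 1/n2)
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)

def t_from_d(d, n1, n2):
    # the inverse mapping: same information, different scale
    return d / math.sqrt(1.0 / n1 + 1.0 / n2)

def likelihood_ratio_from_t(t, n1, n2):
    # maximized normal-model likelihood ratio (alternative vs. null) for the
    # two-sample comparison: (1 + t^2/df)^(N/2), with N = n1 + n2, df = N - 2
    N = n1 + n2
    df = N - 2
    return (1.0 + t * t / df) ** (N / 2.0)

t, n1, n2 = 2.5, 20, 25
d = d_from_t(t, n1, n2)
print(d)                                  # 0.75
print(t_from_d(d, n1, n2))                # recovers t = 2.5
print(likelihood_ratio_from_t(t, n1, n2))
```

Going the other way (from d back to t, or from t to a p-value or dAIC) is the same kind of deterministic conversion once the sample sizes are fixed, which is the point of the app.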

Advocates of different statistics are thus not arguing about which information in the data set is relevant but about how to interpret that information. These different interpretations can be radically different, so users of these statistics need to think carefully about how to interpret the information in their data set.

+1 to “users of these statistics need to think carefully about how to interpret the information in their data set.” So often the thinking is lacking.

(BTW, on NPR this morning I heard the lovely phrase “small effectology” to describe the search for low p-values without looking at effect size.)

> But aren’t you surprised that the p-values do summarize the likelihood?

In general there is a loss of information except under special assumptions (e.g. Normal(mu,sigma)) and involving a particular parameterization of interest (mu/sigma).

The likelihood function (more than just the MLE and the SE of the MLE) is the minimal sufficient statistic (the sufficient statistic of lowest dimension) that does not lose information. Except for exponential-family and transformation statistical models, it has the same dimension as the number of observations.

The initial thought that this was something special came from Fisher, who opined that authors need only report the (log) likelihood from their studies so that later authors could do a meta-analysis without the need for the raw data. Not very thoughtful, as others then cannot check the fit of the models, and if the fit is found to be too poor (over the multiple studies) they cannot change the assumptions without losing information.

But it is an intriguing technical topic: Fraser, D.A.S. & Naderi, A. (2007). Minimal sufficient statistics emerge from the observed likelihood function. Int. J. Stat. Sci., 6, 55–61. Available at http://www.utstat.toronto.edu/dfraser/documents/238.pdf, accessed on September 18, 2013.

Greg’s app nicely demonstrates a case with special assumptions.

Clicking the show-parameters button, we see that the statistics all relate to “standardized effect size” and always assume a “non-central t-distribution”.

In such a case, they all reflect the same information but different aspects or implications/interpretations of it.

Related – take a p-curve (observed p-value as a function of theta). Differentiate with respect to theta and plot. For location-scale problems this should (I think) give you the posterior obtained by assuming a Jeffreys prior. This is Fisher’s fiducial distribution. Historically, the differences that occur for non-location-scale and/or nuisance-parameter problems were taken as an argument in favour of Bayes over fiducial inference. Fraser has argued that the Fisher route is actually the correct way to go. See e.g. his ‘Is Bayes just quick and dirty Confidence’ if interested.
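In the simplest instance (a normal location model with known sigma, where the Jeffreys prior is flat), this claim can be checked numerically. The sketch below uses arbitrary made-up values for the observed data point and the evaluation point; it differentiates the one-sided p-curve with respect to theta and compares the result to the flat-prior posterior density.

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

x_obs, sigma = 1.3, 1.0  # arbitrary observed value, known scale

def p_curve(theta):
    # one-sided p-value of the observed x as a function of theta
    return 1.0 - norm_cdf((x_obs - theta) / sigma)

def posterior_flat(theta):
    # posterior density under a flat prior: Normal(x_obs, sigma) in theta
    return norm_pdf((theta - x_obs) / sigma) / sigma

# numerically differentiate the p-curve with respect to theta
h = 1e-6
theta = 0.4  # arbitrary evaluation point
deriv = (p_curve(theta + h) - p_curve(theta - h)) / (2 * h)
print(deriv, posterior_flat(theta))  # the two values agree
```

For this location model the agreement is exact (a change of variables in the integral); as the comment notes, things get more delicate outside the location-scale setting or with nuisance parameters.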

Blech. For Fraser, “probability” means frequency — strictly that and nothing else. He seems incapable of conceiving of people *not* attaching that meaning to their own use of the word; for me, reading Fraser on Bayes is just… well, just this.

Well, I did say for those interested… and I didn’t necessarily endorse his interpretation.

The first time I read Fraser on Bayes I had a similar reaction. Closer reading convinced me there is a valuable point there, despite not agreeing with everything there and having to look beyond some of the language.

Really, I have my own idiosyncratic views that don’t really fit neatly into Bayes, Freq or Likelihood paradigms. I’ve yet to see an especially convincing argument for the fundamental *principles* of any of these, despite looking, though in *practice* each can be used sensibly.

My impression is that a guy named Thomas Severini basically covered all this stuff last decade (and managed to do it without wasting half his word count on his own misapprehensions about Bayes).

Severini doesn’t make the same connection to Fisher’s fiducial approach that Fraser did, but I’m doubtful that those connections are quite as strong as Fraser seems to think they are. I might be wrong, but my current understanding is that the idea that Fisher meant his fiducial inference to generate confidence distributions is falsified by his fiducial solution to the Behrens-Fisher problem. This is the problem of comparing normal means with variances unknown and not assumed equal (the Welch two-sample t-test is the standard test nowadays); IIRC Fisher’s preferred solution involved marginalizing out the nuisance variance parameters using their fiducial distributions and getting a fiducial distribution for the difference of means that provably lacked correct coverage.

Yes, I believe that Fraser would certainly say that dealing with nuisance parameters requires care.

So the ‘Fisher route’ has to be interpreted in terms of using whatever methods – typically higher-order likelihood asymptotics etc for Fraser et al – to eliminate nuisance parameters to get ‘targeted’ probability statements about parameters of interest. Reid, Fraser, Cox etc have written a lot on this, and it goes back a long way. They also have papers on using data-dependent matching priors etc.

Many others like Severini have probably contributed, I definitely don’t know the literature that well. One source of confusion seems to be that the inversion and marginalisation steps don’t commute, so this requires care.

Here are my comments on a paper by Fraser.

Yes exactly (and more generally it is something more like a Jeffreys prior than strictly just a uniform prior, right?), which brings things back to the topic of one of your other (?) recent posts.

That is, the need for prior information. Of course every statistical approach can include additional information, and there are often further mathematical relationships between these, e.g. data augmentation, expected loss/bias-variance, etc.

So, if (the derivative of) a p-curve is similar to a posterior under a non-informative prior, among many other relationships, and including additional information works similarly in different approaches, why the big fuss (in general, not from Andrew)?

>”[Lew 2013] makes a very strong argument for the validity of using significance testing for the above purpose”

I 100% disagree. For example, from the paper:

“It is notable, and perhaps useful for pedagogic purposes, that there is no sudden change in the distribution of P-values at any level and so there is not a `natural’ place to set a threshold for dichotomizing the P-values into significant and not significant.”

The use of p-values described in that paper (as a summary statistic) is valid. I would say the usefulness of this summary statistic is extremely limited, but it *is* valid. This is not the use advocated by the people teaching and publishing NHST.

Maybe Michael Lew will correct me, but I thought a major point of the paper was that there are other equally valid definitions of the p-value; the standard one (about seeing such data if the null hypothesis is true) is “correct” but confusing and misleading. It does not capture why scientists would find p-values at all interesting, and teaching it to them confuses the hell out of them.

Sure, but I don’t see the disagreement. “The above purpose” refers to strength of evidence against the null, and my surprise was about the p-value (contrary to what I thought were, for example, Royall’s claims), not about the current practice of NHT. Also, could you please elaborate on why it is very limited (especially for a person doing experimental research)?

Sure about NHT, but I don’t see the disagreement with what I wrote. “Validity” refers back to the p-value [as a measure of strength of evidence]. Regardless of usability, the fact that the p-value has a 1:1 relationship to the likelihood surprised me, and I don’t think I have read this anywhere before.

While we are at it, could you say a bit more about why this statistic is extremely limited (i.e., more so than the likelihood is)? Also, if there is such a 1:1 correspondence, would it make sense to make the Royall move (and get rid of the probabilities altogether) and present ratios of p-values?

I meant that, from my reading, the Lew (2013) paper makes a distinction between significance testing, hypothesis testing, and the evidential use of p-values. Once again, from my reading, the paper says the last is valid but the first two are a result of confusion. However, maybe I am somehow misreading it since I agree so much with that conclusion… The author of that paper comments on this blog, hopefully he stops by to clarify.

When I say the use of the p-value as a summary statistic is limited, I mean that it is a form of lossy compression of the “effect” and sample-size information. Usually there is no obstacle to simply reporting those values, in which case the p-value doesn’t add anything; it is superfluous.

Like the paper recommends, it would be better to go directly from effect + sample size to a plot of the likelihood. That would be a more enlightening way to look at the same information, while the p-value is a less enlightening way.

> the fact that the p has a 1:1 relationship to likelihood

What do you mean by 1:1 ? If you know the form of the likelihood function you can calculate a p-value, but not the other way around.

The loss of information is mentioned in the paper you linked to: “inevitably the likelihood function provides a more complete depiction of that evidence than the P-value, not only because it shows the strength of evidence on an easily understood scale, but also because it shows how that evidence relates to a continuous range of parameter values.”

>”If you know the form of the likelihood function you can calculate a p-value, but not the other way around.”

It is the other way around, though. A major point of the paper is that p-values index likelihood functions, ie you can make a lookup table of likelihood functions and use p-values as the identifier.

To emphasize, there could be a book with a bunch of pre-made plots of likelihood functions each labeled p=.12, p= .0043, etc. I can tell you I saw p = .324, and you can look up what the corresponding likelihood function looks like in the book.
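The “book” idea can be sketched for the simplest case: a one-sample z-test with sigma = 1 and a single observation. Here the two-sided p-value can be inverted back to |x − theta0|, which pins down the likelihood function (up to reflection about theta0). This is my own minimal illustration, not code from the paper.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1

def z_from_two_sided_p(p):
    # invert a two-sided p-value back to |z| = |x - theta0| (sigma = 1, n = 1)
    return std_normal.inv_cdf(1.0 - p / 2.0)

def likelihood(theta, x):
    # normal likelihood for one observation x with sigma = 1
    return std_normal.pdf(x - theta)

# "look up" the likelihood function indexed by p = 0.05, taking theta0 = 0
z = z_from_two_sided_p(0.05)
print(z)  # roughly 1.96
for theta in (0.0, 1.0, 2.0):
    print(theta, likelihood(theta, z))
```

The lookup works only because the model family is fixed in advance, which is exactly why a separate “book” would be needed per model.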

There’d need to be a separate book (or at least a chapter) for every model, though: “null hypothesis = normal(0,sigma) with sigma known”, “null hypothesis = normal(0,sigma) with sigma unknown”, “null hypothesis = t_distribution(0,scale,n)” for each degree of freedom n, “null hypothesis = chi-squared(n)” for each n, “null hypothesis = uniform random selection from a set of K distinct elements whose values are…”

Very true. This paper only considers t-tests. I am not sure if the observation can be extended beyond that.

Likelihood(parameters, data) + null hypothesis (i.e. parameters corresponding to H0 + sampling distribution under H0) + observed data => p-value

p-value + Likelihood(parameters, data) + null hypothesis (i.e. parameters corresponding to H0 + sampling distribution under H0) => Likelihood(parameters under H0, actual data)

I don’t see how the p-value provides much information unless you already know the form of the likelihood function. And in general you cannot recover the likelihood function corresponding to the data, because there may be multiple possibilities consistent with the p-value.

For example, consider the simple case x=Normal(theta,1) where theta0=0 and you do a two-tailed test. You would get the same p-value when you observe x=1 and x=-1. Even if you can use your “index” to go back from your p-value to the value of the likelihood function at theta0, you wouldn’t know if the likelihood function is Likelihood(theta,1) or Likelihood(theta,-1), because Likelihood(theta0,1)=Likelihood(theta0,-1).
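A quick numerical check of this example (a sketch, with the single-observation normal model from the comment):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal, sigma = 1

def two_sided_p(x, theta0=0.0):
    # two-tailed p-value for one observation x ~ Normal(theta, 1), H0: theta = theta0
    return 2.0 * (1.0 - nd.cdf(abs(x - theta0)))

def likelihood(theta, x):
    # normal likelihood for one observation x with sigma = 1
    return nd.pdf(x - theta)

print(two_sided_p(1.0) == two_sided_p(-1.0))          # True: same p-value
print(likelihood(0.0, 1.0) == likelihood(0.0, -1.0))  # True: same value at theta0
print(likelihood(0.5, 1.0) == likelihood(0.5, -1.0))  # False: the functions differ
```

So the p-value fixes the likelihood at theta0 but not the whole likelihood function, which is the ambiguity described above.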

Fraser takes an observed p-value to just be the ‘statistical location’ of the observed data relative to a model. In the normal case this is just the ‘one-tailed location’ ie observed cdf value. In this case x=-1 and x = 1 are different.

Anyway, there is a lot of literature on converting back and forth between p-values/p-curves and likelihood functions. Typically the goal is to go from likelihood to p-values rather than vice versa (see e.g. https://www.jstor.org/stable/2290557), but given the usual available information it is probably possible to go the other way. In particular, you probably have an idea of what model family you are working within in simple cases (how else would you evaluate the likelihood?).

(And for location models it’s a simple change of variables to relate a sample-space integral, i.e. the p-value, to a parameter-space integral. Then differentiate with respect to the parameter, obviously assuming regularity conditions.)

I’m not sure if you are in agreement or disagreement with what I wrote (or maybe you were just providing additional info). By the way, I didn’t mention the role of the test statistic used to go from data to p-values and back, in the example the value of the likelihood function was used implicitly but it could be something else.

I understand that you can “recover” the observed data [*] from the p-value in some cases [**]. And this means that, in those cases, you can determine the likelihood function (as you would normally do when you have a model and some data). It just doesn’t seem very surprising or interesting to me, I may be missing something.

[*] the sufficient statistic, which contains all the information in the data as far as the model is concerned

[**] when the sufficient statistic is one-dimensional and has a well-behaved bijective mapping to the test statistic used to calculate the p-value

Probably mostly in agreement. Just pointing out that there are interesting and useful relations between p-value functions, likelihood functions, Bayesian survivor functions, posteriors etc etc.

Which ones you report depends on your inferential aims, but it might be helpful to understand the relations between them.

More generally, just make your raw data available and report the model/assumptions/procedures used to analyse the data.

ojm: “More generally, just make your raw data available and report the model/assumptions/procedures used to analyse the data”

I would agree, but rather say must make your raw data available (with limited access if confidential).

I think it was one of Fisher’s mistakes to think that if one was satisfied with the model fit in a given study, one need only report the log-likelihood to other authors – so they could add it to their own study’s log-likelihood.

That is because if you pool the lack of fit across studies you might become uncomfortable with individually conducted assessments of fit.

Keith:

Well put. The likelihood (or, for that matter, the posterior distribution) can be considered a summary of data, but only conditional on the model, which typically has many arbitrary elements.

Keith – yup, good point

Presumably the point is something like: if you know the model and/or what the p-value would be for a range of theta values, then you have more information available to carry out the inversion, i.e. you have derivative information. There will be some loss of information in general, but it may be surprisingly low, and it is probably exact in some cases (e.g. normal). If you also have sample-space derivatives available then you can do even better.