The vast majority of null models being tested right this moment bear little to no resemblance to reality or to the researcher’s hypothesis whatsoever. Hence the associated p-values start out worthless and can only go down from there.

Rather than being about sampling, the “true p-value” more often refers to “the p-value if the person actually tested a null model they thought could be correct”.

]]>Agreed bxg – there’s no such thing as the “population” p-value, to be estimated by a sample p-value whose distribution narrows around the true p-value as sampling variability is reduced.

I expect that nearly everyone here is in agreement that decision making should not be based on p-value thresholds, but this argument about an observed p-value being “significantly” different from 0.05 seems like a category error.

]]>I think the issue is more this: suppose you test A vs. B and find that A has a “significant” effect (p = 0.02) and B has a non-significant effect (p = 0.06). The usual followup is:

assume effect A is equal to its sample value, or close to it, and assume effect B is equal to 0. The difference in effect sizes is then A-B = A-0, which is largish… and “A is much better than B”.

But if instead you tested the idea that A-B = 0, you might easily get p = 0.14 or 0.23 or something; basically there’s no p-value-based evidence that A-B is different from 0.

Hence the difference between significant (A) and not significant (B) is not itself statistically significant (A-B = 0 has p = 0.2 or whatever).
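To make that concrete, here’s a quick sketch with made-up z-scores (all the numbers are hypothetical, chosen only so that A lands at p ≈ 0.02 and B at p ≈ 0.06):

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Two treatment effects estimated with the same standard error.
z_a, z_b = 2.33, 1.88        # z-scores for treatments A and B
p_a = two_sided_p(z_a)       # ~0.02 -> "significant"
p_b = two_sided_p(z_b)       # ~0.06 -> "not significant"

# Testing A - B = 0 directly: the difference of two independent estimates
# has standard error larger by a factor of sqrt(2).
z_diff = (z_a - z_b) / math.sqrt(2)
p_diff = two_sided_p(z_diff)  # ~0.75 -> no evidence that A and B differ
print(p_a, p_b, p_diff)
```

So one estimate clears the threshold, the other just misses it, and yet the comparison between them is nowhere near significant.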

Deciding to do stuff based on having gotten certain p-values, and particularly based on having gotten *different sides of the threshold for two different treatments*, is not a good way to decide what is or is not true/good/helpful/whatever.

]]>bxg, I believe that Tom Passin is making some incorrect inferences based on the observation that under the null, the distribution of p-values is uniform. As Daniel says, the whole point of the p-value is to reject (or not) the hypothesis that one is sampling from the null. FWIW, a different attempt at studying the ‘meta-distribution’ of p values was done a while back by NNT, but I have doubts about the utility of this approach too (basically assuming a “true p-value”, which is not a construct that makes any sense to me), and don’t have anything to say about its technical accuracy either:

http://fooledbyrandomness.com/pvalues

Understood, but that’s precisely what puzzles me.

Tom Passin’s argument seems to be that if the null is true, the p-value of the p-value (as if the first-order p-value is something real, to be estimated) will never have statistical significance. I don’t really know what this means (what is the “real” p-value, even given the null?), but he seems to think it’s worth noting and that it is (yet another) critique of p-values. But

the same argument criticises 1e-1000 just as much as 0.05, so I’m left questioning why I should find this mathematical argument at all damning.

If the NULL is TRUE then the p is a uniform random variable (over repeated data collection). If you get p = 1e-1000 then the null is almost certainly not true.
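A quick simulation of that uniformity claim (the test here, a one-sample z-test with known sd, is just an illustrative choice): under a true null, the p-value is Uniform(0, 1), with sd 1/sqrt(12) ≈ 0.289.

```python
import math
import random

random.seed(1)

def z_test_p(xs):
    """Two-sided p-value for H0: mean = 0, with known sd = 1."""
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Repeatedly collect data from the null model and record the p-value.
ps = [z_test_p([random.gauss(0, 1) for _ in range(20)]) for _ in range(20000)]
mean_p = sum(ps) / len(ps)
sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in ps) / len(ps))
print(mean_p, sd_p)   # close to 0.5 and 0.289
```

And of course a p of 1e-1000 would essentially never come out of this uniform distribution, which is the point: you’d conclude the null is not true.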

]]>I think I don’t understand this (I don’t get the idea of there being a “true” p-value), but if I somewhat do, doesn’t this prove too much? Suppose the experiment produced not p = 0.05 but something tiny like 1e-1000. Well, the standard deviation is still about 0.3, so it’s also not “significantly” different from 0.05 – is that right? (And if it is, is that a useful thing to say?)

]]>Sure ok, just the part about “the best estimate” smacked me between the eyes ;-)

]]>“The fact is, the biased estimate from Bayesian decision theory with informed prior and real-world utility function is overall better, sometimes MUCH better.”

Well, yes, *if* you can support that informed prior, and that utility function – they need to be more than just personal opinion. In this case – the one that started this whole conversation – I don’t see anything like that being supported by what was reported.

“if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?”

Why, precisely to be able to incorporate some other knowledge, preferably actual data. If we had actual data, though, that was of the same kind as the experiment, we could just combine them without that much complication. The complication comes in when you want to bring in other information that isn’t strictly of the same kind: e.g., a prior distribution when all you have from the experiment is one set of points.

Anyway, my comment was about how one might report a very uncertain result, not about technicalities about better estimates. Let’s not lose sight of the real thread here.

]]>To be fair, the Bayesian interpretation is “p = 0.24: Modest improvements if you thought the idea was probable a priori, mostly noise if you thought it wasn’t.”…with an important exception when you have informative priors about nuisance parameters.

]]>“Unbiased” is true, but that’s very different from “best”. “Best” really implies that you shouldn’t be using any other method to estimate the mean, but in fact mathematically speaking the method you should be using is Bayes with some real-world prior information and real-world utility. That’s more or less what Wald’s theorem was about. Only if it really is quite plausible to you that the mean could be either -10^300 or +10^300 would you use the raw mean usefully.

I actually think that your wording of the statement seems ok, but the statement about the sample average being the best available estimate of the true overall average was a common mistake that then confuses people. “if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?” The fact is, the biased estimate from Bayesian decision theory with informed prior and real-world utility function is overall better, sometimes MUCH better.
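Here’s a sketch of the “biased but better” claim, under the assumed setup where the informed prior is actually right: true means are drawn from a N(0, 1) prior, the data are noisy, and we compare the raw sample estimate with the posterior mean (which is biased toward the prior):

```python
import random

random.seed(2)
prior_var, data_var = 1.0, 4.0               # assumed known variances
shrink = prior_var / (prior_var + data_var)  # posterior weight on the data

err_mle, err_bayes = 0.0, 0.0
reps = 50000
for _ in range(reps):
    theta = random.gauss(0, prior_var ** 0.5)   # truth, drawn from the prior
    x = random.gauss(theta, data_var ** 0.5)    # one noisy observation
    err_mle += (x - theta) ** 2                 # raw sample estimate
    err_bayes += (shrink * x - theta) ** 2      # posterior mean

risk_mle = err_mle / reps
risk_bayes = err_bayes / reps
print(risk_mle, risk_bayes)   # ~4.0 vs ~0.8
```

The posterior mean is biased for any particular theta, yet its average squared error is a fraction of the unbiased estimate’s – which is exactly what “overall better, sometimes MUCH better” means. The catch, as said above, is that you have to be able to support that prior.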

]]>“A small mean effect that a larger sample finds to cross the 0.05 barrier is still a small mean.”

Andrew says from time to time that the difference between two p-values is not in itself significant. Maybe you (some random reader of this blog, I mean) haven’t thought through the implications of this. It’s possible to show how a p-value threshold is a poor way to evaluate a data set by thinking about the p-value as a statistic itself. Under the null, the p-value has a uniform distribution, and so it has a very large relative variance. The standard deviation is in the vicinity of 0.3 (where the p-value, of course, is in [0,1]). Your experimental p-value of 0.05 is a statistic. What is its variation? Hmm, 0.05 +/- 0.3! Well, we can’t really go below zero, but never mind.

So any claim that a result has a p-value less than, say, 0.05, is subject to the fact that this result (reaching that 0.05 value) cannot have much statistical significance (judging by the p-value of the p-value, to hoist the thing with its own petard). Maybe the “true” p-value is something else.

We could reduce the s.d. of the p-value from 0.3 down to 0.05 by running (0.3/0.05)^2 = 36 repetitions of the experiment. And even then, the (statistical) significance of the p-value is iffy, being 0.05 +/- 0.05.

This all doesn’t make me very interested in paying much attention to a p-value threshold.

]]>“There is no such thing as a “best estimate” in statistical theory”.

Hairsplitting, guys! I really meant “unbiased”, and these differences wouldn’t change my suggested wording at all. Would they?

]]>Daniel:

I agree. I like to separate statistical bias from bias in what might be the actual population average, if such a thing exists. Methods and estimators do not necessarily provide anything realistic, as this requires thought about whether it makes sense in light of previous information. To me, one of the most unfortunate things I observe among quantitative psychologists is thinking the math and/or simulations have a one-to-one mapping to reality.

Going to take a look now.

]]>I think the point is more a mathematical existence issue than a practical tool. Of course we should use real Bayesian information. That’s basically the content of my previous comment about not using a big flat prior for a univariate estimate. The point here of the James-Stein estimator is not that it’s a good method, the point is really that the **COMMON ASSUMPTION** that “the sample average is the best estimate of the real mean” is not mathematically true in any sense.

In a practical sense, the best estimate comes from specifying a real-world prior and a real-world loss function, and doing Bayesian decision theory. But in a mathematical sense, the James-Stein estimator shows that even using basically no information you can still construct an estimator that is technically better. It shouldn’t be surprising that it’s not a lot better, as you’re using basically no information, but it’s still mathematically better, and so the value is in showing that a widely used common assumption is in fact a mistake.
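A minimal sketch of that existence claim (the dimension and the true mean vector here are arbitrary, picked only for illustration): estimate a 10-dimensional mean from one multivariate normal draw, comparing the raw observation with the positive-part James-Stein shrinkage estimate.

```python
import random

random.seed(3)
d, reps = 10, 20000
theta = [2.0] * d                         # arbitrary true mean vector

risk_raw = risk_js = 0.0
for _ in range(reps):
    x = [random.gauss(t, 1) for t in theta]
    s = sum(v * v for v in x)
    factor = max(0.0, 1 - (d - 2) / s)    # positive-part JS shrinkage toward 0
    risk_raw += sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
    risk_js += sum((factor * xi - ti) ** 2 for xi, ti in zip(x, theta))

avg_raw = risk_raw / reps
avg_js = risk_js / reps
print(avg_raw, avg_js)   # raw risk ~ d = 10; JS risk strictly smaller
```

The improvement is modest, exactly as you’d expect from an estimator that uses essentially no information, but it is there, and that’s the mathematical point.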

]]>If researchers were less confident but more credible, wouldn’t we all be a bit better off?

]]>Daniel:

No. If the problems are unrelated enough, their parameters will be far enough apart that the Bayesian or so-called James-Stein estimator will do essentially no pooling. The argument you make is a common one in statistics (or, at least, it used to be commonly said thirty years ago) but it’s wrong, for roughly the same reason that it’s wrong to think that the Second Law of Thermodynamics is violated by that little demon who puts the fast molecules in one side of the gate and the slow molecules in the other. If you try to build the demon, you’ll find that he too is subject to the Second Law of Thermodynamics. Similarly, if you try to apply hierarchical modeling using unrelated problems, you’ll find that if you have a flat prior, this will work with probability zero; the “unrelated problems” strategy only works when the parameters are near to each other, which suggests that they are actually related, or else represent prior information.

For example, suppose you’re estimating a parameter that happens to be near the value 5 dollars, and you decide, just for laffs, to estimate this along with estimating the weight of a cat (which happens to weigh 5 pounds) and also a 5-pound steak. If you do this, your inferences will be partially pooled to be near 5 . . . but where did this come from? When evaluating the statistical properties of a method (and that’s a key part of the James-Stein argument, as you’re dealing with expected loss, averaging over some frequency distribution), then you need to average. If you’re always partially pooling your estimates by throwing in external parameters that are ostensibly unrelated but often happen to be very close to your parameter of interest, then this is an assumption that needs to go into the distribution you’re using to define your frequency properties. And if your external parameters are *not* often very close to your parameter of interest, then your James-Stein estimate won’t do any real partial pooling anyway.
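A small numeric illustration of the “no pooling when the problems are far apart” point, using the shrink-toward-the-grand-mean variant of James-Stein (the Efron-Morris form, with factor 1 − (d − 3)/Σ(xᵢ − x̄)²); all the numbers are made up:

```python
def shrink_factor(xs):
    """Efron-Morris shrinkage factor toward the grand mean (positive part)."""
    d = len(xs)
    xbar = sum(xs) / d
    s = sum((x - xbar) ** 2 for x in xs)
    return max(0.0, 1 - (d - 3) / s)

close = [4.1, 5.3, 5.9, 4.8]        # estimates that happen to sit near 5
far = [5.2, 1000.3, -999.1, 498.2]  # genuinely unrelated quantities

print(shrink_factor(close))  # ~0.43: substantial pooling toward the mean
print(shrink_factor(far))    # ~1.0: essentially no pooling at all
```

So the dollars/cat/steak trick only “works” when the numbers happen to be close – which is itself information smuggled in through the back door.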

I’d write a paper or give a talk about this, but it doesn’t seem like a problem that people care about anymore, perhaps because of the general understanding that multilevel models work because they make use of real information; they’re not just a mathematical trick.

]]>Welcome to Bayesian statistics, where we have no estimators, only estimates! No confidence either, but lots of credibility.

]]>…against a loss function that cares about the sum of squared errors in all of the problems.

Apropos of nothing, I wrote a blog post which poses a question to readers; I’d be interested in your feedback.

]]>While the counternull idea (your Rosenthal & Rubin cite) is interesting, as the estimate gets near the null, so does the counternull. So the counternull has a fatal drawback of being ever less informative as the point estimate approaches the null – which is precisely when we most need an interval estimate to avoid the fallacy of inferring the null because its P-value is big. Consider that when the point estimate equals the null, the counternull equals them both. The counternull only provides a range of values more compatible with the data than the null, and is no substitute for the confidence interval (CI).
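As I understand the Rosenthal & Rubin construction, for a symmetric (e.g., normal) estimator the counternull is simply the value on the far side of the point estimate that the data support exactly as well as the null, i.e., 2 × estimate − null, so the collapse onto the null is immediate:

```python
def counternull(estimate, null=0.0):
    """Counternull for a symmetric estimator: mirror the null
    through the point estimate."""
    return 2 * estimate - null

for est in [1.0, 0.5, 0.1, 0.0]:
    print(est, counternull(est))
# 1.0 -> 2.0, 0.5 -> 1.0, 0.1 -> 0.2, 0.0 -> 0.0
# (when the estimate equals the null, the counternull equals them both)
```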

That said, the CI is far from perfect too. I think the CI should not be called an uncertainty interval, because the only uncertainty it captures is the conditional uncertainty about the parameter given certainty about the data-generation model (DGM) from which the CI is computed. Any uncertainties about that model (and there are usually plenty in real examples in the health and social sciences) are not captured by the CI, or by the posterior intervals (PI) computed from the same DGM – so both CI and PI are really ‘overconfidence intervals’. I find it easier to address this problem using P-values than interval estimates, simply by recognizing that any observed P-value may stem from a model violation to which P is sensitive (e.g., nonrandom selection); that is why small values do not require and thus cannot imply violation of the null, and large values do not require and thus cannot imply truth of the null.

]]>Strangely, the James-Stein estimator essentially tells you that you can take all the data you have on your problem, and then look up data on two unrelated problems on wikipedia, and then get a better estimate for your problem.

]]>Ah, you’re right. Wald’s theorem still tells us that to choose a point estimate from the admissible class of point estimator procedures we need to search in the class of Bayesian decision theory solutions (or their equivalent). Choosing the sample mean is a procedure on the boundary of this class, with an improper flat prior. It may technically be admissible I’m not sure, but the implied prior is rarely what you’d call “reasonable” in any kind of real world problem. We’ve been through this before, if you think floating point numbers are a reasonable approximation to the whole number line for your problem, then the flat prior puts essentially 100% probability mass on the absolute value of your parameter being bigger than something like 10^300. The whole reason that floating point numbers are a good approximation to the number line for real applied problems is that they extend out to ridiculously large numbers like 10^300 that you’re never going to encounter. So the fact that you’re willing to use floats in computations already implies that you can’t really think that 10^300 is almost surely the size of your parameter.
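A back-of-the-envelope version of that flat-prior point (the ranges here are just the rough double-precision scale, ~1e308, and a deliberately generous “plausible” range):

```python
FLOAT_SCALE = 1e308   # roughly the largest double-precision magnitude
PLAUSIBLE = 1e6       # a generous real-world range for most parameters

# Under a flat prior over the representable floats, the fraction of prior
# mass landing anywhere you'd actually consider plausible:
mass_in_plausible_range = PLAUSIBLE / FLOAT_SCALE
print(mass_in_plausible_range)   # ~1e-302
```

So the flat prior puts essentially 100% of its mass on absurdly large values – which is the sense in which nobody actually believes it.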

]]>+1. That’s tacit in the qualification ‘if that’s all the data you have’…

]]>Tom,

There is no such thing as a “best estimate” in statistical theory. Under certain regularity conditions, there are such things as “best estimators”. But, no, the sample mean is nothing like a “best” estimate, not in any mathematical sense.

]]>I think you had a brain fart. It’s not admissible when d > 2. This is a univariate estimate and the sample average is admissible as far as I know.

]]>‘the sample average *is* the best estimate of the mean’

That’s not really true; it’s not necessarily a bad estimate of the mean, but it’s not an admissible estimate of the mean when n is bigger than 2 ;-) The purpose of the James-Stein estimator was really to show that the sample average isn’t the “best” estimate.

]]>+1

]]>Your suggested phrasing sounds a whole lot better (more in touch with the real world) than what is usually done.

]]>The latter, however, is usually the main result of small-sample papers and generally is what the paper is intended to sell; it actually causes people to waste time on effects that are either Type M or even Type S errors, both of which hold back actual progress.

]]>Daniel:

There is this idea from over 20 years ago http://journals.sagepub.com/doi/abs/10.1111/j.1467-9280.1994.tb00281.x or more generally assess the data’s compatibility with a range of parameters values rather than just the zero effect.

In the larger scientific context, a single paper should just be pointing to a later meta-analysis where replication of results over studies can be critically assessed and (given adequate replication) the effects jointly assessed.

]]>That leaves something like this: “There is too much statistical uncertainty to be sure, but for what it’s worth, the data for this experiment had a slight positive [or whatever] average. With more data, it might easily turn out to be negative [or whatever] instead.”

That sounds pretty weak, doesn’t it? But it does reflect the state of the data, which was also pretty weak.

]]>