David Allison sends along this juxtaposition:

Press Release: “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time…”

Study: “Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls…”

Allison continues:

In the paper, the body of the text states:

Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls (Community 1: −0.77% change per year, 95% confidence interval [CI] = −2.06 to 0.52, P = 0.240; Community 2: −0.17% change per year, 95% CI = −1.45 to 1.11, P = 0.795).

Yet, the abstract concludes “This multisector intervention was associated with a modest reduction in obesity prevalence among seventh-graders in one community compared to controls . . .”

The publicity also seems to exaggerate the findings, stating, “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time, suggesting that such a comprehensive approach holds promise for the future, according to a new study from Harvard T.H. Chan School of Public Health.”

I have mixed feelings about this one.

On one hand, we shouldn’t be using “p = 0.05” as a cutoff. Just cos a 95% conf interval excludes 0, it doesn’t mean a pattern in data reproduces in the general population; and just cos an interval *includes* 0, it doesn’t mean that nothing’s going on. So, with that in mind, sure, there’s nothing wrong with saying that the intervention “was associated with a modest reduction,” as long as you make clear that there’s uncertainty here, and the data are also consistent with a zero effect or even a modest increase in obesity.

On the other hand, there is a problem here with the degrees of freedom available to researchers and publicists. The 95% interval was [-2.1, 0.5], and this was reported as “a modest reduction” that was “holding promise for the future.” Suppose the 95% confidence interval had gone the other way and had been [-0.5, 2.1]. Would they have reported it as “a modest increase in obesity . . . holding danger for the future”? I doubt it. Rather, I expect they would’ve reported this outcome as a null (the p-value is 0.24, after all!) and gone searching for the positive results.
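As a sanity check when reading papers like this, the reported p-values can be recovered from the confidence intervals. Here’s a minimal sketch, assuming a normal approximation with the standard error backed out from the width of the 95% interval:

```python
import math

def p_from_ci(estimate, lo, hi):
    """Two-sided p-value against zero, assuming a normal approximation.

    The standard error is backed out from the 95% interval width."""
    se = (hi - lo) / (2 * 1.96)
    z = estimate / se
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Community 1: estimate -0.77, 95% CI [-2.06, 0.52]
p1 = p_from_ci(-0.77, -2.06, 0.52)   # about 0.242, vs. the reported 0.240
# Community 2: estimate -0.17, 95% CI [-1.45, 1.11]
p2 = p_from_ci(-0.17, -1.45, 1.11)   # about 0.795, matching the reported 0.795
print(round(p1, 3), round(p2, 3))
```

The small discrepancy for Community 1 (0.242 vs. the reported 0.240) is just rounding in the published interval; the point is that the CI and the p-value carry the same information about the zero-effect hypothesis.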

So there is a problem here, not so much with the reporting of this claim in isolation but with the larger way in which a study produces a big-ass pile of numbers which can then be mined to tell whatever story you want.

**P.S.** Just as a side note: above, I used the awkward but careful phrase “a pattern in data reproduces in the general population” rather than the convenient but vague formulation “the effect is real.”

**P.P.S.** I sent this to John Carlin, who replied:

That’s interesting and an area that I’ve had a bit to do with – basically there are zillions of attempts at interventions like this and none of them seem to work, so my prior distribution would be deflating this effect even more. The other point that occurred to me is that the discussion seems to have focussed entirely on the “time-confounded” before-after effect in the intervention group rather than the randomised(??) comparison with the control group – which looks even weaker.

John wanted to emphasize, though, that he’s not looked at the paper. So his comment represents a general impression, not a specific comment on what was done in this particular research project.

The question is, in a frequentist setting, if you don’t want to focus on p-values only, but also consider noise / uncertainty / standard errors as well – what kind of effects are left to report in a paper? This is not meant as an offence; it’s a serious question for those who try to reflect all the difficulties with statistical inference, but still need to find a wording to describe what they’ve found and want to tell…

Daniel:

There is this idea from over 20 years ago (http://journals.sagepub.com/doi/abs/10.1111/j.1467-9280.1994.tb00281.x), or more generally one can assess the data’s compatibility with a range of parameter values rather than just the zero effect.

In the larger scientific context, a single paper should just be pointing to a later meta-analysis where replication of results over studies can be critically assessed and (given adequate replication) the effects jointly assessed.

While the counternull idea (your Rosenthal & Rubin cite) is interesting, as the estimate gets near the null, so does the counternull. So the counternull has a fatal drawback of being ever less informative as the point estimate approaches the null – which is precisely when we most need an interval estimate to avoid the fallacy of inferring the null because its P-value is big. Consider that when the point estimate equals the null, the counternull equals them both. The counternull only provides a range of values more compatible with the null than the null, and is no substitute for the confidence interval (CI).
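To make the counternull concrete with the numbers from this study (a quick illustration, not from the paper): the counternull is simply the estimate reflected through the null, so with a null of zero it sits at twice the point estimate.

```python
def counternull(estimate, null=0.0):
    """Rosenthal & Rubin's counternull: the value on the far side of the
    estimate that is exactly as compatible with the data as the null."""
    return 2 * estimate - null

# Community 1 estimate from the study: -0.77% change per year.
# The data support -1.54 exactly as well as they support 0.
print(counternull(-0.77))  # -1.54
```

And as the estimate shrinks toward zero, so does the counternull, which is the degeneracy described above: exactly when the point estimate is near the null, the counternull tells you nothing and only an interval estimate helps.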

That said, the CI is far from perfect too. I think the CI should not be called an uncertainty interval because the only uncertainty it captures is the conditional uncertainty about the parameter given certainty about the data-generation model (DGM) from which the CI is computed. Any uncertainties about that model (and there are usually plenty in real examples in health and social sciences) are not captured by the CI, or by the posterior intervals (PI) computed from the same DGM – so both CI and PI are really ‘overconfidence intervals’. I find it easier to address this problem using P-values than interval estimates, simply by recognizing that any observed P value may stem from a model violation to which P is sensitive (e.g., nonrandom selection); that is why small values do not require and thus cannot imply violation of the null, and large values do not require and thus cannot imply truth of the null.

The unfortunate thing about blowing off a result for having too high a p-value is that the sample average *is* the best estimate of the mean, if that’s all the data you have. The unfortunate thing about reporting a noisy result as meaningful is that the statistical uncertainty is so large that it’s hard to attach much significance (in the lay usage of the term) to the exact value of that average.

That leaves something like this: “There is too much statistical uncertainty to be sure, but for what it’s worth, the data for this experiment had a slight positive [or whatever] average. With more data, it might easily turn out to be negative [or whatever] instead.”

That sounds pretty weak, doesn’t it? But it does reflect the state of the data, which was also pretty weak.

Your suggested phrasing sounds a whole lot better (more in touch with the real world) than what is usually done.

+1

‘the sample average *is* the best estimate of the mean’

That’s not really true. It’s not necessarily a bad estimate of the mean, but it’s not an admissible estimate of the mean when n is bigger than 2 ;-) The purpose of the James-Stein estimator was really to show that the sample average isn’t the “best” estimate:

https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator

I think you had a brain fart. It’s not admissible when d > 2. This is a univariate estimate and the sample average is admissible as far as I know.

+1. That’s tacit in the qualification ‘if that’s all the data you have’…

Ah, you’re right. Wald’s theorem still tells us that to choose a point estimate from the admissible class of point-estimator procedures we need to search in the class of Bayesian decision theory solutions (or their equivalents). Choosing the sample mean is a procedure on the boundary of this class, with an improper flat prior. It may technically be admissible, I’m not sure, but the implied prior is rarely what you’d call “reasonable” in any kind of real-world problem.

We’ve been through this before: if you think floating-point numbers are a reasonable approximation to the whole number line for your problem, then the flat prior puts essentially 100% probability mass on the absolute value of your parameter being bigger than something like 10^300. The whole reason that floating-point numbers are a good approximation to the number line for real applied problems is that they extend out to ridiculously large numbers like 10^300 that you’re never going to encounter. So the fact that you’re willing to use floats in computations already implies that you can’t really think that 10^300 is almost surely the size of your parameter.

Strangely, the James-Stein estimator essentially tells you that you can take all the data you have on your problem, and then look up data on two unrelated problems on wikipedia, and then get a better estimate for your problem.

…against a loss function that cares about the sum of squared errors in all of the problems.
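A quick simulation makes both sides of this exchange concrete. This is a sketch, not from the thread: it uses the positive-part James-Stein estimator shrinking toward zero, with unit variances assumed. When the true parameters sit near the shrinkage target, the combined squared-error risk drops well below that of the raw sample values; when they are far apart and far from the target, the shrinkage factor is essentially 1 and nothing really happens:

```python
import random

def james_stein(x, sigma2=1.0):
    # Positive-part James-Stein estimate, shrinking toward zero.
    d = len(x)
    s = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) * sigma2 / s)
    return [shrink * v for v in x]

def risk(theta, estimator, n_sim=20000, seed=42):
    # Average total squared error over repeated draws x_i ~ N(theta_i, 1).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        x = [rng.gauss(t, 1.0) for t in theta]
        est = estimator(x)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / n_sim

raw = list  # the sample values themselves, untouched

near = [0.5, -0.3, 0.2]    # three "problems" with parameters near the target
far = [5.0, 50.0, -70.0]   # three genuinely unrelated, far-apart parameters

print(risk(near, raw), risk(near, james_stein))  # JS clearly wins
print(risk(far, raw), risk(far, james_stein))    # essentially a tie
```

So the estimator dominates in the combined loss, but the gain appears only when the parameters happen to be near the shrinkage target, which is exactly the “unrelated problems” objection raised below.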

Apropos of nothing, I wrote a blog post which poses a question to readers; I’d be interested in your feedback.

Going to take a look now.

Daniel:

I agree. I like to separate statistical bias from bias in what might be the actual population average, if such a thing exists. Methods and estimators do not necessarily provide anything realistic, as this requires thought about whether it makes sense in light of previous information. To me, one of the most unfortunate things I observe among quantitative psychologists is thinking the math and/or simulations have a one-to-one mapping to reality.

Daniel:

No. If the problems are unrelated enough, their parameters will be far enough apart that the Bayesian or so-called James-Stein estimator will do essentially no pooling. The argument you make is a common one in statistics (or, at least, it used to be commonly said thirty years ago) but it’s wrong, for roughly the same reason that it’s wrong to think that the Second Law of Thermodynamics is violated by that little demon who puts the fast molecules in one side of the gate and the slow molecules in the other. If you try to build the demon, you’ll find that he too is subject to the Second Law of Thermodynamics. Similarly, if you try to apply hierarchical modeling using unrelated problems, you’ll find that if you have a flat prior, this will work with probability zero; the “unrelated problems” strategy only works then the parameters are near to each other, which suggests that they are actually related, or else represent prior information.

For example, suppose you’re estimating a parameter that happens to be near the value 5 dollars, and you decide, just for laffs, to estimate this along with estimating the weight of a cat (which happens to weigh 5 pounds) and also a 5-pound steak. If you do this, your inferences will be partially pooled to be near 5 . . . but where did this come from? When evaluating the statistical properties of a method (and that’s a key part of the James-Stein argument, as you’re dealing with expected loss, averaging over some frequency distribution), then you need to average. If you’re always partially pooling your estimates by throwing in external parameters that are ostensibly unrelated but often happen to be very close to your parameter of interest, then this is an assumption that needs to go into the distribution you’re using to define your frequency properties. And if your external parameters are not often very close to your parameter of interest, then your James-Stein estimate won’t do any real partial pooling anyway.

I’d write a paper or give a talk about this, but it doesn’t seem like a problem that people care about anymore, perhaps because of the general understanding that multilevel models work because they make use of real information; they’re not just a mathematical trick.

I think the point is more a mathematical existence issue than a practical tool. Of course we should use real Bayesian information. That’s basically the content of my previous comment about not using a big flat prior for a univariate estimate. The point here of the James-Stein estimator is not that it’s a good method, the point is really that the **COMMON ASSUMPTION** that “the sample average is the best estimate of the real mean” is not mathematically true in any sense.

In a practical sense, the best estimate comes from specifying a real-world prior and a real-world loss function, and doing Bayesian decision theory. But in a mathematical sense, the James-Stein estimator shows that even using basically no information you can still construct an estimator that is technically better. It shouldn’t be surprising that it’s not a lot better, as you’re using basically no information, but it’s still mathematically better, and so the value is in showing that a widely used common assumption is in fact a mistake.

Tom,

There is no such thing as a “best estimate” in statistical theory. Under certain regularity conditions, there are such things as “best estimators”. But, no, the sample mean is nothing like a “best” estimate, not in any mathematical sense.

Welcome to Bayesian statistics, where we have no estimators, only estimates! No confidence either, but lots of credibility.

If researchers were less confident but more credible, wouldn’t we all be a bit better off?

“There is no such thing as a “best estimate” in statistical theory”.

Hairsplitting, guys! I really meant “unbiased”, and these differences wouldn’t change my suggested wording at all. Would they?

Unbiased is true, but that’s very different from “best”. “Best” really implies that you shouldn’t be using any other method to estimate the mean, but in fact mathematically speaking the method you should be using is Bayes with some real-world prior information and real-world utility. That’s more or less what Wald’s theorem was about. Only if it really is quite plausible to you that the mean could be either -10^300 or +10^300 would you use the raw mean usefully.

I actually think that your wording of the statement seems ok, but the statement about the sample average being the best available estimate of the true overall average was a common mistake that then confuses people. “if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?” The fact is, the biased estimate from Bayesian decision theory with informed prior and real-world utility function is overall better, sometimes MUCH better.

“The fact is, the biased estimate from Bayesian decision theory with informed prior and real-world utility function is overall better, sometimes MUCH better.”

Well, yes, *if* you can support that informed prior, and that utility function – they need to be more than just personal opinion. In this case, the one that started this whole conversation, I don’t see anything like that as being supported by what was reported.

“if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?”

Why, precisely to be able to incorporate some other knowledge, preferably actual data. If we had actual data, though, that was of the same kind as the experiment, we could just combine them without that much complication. The complication comes in when you want to bring in other information that isn’t strictly of the same kind: e.g., a prior distribution when all you have from the experiment is one set of points.

Anyway, my comment was about how one might report a very uncertain result, not about technicalities about better estimates. Let’s not lose sight of the real thread here.

Sure ok, just the part about “the best estimate” smacked me between the eyes ;-)

I am generally much less troubled by weak findings described as weak than by the use of the significance filter to characterize large findings as strong. The biggest problem with the former is the incessant desire to blame the weakness on the sample size rather than the weakness of the effect itself. That’s bad, but not horrible. After all, somebody might run a bigger study some day and resolve the uncertainty. A small mean effect that a larger sample finds to cross the 0.05 barrier is still a small mean.

The latter, however, is usually the main result of small-sample papers, generally is what the paper is intended to sell, and actually causes people to waste time on effects that are either Type M or even Type S errors, both of which hold back actual progress.

“A small mean effect that a larger sample finds to cross the 0.05 barrier is still a small mean.”

Andrew says from time to time that the difference in two p-values is not in itself significant. Maybe you (some random reader of this blog, I mean) haven’t thought through the implications of this. It’s possible to show how a p-value threshold is a poor way to evaluate a data set by thinking about the p-value as a statistic itself. The p-value has a uniform distribution, and so it has a very large relative variance. The standard deviation is in the vicinity of 0.3 (where the p-value, of course, is in [0,1]). Your experimental p-value of 0.05 is a statistic. What is its variation? Hmm, 0.05 +/- 0.3! Well, we can’t really go below zero, but never mind.

So any claim that a result has a p-value less than, say, 0.05, is subject to the fact that this result (reaching that 0.05 value) cannot have much statistical significance (judging by the p-value of the p-value, to hoist the thing with its own petard). Maybe the “true” p-value is something else.

We could reduce the s.d. of the p-value from 0.3 down to 0.05 by running (0.3/0.05)^2 = 36 repetitions of the experiment. And even then, the (statistical) significance of the p-value is iffy, being 0.05 +/- 0.05.

This all doesn’t make me very interested in paying much attention to a p-value threshold.

I think I don’t understand this (I don’t get the idea of there being a “true” p-value), but if I somewhat do, doesn’t this prove too much? Suppose the experiment produced not p-value = 0.05 but something tiny like 1e-1000. Well, the standard deviation is still about 0.3, so it’s also not significantly different from 0.05 – is that right? (And if it is, is that a useful thing to say?)

If the NULL is TRUE then the p is a uniform random variable (over repeated data collection). If you get p = 1e-1000 then the null is almost certainly not true.
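This is easy to check by simulation. A sketch, assuming a simple two-sided z-test with known unit variance: under the null the p-values are uniform on [0, 1] with standard deviation 1/sqrt(12) ≈ 0.29, while under a real effect they pile up near zero rather than scattering uniformly.

```python
import math
import random
import statistics

def p_value(z):
    # Two-sided p-value for a z-statistic under a standard normal null.
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)
n = 20000

# Null true: z ~ N(0, 1), so the p-values are uniform on [0, 1].
p_null = [p_value(rng.gauss(0.0, 1.0)) for _ in range(n)]
# Real effect: z ~ N(3, 1), so the p-values concentrate near zero.
p_alt = [p_value(rng.gauss(3.0, 1.0)) for _ in range(n)]

print(statistics.mean(p_null), statistics.pstdev(p_null))  # ~0.5, ~0.29
print(sum(p < 0.05 for p in p_alt) / n)                    # most fall below 0.05
```

So a single tiny p-value is strong evidence against the null being the data-generating model; the “0.05 ± 0.3” framing only describes the world in which the null is true.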

Understood, but that’s precisely what puzzles me.

Tom Passin’s argument seems to be that if the null is true, the p-value of the p-value (as if the first-order p-value is something real, to be estimated) will never have statistical significance. I don’t really know what this means (what is the “real” p-value, even given the null?) but he seems to think it’s worth noting and is (yet another) critique of p-values. But the same argument criticises 1e-1000 just as much as 0.05, so I’m left questioning why I should find this mathematical argument at all damning.

bxg, I believe that Tom Passin is making some incorrect inferences based on the observation that under the null, the distribution of p-values is uniform. As Daniel says, the whole point of the p-value is to reject (or not) the hypothesis that one is sampling from the null. FWIW, a different attempt at studying the ‘meta-distribution’ of p values was done a while back by NNT, but I have doubts about the utility of this approach too (basically assuming a “true p-value”, which is not a construct that makes any sense to me), and don’t have anything to say about its technical accuracy either:

http://fooledbyrandomness.com/pvalues

I think more that the issue is this, suppose you test A vs B and find that A has a “significant” effect p = 0.02 and B has a non-significant effect p = 0.06. The usual followup is:

assume effect A is equal to its sample value, or close to it, and assume effect B is equal to 0. The difference in effect sizes is then A-B = A-0 which is largish… and “A is much better than B”.

But, if instead you tested the idea that A-B = 0 you might easily get p = 0.14 or 0.23 or something, basically there’s no p value based evidence that A-B is different from 0.

Hence the difference between significant (A) and not-significant (B) is not itself statistically significant (A-B=0 has p = 0.2 or whatever)

Deciding to do stuff based on having gotten certain p-values, and particularly based on having gotten *different sides of the threshold for two different treatments*, is not a good way to decide what is or is not true/good/helpful/whatever.
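The arithmetic behind this is worth seeing once. A sketch with made-up but representative numbers, assuming both effects are estimated with equal standard errors: convert each p-value back to a z-statistic, then test the difference.

```python
from statistics import NormalDist

std = NormalDist()

def z_from_p(p):
    # z-statistic implied by a two-sided p-value.
    return std.inv_cdf(1 - p / 2)

def p_from_z(z):
    # Two-sided p-value for a z-statistic.
    return 2 * (1 - std.cdf(abs(z)))

z_a = z_from_p(0.02)  # "significant" effect A
z_b = z_from_p(0.06)  # "non-significant" effect B

# With equal standard errors, the z-statistic for the difference A - B
# is (z_a - z_b) / sqrt(2), and its p-value is nowhere near 0.05.
p_diff = p_from_z((z_a - z_b) / 2 ** 0.5)
print(round(p_diff, 2))  # ~0.75
```

So p = 0.02 vs. p = 0.06 looks like a dramatic contrast, but the difference itself is entirely consistent with noise.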

Agreed bxg – there’s no such thing as the “population” p-value, to be estimated by a sample p-value whose distribution narrows around the true p-value as sampling variability is reduced.

I expect that nearly everyone here is in agreement that decision making should not be based on p-value thresholds, but this argument about an observed p-value being “significantly” different from 0.05 seems like a category error.

The vast majority of null models being tested right this moment have little to no approximation to reality or the researcher hypothesis whatsoever. Hence the associated p-values start out worthless and can only go down from there.

Rather than being about sampling, the “true p-value” more often refers to “the p-value if the person actually tested a null model they thought could be correct”.

It seems a bit absurd that they don’t just report the BMI effect in addition to the proportion crossing the obesity threshold. It would seem important to know how much the relative weight changes. If we redefine the obesity cutoff next year, these numbers become completely useless. OTOH, the relative weight could just be looked at with the new cutoff.

“p=0.24: “Modest improvements” if you want to believe it, a null finding if you don’t.”

To be fair, the Bayesian interpretation is “p = 0.24: Modest improvements if you thought the idea was probable a priori, mostly noise if you thought it wasn’t.”…with an important exception when you have informative priors about nuisance parameters.