Well put when you say that what we really want to know is, “Should we act as if our hypothesis is true?”

But exactly wrong when you say the p-value is “great” for that purpose. It is very easy to describe two situations with identical p-values but where the sensible person would confidently reject the null for situation 1 and confidently hold on to the null in situation 2. The p-value can never tell you how certain you should be that any particular hypothesis is true or false.
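One way to make those two situations concrete is to pair the same p-value with different prior plausibilities. The sketch below (Python, with made-up prior probabilities) uses the Sellke–Bayarri–Berger bound, which says the Bayes factor in favor of the null is at least −e·p·ln(p) for p < 1/e, so it yields the *smallest* posterior probability of the null compatible with a given p-value:

```python
import math

def min_posterior_null(p_value, prior_null):
    """Smallest posterior probability of the null compatible with a
    two-sided p-value, via the Sellke-Bayarri-Berger bound
    BF(null) >= -e * p * ln(p), valid for p < 1/e."""
    bf_null = -math.e * p_value * math.log(p_value)  # lower bound on BF for the null
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf_null
    return post_odds / (1 + post_odds)

p = 0.04  # the same p-value in both situations
print(min_posterior_null(p, prior_null=0.50))  # plausible hypothesis: ~0.26
print(min_posterior_null(p, prior_null=0.99))  # long-shot hypothesis: ~0.97
```

With p = 0.04 in both cases, the null stays above 95% probable for the long-shot hypothesis but drops to roughly a quarter for the plausible one: identical p-values, opposite sensible conclusions.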

Tim

]]>It’s a worrying thought that among the many papers published in psycholinguistics, there may be literally nothing there. I think this is OK, if we are open about the fact that the purpose of publishing papers is not actually to do science, but to get jobs in academia. Because the number and journal names of publications are what we look at to decide whom to hire. I have strong evidence (p<0.05, because why the hell not) that people don’t even read beyond the title of the paper they cite. They don’t even read the author names fully once they recognize a brand-name researcher, or think they have recognized a name they know.

]]>“””

One potential criticism of our findings is that our question is essentially a trick question: researchers clearly know that 8.2 is greater than 7.5, but they might perceive that asking whether 8.2 is greater than 7.5 is too easy a question and hence they focus on whether the difference is statistically significant. However, asking whether a p-value of 0.27 is statistically significant is also trivial, so this criticism does not resolve why researchers focus on the statistical significance of the difference rather than on the difference itself. A related potential criticism regards our question as a trick question for a different reason: by including a p-value, we naturally lead researchers to focus on statistical significance. However, this is essentially our point: researchers are so trained to focus on statistical significance that the mere presence of a p-value leads them to automatically view everything through the lens of the NHST paradigm even when it is not warranted. Moreover, in further response to such criticisms, we note that we stopped just short of explicitly telling participants that we were asking for a description of the observed data rather than asking them to make a statistical inference (e.g., response options read, “Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B” and similarly; emphasis added).

“””

Sure, but in the absence of a stable process (or a stable component in the process), the prospects of successfully going beyond description are bleak.

]]>Marginally Significant Effects as Evidence for Hypotheses

Changing Attitudes Over Four Decades

Some effects are statistically significant. Other effects do not reach the threshold of statistical significance and are sometimes described as “marginally significant” or as “approaching significance.” Although the concept of marginal significance is widely deployed in academic psychology, there has been very little systematic examination of psychologists’ attitudes toward these effects. Here, we report an observational study in which we investigated psychologists’ attitudes concerning marginal significance by examining their language in over 1,500 articles published in top-tier cognitive, developmental, and social psychology journals. We observed a large change over the course of four decades in psychologists’ tendency to describe a p value as marginally significant, and overall rates of use appear to differ across subfields. We discuss possible explanations for these findings, as well as their implications for psychological research.

]]>[1] Related to this idea of shocking students, I’ve actually been working on an early-weeks lesson about the median that looks at all the options in quantile() in R: not really the math of them, just the fact that they exist.
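For anyone curious what that lesson looks like outside R, Python’s standard library exposes the same ambiguity in miniature: statistics.quantiles() implements two of the nine conventions that R’s quantile() offers through its type= argument. A minimal sketch:

```python
import statistics

data = list(range(1, 11))  # 1, 2, ..., 10

# method="exclusive" matches R's quantile() type 6;
# method="inclusive" matches R's default, type 7.
excl = statistics.quantiles(data, n=4, method="exclusive")
incl = statistics.quantiles(data, n=4, method="inclusive")
print(excl)  # [2.75, 5.5, 8.25]
print(incl)  # [3.25, 5.5, 7.75] -- same data, different quartiles
```

Same ten numbers, two defensible sets of quartiles: exactly the “wait, the median’s neighbors aren’t unique?” shock the lesson is after.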

]]>Examples of logical usage given this definition:

1) from 1000 satellite photos of foliated ground-cover in a certain area, fit a model for the frequency distribution of IR intensity in a certain band (a measure of foliage). Then, if you get a new photo, you can ask whether the amount of foliage is outside the range of what you’d expect so that you can detect where lakes, developed land, or recent forest fires occurred.

2) from a continuous recording seismometer record for a year, select one thousand 10 second intervals at random. Assuming moderate sized or larger earthquakes occur infrequently (on the order of once every year), all of the 10 second intervals will be noise. Fit a frequency distribution to the intensity of the signal in each interval (say sum of squared acceleration in a certain frequency band). Now, for every 10 second interval in the whole record determine a p value relative to this fitted background noise, and flag all intervals with p < 0.0001 as unusual and needing further study. Now, instead of intensely signal-processing 3.2 million intervals per year, you can signal-process 320. Adjust your p value as needed to allow you to detect smaller events.

3) From a random selection of 1000 credit-card owners and 10000 non-fraudulent transactions, evaluate a proposed fraud-risk score function F. Now, from a hand-picked selection of fraudulent transactions, evaluate F. If F under known fraud transactions routinely has a low p value under the non-fraudulent reference distribution, start using p(F) as a measure of whether to flag transactions for analysis of risk by a human.
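Example 2 can be sketched in a few lines of Python. The distributions, event sizes, and threshold below are invented for illustration; the point is only the screening logic: fit a background from intervals assumed to be noise, then flag every interval whose empirical p-value against that background is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background model: intensities of 1000 randomly chosen 10-second
# intervals, assumed to be pure noise (the gamma parameters are made up).
background = rng.gamma(shape=2.0, scale=1.0, size=1000)

# Full record: mostly noise, plus five injected "events" with much
# larger intensity, appended at the end for illustration.
record = np.concatenate([rng.gamma(2.0, 1.0, size=100_000),
                         rng.gamma(2.0, 1.0, size=5) + 30.0])

# Empirical p-value of each interval against the background:
# the fraction of background intervals at least as intense.
bg = np.sort(background)
ge_counts = len(bg) - np.searchsorted(bg, record, side="left")
p_values = (ge_counts + 1) / (len(bg) + 1)

flagged = int(np.sum(p_values < 0.001))
print(flagged)  # only this many intervals need expensive signal processing
```

With 1000 background intervals, p < 0.001 can only happen for intervals exceeding everything in the background, so the 100,005-interval record collapses to a few hundred candidates at most, including all five injected events.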

In the absence of a meaningful stable repetitive process and a null model fit to data from that process, a p value is logically useless. In all other cases, logic dictates Bayesian analysis of a model to understand what is going on.

Just teach the truth, not the cargo cult.

]]>As helpful as a correct nominal definition of p-values and an ability to discern what is and is not a p-value may be, arguably the critical conceptualization involves what to make of a p-value, and thereby in turn, the study under consideration.

Even correct nominal definitions are hard to get agreement on. The American Statistical Association’s recent statement on p-values drew 20 published comments. Some supported a version of “the null hypothesis is that the treatment and the placebo produce the same effect,” but more argued for what I believe is a more purposeful version: one that also includes a myriad of background assumptions (e.g., random assignment, lack of informative dropout, pre-specification rather than cherry-picking of outcomes, etc.). Essentially, everything that is required so that the distribution of p-values when “the treatment and the placebo produce the same effect” is (technically) known to be the Uniform(0,1) distribution. Without that last bit of knowledge, no one could sensibly know what to make of a p-value. Even with it, it is very hard, if not debatable.
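That last requirement can be checked by simulation. In the sketch below (sample size made up, effect exactly zero, and every auxiliary assumption satisfied by construction, since the data really are iid Normal with known variance), the p-values come out Uniform(0,1):

```python
import math, random

random.seed(1)

def two_sided_p(z):
    # Two-sided p-value of a standard-normal z statistic.
    return math.erfc(abs(z) / math.sqrt(2))

# Many null experiments: treatment and placebo identical, n = 50 per arm,
# unit-variance Normal outcomes, so all background assumptions hold.
n, sims = 50, 20_000
pvals = []
for _ in range(sims):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    pvals.append(two_sided_p(z))

frac_05 = sum(p < 0.05 for p in pvals) / sims
frac_50 = sum(p < 0.50 for p in pvals) / sims
print(frac_05, frac_50)  # both near nominal: ~0.05 and ~0.50
```

Break any one background assumption (dependence, informative dropout, cherry-picked outcomes) and this uniformity, and with it the interpretability of the p-value, is gone.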

As for those outside the debate (practicing statisticians and others), perhaps the best sense for them of what to make of a p-value (and thereby, in turn, the study under consideration) would be simply this: if it’s less than 0.05, folks are likely somewhat over-excited about the study; if it’s greater than 0.05, likely somewhat overly dismissive of it. Then perhaps they can better focus on all the other issues involved in understanding studies, or at least not overlook them.

I see Andrew’s suggestion below as focusing on something else that has a seemingly simpler nominal definition and more easily discerned instances. What to make of it, and in turn what it means for the study under consideration, is still very hard – but I think students are less likely to get hung up and confused on it and can focus better on other issues for understanding studies.

Now for a seemingly needed qualification: what if the distribution is highly skewed (there is an implicit need for approximate Normality), or the estimate is confounded and hence biased? Then the chance that the estimate is within 2 standard errors of the true value can be anywhere from 0% to 100% (unless the standard errors are really small compared to the bias, in which case it is close to 0%).

]]>I hope this one was meant as a joke.

About p-values, I think that p-values suddenly become very plausible as a tool for decision making if we repeatedly run an experiment and steadily get low p-values. However, we could have determined this without using p-values, via what is called the secret weapon in the Gelman and Hill book (in a footnote, IIRC!). Just plot the means and CIs under repeated runs.
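A text-only sketch of that secret-weapon display, with an invented true effect of 0.5 and five replications of n = 40 each:

```python
import math, random

random.seed(7)

# Hypothetical setup: true effect 0.5, five independent replications, n = 40.
true_effect, n = 0.5, 40
intervals = []
for run in range(1, 6):
    xs = [random.gauss(true_effect, 1.0) for _ in range(n)]
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    se = sd / math.sqrt(n)
    intervals.append((mean - 2 * se, mean + 2 * se))
    print(f"run {run}: mean {mean:+.2f}, 95% CI [{mean - 2*se:+.2f}, {mean + 2*se:+.2f}]")
```

When the replicated intervals keep landing on the same side of zero, the case is made at a glance, with no p-value in sight.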

]]>I am not a computer scientist by training so I don’t know if that is true. What I do know, however, is that people have routinely made such comments by underestimating human ingenuity. We don’t know how bad things can get unless we have enough people who are incentivized enough to try. I am, therefore, a bit skeptical about such claims.

Furthermore, the people who work at Microsoft are without a doubt very clever and very good at what they do. Probably they are just as clever as those who work to develop Linux and/or Apple’s OS. I, therefore, doubt stories that there is an inherent design flaw in Windows that cannot be compensated for. Moreover I doubt that this design flaw is absent in Linux. All I know is that people have much more of an incentive to look for such flaws in Windows than they do in Linux.

The same story applies to rival methods in statistics.

]]>Even if we all used Linux & all the virus writers in the world focused on writing Linux viruses things will never get as bad because the architecture of the system inherently contains most of the damage a virus can inflict.

Ergo, all designs / strategies are not equally robust.

]]>A solution to the problem of dichotomisation of P-values will not be available until researchers are coaxed into a mindset where the evidential meaning of the results is evaluated as part of the inferential process. The conventional dichotomisation precludes any proper evaluation of the evidence in the data. Selling the idea that the evidence is important should not be too difficult.

]]>I don’t think these kids should be taught p-values at all. But I did teach that there’s a 95% chance that the estimate is within 2 standard errors of the true value.
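Whether to teach that statement is itself debated, since it describes the procedure over repeated samples rather than any single interval; a quick simulation (truth, sd, and n are made-up numbers) shows the repeated-sampling property it does have:

```python
import math, random

random.seed(3)

# Across repeated samples, how often does the estimate land within
# 2 standard errors of the true value? (Hypothetical truth = 10,
# sd = 2, n = 25; sd treated as known to keep the sketch simple.)
truth, sd, n, sims = 10.0, 2.0, 25, 10_000
covered = 0
for _ in range(sims):
    xs = [random.gauss(truth, sd) for _ in range(n)]
    mean = sum(xs) / n
    se = sd / math.sqrt(n)
    covered += abs(mean - truth) <= 2 * se
coverage = covered / sims
print(coverage)  # ~0.95
```

The probability statement belongs to the estimation procedure, not to any one computed interval, which is exactly the subtlety that makes the classroom shorthand contentious.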

]]>So, even agreeing with all the above, I am still thinking about how to explain, in terms understandable to that young student, what a p-value is and what you do with it. Feeding them many of the recommendations of the paper (i.e., a good dose of Bayes) will not do the trick. We have professional researchers above debating truth and consequences without full agreement. Thoughts?

]]>Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108-141, 173-180.

Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Erlbaum.

http://meehl.umn.edu/recordings/philosophical-psychology-1989

Also, LL Thurstone and Harold Gulliksen are great examples of how to approach psych problems without NHST:

Thurstone, L. L. (1930). The learning function. The Journal of General Psychology, 3, 469–493. doi:10.1080/00221309.1930.9918225

Thurstone, L. L. (1933). The error function in maze learning. The Journal of General Psychology, 9, 288–301. doi:10.1080/00221309.1933.9920938

Gulliksen, H. (1934). A rational equation of the learning curve based on Thorndike’s Law of Effect. The Journal of General Psychology, 11, 395–434.

Thurstone, L. L. (1937). Psychology as a quantitative rational science. Science, 85(2201), 227–232. http://www.jstor.org/stable/1662685

Gulliksen, H. (1953). A generalization of Thurstone’s learning function. Psychometrika, 18, 297. doi:10.1007/BF02289265

Gulliksen, H. (1959). Mathematical solutions for psychological problems. American Scientist, 47(2), 178–201. http://www.jstor.org/stable/27827302

That should show you there is a light at the end of the NHST doomcave at least…

]]>I don’t dispute that those methods can do a lot. However, p-values have been very useful when used properly and seem to have led to some advancements in science. For example, the Higgs boson, if I recall correctly, was announced as discovered because the p-value in an experiment was below a certain threshold. I think that this is an example of the p-value being a very useful tool.

]]>It is, however, still the case that the randomization test tests the hypothesis of “no difference in distribution detectable by test statistic foo,” not “the strong null of no effect on any individual.”

To make this mathier, suppose you have a measured finite population of vectors A with distribution D(A) and a similar population B, both drawn using an RNG from a super-population U such that A + B = U, where + is taken to be the union of the two sets of measurements.

We have some test statistic t(D(Foo)) which takes a finite population of measurements Foo with distribution D(Foo) and turns it into a number. Examples are “take the average,” “take the median,” “average f(x) across the x,” “take the inter-quartile range,” or “calculate the sum of squares.” The randomization test tests the following:

t(D(A)) is / is not outside the 95%tile bound of t(D(Resample(A+B))) for some allowable set of resampling possibilities.

But that is very different from “there is no effect on any individual element of A.” Whenever the effect on A is such that a moderate number of Resample(A+B) populations have the same t value, you will not be able to detect the difference, even though from a practical perspective the effects could be very significant for the members of A, such as the wealth-transfer example where the only information incorporated into t is the final bank-account balances. (Though I agree, not when t incorporates both clothing choice and final account balance and there is a causal connection between clothing choice and pre-balance, for example.)
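Here is a minimal Python sketch of that wealth-transfer case (population sizes and distributions invented). The swap changes every individual in A yet leaves the multiset of balances, and hence D(A) and every statistic t(D(A)), exactly as before, so a distribution-based randomization test has literally nothing to detect:

```python
import random, statistics

random.seed(5)

n = 200
A = [random.lognormvariate(0, 1) for _ in range(n)]  # balances, treatment arm
B = [random.lognormvariate(0, 1) for _ in range(n)]  # balances, placebo arm

# "Treatment": randomly pair up members of A and swap their balances.
idx = list(range(n))
random.shuffle(idx)
treated = A[:]
for i in range(0, n, 2):
    j, k = idx[i], idx[i + 1]
    treated[j], treated[k] = treated[k], treated[j]

changed = sum(t != a for t, a in zip(treated, A))
print(changed)                       # every individual's balance changed
print(sorted(treated) == sorted(A))  # True: the distribution is untouched

# Any statistic of the two distributions is therefore the same with or
# without the treatment, e.g. the difference in means:
t_with = statistics.fmean(treated) - statistics.fmean(B)
t_without = statistics.fmean(A) - statistics.fmean(B)
print(abs(t_with - t_without))  # essentially zero
```

Huge individual effects, zero distributional signal: exactly the gap between the detectable null and the strong null.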

]]>I never said that other methods are immune from problems, nor do I believe such a thing. What I do think is that p-values, even when they’re appropriate, don’t do much. Even when used properly, I don’t think p-values are so useful. And, yes, all methods can be abused, but some methods are more useful than others. There’s a reason why I wrote a book on Bayesian data analysis and a book on regression and multilevel models, and the reason is that these methods can do a lot!

]]>Do you genuinely think that other methods are for some reason immune from this problem once they’re adopted widely? It used to be the case, or so I am told, that Linux and Apple’s OS were relatively virus free. Now that these operating systems are more widely adopted, more viruses are being made for them. I am sure that if Bayesian methods are adopted more widely, similar problems will appear. The problem seems to be the incentives, not the methods.

I’ll immediately admit though that I have very little knowledge of Bayesian methods. I am sure, however, that you’d be able to game the system using these methods if you’d want to. Surely, it’s not the first human invented technology that’s both idiot-proof and that cannot be used by knaves for their purposes?

What I however meant by that the p-value is a useful tool, is that it is a useful tool if used properly.

]]>This is only true when the sample size is very large. When the sample size is small, the opposite is true, statistical significance means that the observed effect size must be big enough to get outside the confidence bounds, and with small sample size the confidence bounds are large, hence if there is a spurious noisy result, it will appear large. This is what Andrew calls “Type M” errors.
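A quick simulation makes the Type M point vivid. With an invented small true effect (0.1 sd) and small samples (n = 20 per arm, known unit variance for simplicity), the estimates that clear the significance bar systematically overstate the truth:

```python
import math, random

random.seed(2)

# Made-up setting: true effect 0.1 sd, n = 20 per arm, unit variance known.
true_effect, n, sims = 0.1, 20, 20_000
significant = []
for _ in range(sims):
    a = [random.gauss(true_effect, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    est = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)
    if abs(est) > 2 * se:  # "statistically significant"
        significant.append(abs(est))

exaggeration = (sum(significant) / len(significant)) / true_effect
print(exaggeration)  # significant estimates overstate the true effect several-fold
```

To be significant at all here, an estimate must exceed 2 × se ≈ 0.63, more than six times the true effect of 0.1, so conditioning on significance guarantees gross exaggeration.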

]]>I think what I proposed is quite standard. You have a treatment group A and a control group B. You would like to know if the treatment has an effect on, say, redistribution of wealth. You somehow measure the thing in each group; a very simple model may be to look at the correlation between clothes (a proxy for pre-treatment wealth) and accounts (a proxy for post-treatment wealth). Lower correlation, higher redistribution. Or you estimate the “natural” wealth and compare to the “actual” wealth, whatever. Quite often, the measure will be different for group A and group B. To see how much this difference is suggestive of a real effect, you assume there is no effect and no difference between the treatment group and the control group. You are then allowed to exchange the labels freely. You randomise, recalculate your statistic, and look at where your outcome falls in the distribution under the null hypothesis.

If your point is that it works only because I’m looking at a relevant statistic that might actually give information about the effect I’m trying to detect, then you are not wrong. We are in complete agreement, I could do much worse and try to detect redistribution of wealth measuring something unrelated.
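A sketch of that procedure in Python (all numbers invented): measure the clothes–accounts correlation in each group, take the control-minus-treatment difference as the test statistic, then shuffle the A/B labels to build its null distribution.

```python
import random

random.seed(11)

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

# Hypothetical data: in the treatment group the clothes-accounts link
# is weakened (redistribution happened); in control it stays strong.
n = 100
clothes_A = [random.gauss(0, 1) for _ in range(n)]
accounts_A = [0.3 * c + random.gauss(0, 1.0) for c in clothes_A]   # weak link
clothes_B = [random.gauss(0, 1) for _ in range(n)]
accounts_B = [0.8 * c + random.gauss(0, 0.6) for c in clothes_B]   # strong link

def stat(pairs_A, pairs_B):
    # control correlation minus treatment correlation
    return corr(*zip(*pairs_B)) - corr(*zip(*pairs_A))

obs = stat(list(zip(clothes_A, accounts_A)), list(zip(clothes_B, accounts_B)))

# Null hypothesis: no difference between groups, so labels are exchangeable.
all_pairs = list(zip(clothes_A, accounts_A)) + list(zip(clothes_B, accounts_B))
reps, count = 2000, 0
for _ in range(reps):
    random.shuffle(all_pairs)
    count += stat(all_pairs[:n], all_pairs[n:]) >= obs
p = (count + 1) / (reps + 1)
print(p)  # small: label shuffling rarely reproduces so large a gap
```

Note the statistic shuffles (clothes, accounts) pairs together, so the within-person link is preserved and only the group labels are randomized, which is what makes the exchange valid under the null.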

]]>No, the sample effect size can be a terrible, terrible proxy for the population effect size. See here.

]]>I’ve seen p-values be useful in some settings. But, speaking pragmatically, I wouldn’t call them *very* useful.

So, if you’re saying that you can put a causal model and a randomization test together to get secret sauce, then yes, that’s fine, but without the causal model, the randomization test doesn’t solve the problem of figuring out the *individual effects* because you need to do the estimation of the unobserved stuff using the causal model.

]]>Has your opinion on p-values shifted so much in the past four years, or am I misinterpreting something?

]]>Of course, this presumes that we have properly controlled for various biases. Most of the projects I work with have been retrospective, rarely nicely randomized. I’ve seen a lot of stuff published with retrospective data where the groups were not particularly well matched and no attempt was made to control for the biases — this sort of thing is disturbing, as a lot of MDs assume anything that gets through peer-review is correct, and practice accordingly.

]]>What is the effect of giving a drug on the concentration of hormone X that unbeknown to us affects different versions of the hormone receptor differently?

What is the effect of some wealth-transfer policy on various measures of well-being of a population in a county?

]]>and that is the point I’m trying to make above, because that is what “the individual effect of the treatment” means.

]]>Now, the simplest case of Fisher’s strict null is when there are two groups and the outcomes are binary: Fisher’s exact test, which is just an excuse to point people to a really neat poster that Andrew may have forgotten about.

http://statmodeling.stat.columbia.edu/2007/08/21/ken_rice_on_con/

]]>‘For a small experiment, the use of randomization is simple. We decide in advance which assignments (of experimental units to treatments) are acceptable and what response is to be studied. We choose one of the acceptable assignments “at random”–where the choice can involve a neutral umpire– we conduct the corresponding experiment, we analyze the single set of responses once for each acceptable assignment (analyzing “as if” that assignment had, in fact, been used). Finally, we take all the results, sort them from negative to positive, and ask where the single actual result falls. If it falls in the extreme 5% of all results, we call it “significant.”‘

The key word here is *result*, as in “ask where the single actual result falls”. That result will be mu1 – mu2, so once again I do think mu1 – mu2 = 0 is a central aspect of the null hypothesis being tested.

]]>Suppose you have two populations, A_i and B_i. We assume that these are drawn from a super-population U_i by a random number generator, and that they are large populations.

Now, we give a treatment to group A and placebo to group B with blinding of everyone involved.

Now suppose that we divide group A into Aa and Ab by random number generator, and the treatment we give is essentially equal in effect to Aa_i exchanging their outcome with the outcome of Ab_i. This clearly leaves the population distribution A_i unchanged. However, there could potentially be large effects on individuals. For example, suppose the “exchange” is that a bank-account balance is transferred between two random people. Now some poor people are rich, and some rich people are poor.

We can generalize this to any set of effects that vary across population A and leave the distribution unchanged. Unless you are looking at before-after measurements within-person, you couldn’t detect this using randomization testing. And if you ARE looking at before-after measurements, then no test is needed: simply see if after differs from before.

]]>