Wow, good catch. Somebody call Daryl Bem!

This does seem like a red flag – are the studies designed to have adequate power for “meaningful” learning effects?

The generalized inverse problem section is also really nice.

There are many types of logic: classical logic, fuzzy logic, paraconsistent logics, Bayesian logic, classical-statistical logic, and so on. Each type of logic has its own definition of coherence, which, in general, is not in line with the other types. It is a great mistake to use one type of logic to interpret another without further consideration. See, for instance, the “MIU” and “pq-” postulation systems* and the problem of interpretation.

Trafimow (2014) interprets the p-value, a concept defined inside the classical statistical formulation, within a Bayesian framework. This is a huge mistake, since their domains of application are very different. If one studies the statistical theory behind the classical concepts, one realizes that the p-value cannot be defined using a conditional probability: H0 imposes restrictions on the probability measures that could explain the observed events in the sigma-field, and those statements are not measurable in the classical structure. Therefore, conditional statements on them are not well-defined in the classical framework.

If you make these statements measurable (by defining a larger space), you are imposing further restrictions where initially none were necessary. These further restrictions impose a type of interpretation that does not exist inside the classical model; in effect, you are making a Bayesian interpretation of a classical concept. Of course, once one understands the classical model (which is bigger than any probabilistic model) and the Bayesian model (which is, in essence, a probabilistic model), one realizes that such an interpretation is very limited and misguides the practitioner’s intuition. This is a common problem in modern statistics: people do not care much about formal notation or the theory behind the concepts, which leads to many invalid interpretations and feeds many controversies.

Links:

* http://www.math.mcgill.ca/rags/JAC/124/ToyAxiomatics.pdf

Trafimow (2014) http://homepage.psy.utexas.edu/HomePage/class/psy391p/trafimow.nhst.21003.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2038856/pdf/brmedj03566-0003.pdf

It is an interesting paper. I disagree that the data in table IV is at all a “huge signal”, in the sense that a signal should be informative. In fact they already knew the lung carcinoma and control groups were substantially different for other reasons (table II), so there was no reason to think the “null hypothesis” of precisely no difference should hold after that. The difference between smokers and non-smokers only corresponds to ~20 people, which is the same order of magnitude as the social class and place of residence differences shown in that table II.

The difference is also an order of magnitude less than the number of misdiagnoses they report correcting (e.g. table XII). For some reason, people diagnosed with “respiratory disease, other than cancer (n=335)” (Table X) appear *not* to have been included in the control group, which is inconsistent with the methods they describe (“non-cancer control”). It is also strange for that sample size to be the same as that of “Other Cases” shown in Table I.

I got a strong vibe of p-hacking from that paper. Still, Table V is rather convincing regarding a link to lung cancer, at least for a two-pack-a-day habit.

I have the ~~confidence~~ *rose tinted spectacles of cognitive & confirmation biases* to see things in data graphs and to be sure about them…

You’re welcome.

#1 would cause me to say, Hmmm, I don’t know. Too small a study.

Not sure what the rigorous theory says is the right thing to be doing.

1, 34

3, 32

and

5, 170

15, 160

and

20, 680

60, 640

The last table is (modulo a little rounding) Doll and Hill’s 1950 data on smoking and lung cancer, which has a huge signal Doll and Hill didn’t expect to see. The other two are (obviously) just scaled-down versions of the same thing. I chose the middle table because, very roughly, it gives p=0.05 – which I don’t claim should or will convince anyone of anything, it just makes for an interesting comparison – and the top one because it’s the far extreme that retains integer counts.

I think the visual impact of plotting these should all be identical, unless one “cheats” in the sense I defined earlier. This would suggest that we do need inferential tools. Which is perhaps not very controversial here, but I’m interested to see how you get on with them.
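None of this arithmetic is in the original comment, but it is easy to check. Here is a standard-library sketch of Pearson’s chi-square for the three tables above (no continuity correction, which is why the middle table comes out nearer p ≈ 0.02 than the rough 0.05 quoted; Yates’s correction or Fisher’s exact test would move it toward 0.05):

```python
import math

def pearson_p(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Expected counts under independence of rows and columns.
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))
    # For 1 df the chi-square tail is an exact normal tail: P(Z^2 > chi2).
    return math.erfc(math.sqrt(chi2 / 2))

tables = [(1, 34, 3, 32), (5, 170, 15, 160), (20, 680, 60, 640)]
for t in tables:
    print(t, pearson_p(*t))
```

This prints one p-value per table – roughly 0.30, 0.02, and 4×10⁻⁶ – identical proportions, very different strength of evidence.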

Yes, I’m still reading. :) And maybe you are right. I don’t know.

Do you have an actual example? I’d be curious to see a case where the effect isn’t evident from the plot / table / summary statistic, but seeing (say) an NHST convinces one that it is clearly an important effect.

Plots of a 2×2 table with entries (3,4,5,6) should look identical to plots for (3 million, 4 million, 5 million, 6 million) – and so would any summary statistics computed using only proportions. You might be able to “see” an effect, but you’ll have no indication of how noisy the data’s description of it is.

Unless, that is, you cheat and make plots with inferential summaries on them, or count standard errors as non-inferential data summaries.

If you have a way round this problem I’d be very interested – thanks.
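As an illustration of that point (a sketch of mine, not from the thread): the difference in row proportions for (3, 4, 5, 6) is identical at any scale, but its standard error shrinks with the square root of the counts.

```python
import math

def diff_and_se(a, b, c, d):
    """Difference in row proportions, and its standard error,
    for the 2x2 table [[a, b], [c, d]]."""
    p1, n1 = a / (a + b), a + b
    p2, n2 = c / (c + d), c + d
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, se

m = 1_000_000
print(diff_and_se(3, 4, 5, 6))                   # small table
print(diff_and_se(3 * m, 4 * m, 5 * m, 6 * m))   # same proportions, huge n
```

The proportion difference is identical in both calls; the standard error of the million-scale table is exactly 1000 times smaller, which is precisely the information a proportions-only plot throws away.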

The place to look is Jaynes’s book and several of his papers where he talks about the chi-squared test being an approximation to entropy (or at least it should be called entropy; usually it’s called relative entropy or Kullback–Leibler divergence or something). You may want to look at the papers “Where do we Stand on Maximum Entropy?” and “Concentration of Distributions at Entropy Maxima”.

But here’s a deeper and simpler explanation for what I think is going on. So imagine a Bayesian world where P(x|A) is modeling the uncertainty about some true value x* rather than frequencies. The high probability manifold (HPM) of P(x|A) is a kind of bubble around x* (if you’ve done your modeling right, that is!) that describes an uncertainty range for x*.

In that world it’s important to check sometimes whether a value like x* is in the HPM of P(x|A). You could think of this in several different ways. In some instances it’s equivalent to checking whether a Bayesian credibility interval for x* is an accurate prediction of x*’s location. In other, model-building instances, it’s equivalent to performing a posterior predictive check in the style of Gelman.

That’s all completely general. Now specialize it to the case of repeated trials. It’s helpful to think of a concrete example, so use Jaynes’s famous dice example. Let x_1, …, x_n be a sequence of dice rolls where each x_i takes one of the values 1,2,3,4,5,6. Imagine we have a sequence of observed values x*_1,…,x*_n, and we want to check whether it’s in the HPM of some P(x_1,…,x_n|A), just as in the general case.

In Jaynes’s dice example n=20,000, which makes the joint distribution very inconvenient to deal with. To make life simpler, we can instead use P(x_1,…,x_n|A) to derive a distribution over frequency distributions, P(f_1,…,f_6|A). Using the observed x*_1,…,x*_n we can trivially compute an observed frequency distribution f*_1,…,f*_6. So rather than check whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A), we can check whether f*_1,…,f*_6 is in the HPM of P(f_1,…,f_6|A).

This is a massive convenience because it reduces the dimensionality of the problem from n=20,000 down to 6.

One more fact before getting to the punch line. For a wide and common class of distributions on x_1,…,x_n (essentially distributions of exponential type, like all the common distributions taught in stats books), you get an interesting phenomenon: P(f_1,…,f_6|A) is very sharply peaked about some modal value f’_1,…,f’_6. This modal value turns out to be a Maximum Entropy distribution itself, just like all the common distributions (normal, Poisson, …) taught in statistics.

So here finally is the rub: checking whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A) for n greater than about 30 or so is equivalent to performing a classical chi-squared test using f*_1,…,f*_6 and f’_1,…,f’_6.

Bottom line: the legitimate parts of Frequentism fall out of Jaynes’s probability theory as special cases.
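A small numerical check of that equivalence (my sketch; the observed counts are invented, only n = 20,000 and the six dice categories come from the comment): Pearson’s chi-square statistic and the entropy-based statistic 2n·KL(f*‖f’) nearly coincide when the observed frequencies sit close to the modal ones.

```python
import math

# 20,000 dice rolls; counts are invented, close to uniform.
observed = [3240, 3260, 3300, 3360, 3400, 3440]   # sums to 20,000
n = sum(observed)
expected = [n / 6] * 6   # modal/max-entropy frequencies with no constraints

# Pearson chi-square statistic.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 2*n*KL(f*||f') written in counts: the log-likelihood-ratio (G) statistic.
g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print(chi2, g)   # the two statistics agree closely for small deviations
```

For deviations this small the two statistics differ only in the third order, which is the sense in which the chi-squared test “is” an approximation to relative entropy.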

Just on a side note, do you know of some post or paper where such a link is described? Not questioning it, just wanting to know more :)

Do you have an example of an effect that you care about which graphs etc. wouldn’t have revealed to a naive observer, but systematic NHST would have?

Assuming that your intuition about what is, say, meaningful concerns not only the size of an effect but also whether what was observed can be distinguished from meaningless random variation, and assuming that it works well: do you really think many people have such a well-working intuition without having at least computed some tests (or something similar) on data they knew, in order to build it up?

I work in a regulatory agency; that’s where lines belong, but they can be argued away if the situation dictates.

(For the record I like the idea of placing more emphasis on graphs and descriptive statistics – but I’m worried that it is really quite difficult to interpret these without reference to error. Indeed standard errors and confidence intervals are most useful to my mind when viewed as descriptive rather than inferential statistics.)

Presumably we would not see an outcry if a journal made it explicit that they would not accept divine revelation as a justification for scientific claims; of course we wouldn’t, because part of science is making these calls. If you want to argue against the policy, argue that they’ve drawn the wrong line. Arguing that these lines can’t or shouldn’t be drawn is absurd.

I am not necessarily defending p-values, just saying that the supposed remedy may only be treating the symptoms, and there is a good chance it is making things worse.

It would be better to do as PLOS ONE does: publish on the basis of the research design and question, not the results.

If the problem is with editors wanting to publish “findings” or effects, you get p-hacking.

Now suppose you abolish p-values. We are only allowed descriptive stats and visual contrasts. However, you retain the need for “findings”.

Well, you know what you are going to get. You torture the descriptive stats, tables, and displays until you “see” a “finding”.

Now it’s sinking in…

To replicate, criticize a comment made on this blog.

Brilliant (and until you find group B, group A probably really can’t get it.)

For example, if your first p-value is 0.051 you might just twitch a little to get it down to 0.05. But if all you have, say, is a box plot of outcomes in treatment and control, and some implicit ocular cutoff by the reviewer, you might twitch a lot to get those boxes to tell a story.

Moreover, in doing this you will discover that besides statistical degrees of freedom you now have visualization degrees of freedom. You know, play around with scale, axis ranges, etc… until you realize: “The p-value is dead, long live the junk chart!”

The moral of the story is that you don’t cure a cold by blowing your nose.

No one is talking about whether they have a right to do it.

I reiterate as simply as I can, *any trend toward banning methods will do considerable harm to Bayesian Statistics in the long run*. Bayesians shouldn’t do it and shouldn’t advocate for it.

(From wiki)

In his “F.R.L.” [First Rule of Logic] (1899), Peirce states that the first, and “in one sense, the sole”, rule of reason is that, to learn, one needs to desire to learn and desire it without resting satisfied with that which one is inclined to think. So, the first rule is, to wonder. Peirce proceeds to a critical theme in research practices and the shaping of theories:

…there follows one corollary which itself deserves to be inscribed upon every wall of the city of philosophy:

Do not block the way of inquiry.

Peirce adds, that method and economy are best in research but no outright sin inheres in trying any theory in the sense that the investigation via its trial adoption can proceed unimpeded and undiscouraged, and that “the one unpardonable offence” is a philosophical barricade against truth’s advance, an offense to which “metaphysicians [journal editors] in all ages have shown themselves the most addicted”.

I first saw the IOT test attributed to or described by Tukey, but I don’t really have a source (I have Edwards et al. somewhere but not with me; I haven’t seen Bookstein – thanks, Martha). A quick Google verbatim search on “intraocular trauma test” and “interocular trauma test” seems to turn up a significant number of both alternatives. Perhaps the intraocular folk are looking at the ocular system (two eyes), and the interocular folk are looking at eyeballs.

You recommend “persuading researchers of the errors of their ways.” I’ve found it difficult (but not always impossible) to persuade researchers of the errors of their ways. But it’s often not so hard to persuade researchers of the errors of *other* researchers’ ways. So sometimes we can proceed in crab-like fashion, first criticizing group A for some flaw, then group B sees the problem and tries to avoid these errors in their work; then we criticize group B for *their* errors, etc.

And this is all ok. There should be no embarrassment in having made a mistake; that’s how science works. What’s embarrassing is the people who don’t admit their mistakes, who don’t admit that their prized statistically significant finding might not represent any such pattern in the larger population.

The editors of this particular journal can do whatever they want. Maybe the well-publicized policy change will have some positive larger effect in research practice, maybe not. It’s a bank shot either way.

You might be an academic if you think a reviewer pointing out logical inconsistencies is the same as blanket bans on methods.

You might be an academic if you think wholesale bans of methods aren’t censoring them and aren’t censoring the people who wish to use them.

You might be an academic if you think scientific progress depends on choosing just the right journal policies rather than persuading researchers of the errors of their ways.

This is not about censorship. This is merely a (debatable) journal policy issue.

I read this a couple of days ago and did not email Andrew, figuring that he would certainly get it from various sources.

I like the basic idea underpinning the ban, but I do think this is too blunt an instrument. As others have noted, the problems with NHST have much more to do with the training of the people (mis)using and (mis)interpreting the statistics than with the statistics themselves. The approach BASP has taken is a crude approach to try to force people to do better, but I doubt it will be effective. The problems with research in the social sciences, particularly in social psychology, run much deeper than p-values; correcting those problems is likely to be a long and slow process, not one that can be shortcut by instructing authors to scrub p-values from their manuscripts. I believe that observation has been made in the comments on this blog before — force people misusing frequentist statistics to use Bayesian, and they’ll just misuse Bayesian instead. Perhaps I’ll be wrong and this will spark a revolution, but I’m not holding my breath.

Les Hayduk also brought up an interesting argument on the SEMNET listserv in which he made a case that the SEM chi-square should not be treated the same as other NHSTs. In his view, the BASP ban does not apply to the SEM chi-square. I’m not sure that the editors will appreciate the distinction.

The *inter*ocular trauma test is positive when the result hits you *between* the eyes.

Edwards, Lindman, and Savage (1963, “Bayesian Statistical Inference for Psychological Research”) attribute it to Berkson (1958).

Isn’t it a paradox that so many readers would email Andrew this story knowing that, with high probability:

1. Other readers will also email it,

2. Other readers had already read the news by the time you did.

Perhaps in an optimal world he would have received 0 or 1 email.

What does this tell us about reader preferences and beliefs?

Numerical Inference in the Sciences, 2014