Anon:

Wow, good catch. Somebody call Daryl Bem!

Your prediction of a horrible president just came true.

Not much. But it tells us quite a lot about the cost function of the readers. Cost of sending email to Andrew: low. Cost of not reading/discussing this on Andrew’s blog: HUUUGE

> Visually, the results often look quite similar for the two groups.

This does seem like a red flag – are the studies designed to have adequate power for “meaningful” learning effects?

+1

To get in on the Jaynes appreciation, there’s a nice non-technical discussion of tail-area Bayesian reasoning starting on page 52 here: http://bayes.wustl.edu/etj/articles/inadequacy.pdf

The generalized inverse problem section is also really nice.

There are many types of logic: classical logic, fuzzy logic, paraconsistent logics, Bayesian logic, classical statistical logic, and so on. Each type of logic has its own definition of coherence, which, in general, is not in line with those of the other types. It is a great mistake to use one type of logic to interpret another without further considerations. See, for instance, the “MIU” and “pq-” postulation systems* and the problem of interpretation.

Trafimow (2014) interprets the p-value, a concept defined inside the classical statistical formulation, within a Bayesian framework. This is a huge mistake, since their domains of application are very different. Anyone who studies the statistical theory behind the classical concepts will realize that the p-value cannot be defined using a conditional probability: H0 imposes restrictions on the probability measures that possibly explain the observed events in the sigma-field, and these statements are not measurable in the classical structure. Therefore, conditional statements on them are not well-defined in the classical framework. If you make these statements measurable (by defining a larger space), you are imposing further restrictions where initially none were necessary. These further restrictions impose a type of interpretation that does not exist inside the classical model. In this context, you are making a Bayesian interpretation of a classical concept. Of course, once one understands the classical model (which is bigger than any probabilistic model) and the Bayesian model (which is in essence a probabilistic model), one realizes that such an interpretation is very limited and misleads the practitioner’s intuition. This is a common problem in modern statistics: people do not care much about formal notation and the theory behind the concepts, which leads to many invalid interpretations and feeds many controversies.

Links:

* http://www.math.mcgill.ca/rags/JAC/124/ToyAxiomatics.pdf

Trafimow (2014) http://homepage.psy.utexas.edu/HomePage/class/psy391p/trafimow.nhst.21003.pdf

I should have guessed Jaynes! Will have to reread that. Thanks for the great elaboration!

Christian: I’m confused. Are you saying we should trust our intuition (which is formed by looking at graphs, or by the patterns other people say they see in our graphs), or not?

Mike: Surely making such a statement informed neither by looking at any graph nor at any conclusion I get from them is much less biased.

I gather “Doll and Hill” refers to table IV of this:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2038856/pdf/brmedj03566-0003.pdf

It is an interesting paper. I disagree that the data in table IV is at all a “huge signal”, in the sense that a signal should be informative. In fact they already knew the lung carcinoma and control groups were substantially different for other reasons (table II), so there was no reason to think the “null hypothesis” of precisely no difference should hold after that. The difference between smokers and non-smokers only corresponds to ~20 people, which is the same order of magnitude as the social class and place of residence differences shown in that table II.

The difference is also an order of magnitude less than the number of misdiagnoses they report correcting (e.g. table XII). For some reason people diagnosed with “respiratory disease, other than cancer (n=335)” (Table X) appear to have *not* been included in the control group, which is inconsistent with the methods they describe (“non-cancer control”). It is also strange for that sample size to be the same as that of “Other Cases” shown in Table I.

I got a strong vibe of p-hacking from that paper. Still, Table V is rather convincing regarding a link to lung cancer, at least for a 2-pack-a-day habit.

I have the ~~confidence~~ rose-tinted spectacles of cognitive & confirmation biases to see things in data graphs and to be sure about them…

You’re welcome.

Thanks. Well, just my naive intuition: I’d be convinced smoking is bad based on your Table #3. #2 too, though perhaps I’d be less sure.

#1 would cause me to say, Hmmm, I don’t know. Too small a study.

Not sure what the rigorous theory says is the right thing to be doing.

Well, the point holds for any 2×2 table. But if you like, consider the following tables:

1, 34

3, 32

and

5, 170

15, 160

and

20, 680

60, 640

The last table is (modulo a little rounding) Doll and Hill’s 1950 data on smoking and lung cancer, which has a huge signal Doll and Hill didn’t expect to see. The other two are (obviously) just scaled-down versions of the same thing. I chose the middle table because, very roughly, it gives p=0.05 – which I don’t claim should or will convince anyone of anything, it just makes for an interesting comparison – and the top one because it’s the far extreme that retains integer counts.

I think the visual impact of plotting these should all be identical, unless one “cheats” in the sense I defined earlier. This would suggest that we do need inferential tools. Which is perhaps not very controversial here, but I’m interested to see how you get on with them.
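
The point can be checked numerically. All three tables have identical proportions, so any proportion-only plot of them looks the same, yet an inferential tool separates them sharply. Below is a minimal, stdlib-only sketch using the chi-squared test without continuity correction (so the middle table’s p-value comes out near, but not exactly at, the roughly quoted 0.05; the exact number depends on which test and correction you use):

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-squared statistic (no continuity correction) and its
    one-degree-of-freedom p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, the chi-squared survival function is erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# The three tables from the comment: identical proportions, growing n.
tables = [(1, 34, 3, 32), (5, 170, 15, 160), (20, 680, 60, 640)]
for t in tables:
    stat, p = chi2_2x2(*t)
    print(t, round(stat, 2), round(p, 4))
```

Because the tables are exact scalings of one another, the chi-squared statistic grows linearly with the scaling factor while the proportions stay fixed, which is exactly the information a proportion-only plot throws away.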

@george:

Yes, I’m still reading. :) And maybe you are right. I don’t know.

Do you have an actual example? I’d be curious to see an example where the effect isn’t evident on seeing the plot / table / summary statistic but seeing (say) a NHST convinces one that it is clearly an important effect.

Rahul: Hope you’re still reading this – how would you make your idea work for a 2×2 table?

Plots of a 2×2 table with entries (3,4,5,6) should look identical to plots for (3 million, 4 million, 5 million, 6 million) – and so would any summary statistics computed using only proportions. You might be able to “see” an effect, but you’ll have no indication of how noisy the data’s description of it is.

Unless, that is, you cheat and make plots with inferential summaries on them, or count standard errors as non-inferential data summaries.

If you have a way round this problem I’d be very interested – thanks.

I totally agree. These editors are treating the symptoms, not the cause, of problems in social science research. I think scientists should start posting, in a public forum, their hypotheses, methods, n/power, and analysis methods (e.g., exactly how outliers will be treated) before they run their study, so that reviewers and editors can assess how much people massaged their data in a hunt for something publishable (i.e., what I think is the cause of the excessively high false-positive rate in the literature).

Rasmus,

The place to look is Jaynes’s book and several of his papers where he talks about the chi-squared test being an approximation to entropy (or at least it should be called entropy; usually it’s called relative entropy or Kullback–Leibler divergence or something). You may want to look at the papers “Where do we Stand on Maximum Entropy?” and “Concentration of Distributions at Entropy Maxima”.

But here’s a deeper and simpler explanation for what I think is going on. So imagine a Bayesian world where P(x|A) is modeling the uncertainty about some true value x* rather than frequencies. The high probability manifold (HPM) of P(x|A) is a kind of bubble around x* (if you’ve done your modeling right that is!) that describes an uncertainty range for x*.

In that world it’s important to check sometimes whether a value like x* is in the HPM of P(x|A). You could think of this in several different ways. In some instances it’s equivalent to checking whether a Bayesian credibility interval for x* is an accurate prediction of x*’s location. In other model-building instances, it’s equivalent to performing a posterior predictive check in the style of Gelman.

That’s all completely general. Now specialize this to the case of repeated trials. It’s helpful to think of a concrete example, so use Jaynes’s famous dice example. Let x_1, …, x_n be a sequence of dice rolls where each x_i has one of the values 1,2,3,4,5,6. Imagine we have a sequence of observed values x*_1,…,x*_n which we want to check for membership in the HPM of some P(x_1,…,x_n|A), just as described for the general case.

In Jaynes’s dice example n=20,000, which makes P(x_1,…,x_n|A) a very inconvenient distribution to deal with. To make life simpler, we can instead use P(x_1,…,x_n|A) to derive a distribution over frequency distributions, P(f_1,…,f_6|A). Using the observed x*_1,…,x*_n we can trivially compute an observed frequency distribution f*_1,…,f*_6. So rather than check whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A), we can check instead whether f*_1,…,f*_6 is in the HPM of P(f_1,…,f_6|A).

This is a massive convenience because it reduces the dimensionality of the problem from n=20,000 down to 6.

One more fact before getting to the punch line. For a wide and common class of distributions on x_1,…,x_n (essentially distributions of exponential type, like all the common distributions taught in stats books), you get an interesting phenomenon: P(f_1,…,f_6|A) is very sharply peaked about some modal value f’_1,…,f’_6. This modal value turns out to be a Maximum Entropy distribution itself, just like all the common distributions (normal, Poisson, …) taught in statistics.

So here finally is the rub: checking whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A), for n greater than about 30 or so, is equivalent to performing a classical chi-squared test using f*_1,…,f*_6 and f’_1,…,f’_6.
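
This connection is easy to check numerically: for large counts, the classical chi-squared statistic is a second-order approximation to the G statistic, i.e., 2n times the KL divergence between the observed frequencies and the modal (maximum-entropy) ones. A minimal sketch (the dice counts below are hypothetical, chosen only so they sum to 20,000; they are not Jaynes’s actual data):

```python
import math

# Hypothetical counts for 20,000 rolls of a six-sided die.
counts = [3246, 3449, 3422, 3257, 3373, 3253]
n = sum(counts)                  # 20,000
expected = [n / 6] * 6           # modal (maximum-entropy) fair-die frequencies

# Classical chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))

# G statistic: 2 * sum(O * log(O / E)) = 2n * KL(observed || expected).
g = 2 * sum(o * math.log(o / e) for o, e in zip(counts, expected))

print(chi2, g)  # nearly equal when counts are large
```

A second-order Taylor expansion of O·log(O/E) around O = E recovers (O − E)²/E exactly, which is why the two statistics agree whenever the observed frequencies sit near the entropy maximum.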

Bottom line: the legitimate parts of Frequentism fall out of Jaynes’s probability theory as special cases.

“I have good reason to believe for example that the chi-squared test with p-value is actually a Bayesian procedure in Frequentist clothing.”

Just on a side note, do you know of some post or paper where such a link is described? Not questioning it, just wanting to know more :)

This obviously depends on who counts as a “naive observer”. I have given statistical advice to a not so small number of PhD students and early career scientists, and I often tell them that they should look at their data and not rely on tests, at least not on as many as they feel they need to do. My experience is that many of them are very, very wary of trusting anybody’s intuition and think that numbers are much better. When I look at their data and claim that it’s clear for everyone to see what is going on, some realize that they can see it indeed, but some would just insist so strongly that they need an “objective” number that in the end I don’t know whether they just don’t trust their intuition (and could be helped by advertising intuition over p-values more) or whether they don’t even have one. I have the confidence to see things in data graphs and to be sure about them, but I still think that playing around with p-values, among other things, helped me build this up.

Christian Hennig:

Do you have an example of an effect that you care about which graphs etc. wouldn’t have revealed to a naive observer but systematic NHST would have?

Bookstein (p. xxvii) quotes Edwards, Lindman and Savage as saying (p. 217): “It has been called the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes.” He calls it the ITT.

Rahul: I think I learnt quite a bit about what can happen (or not) just because of random variation from running tests and computing p-values.

Assuming that your intuition about what is, say, meaningful concerns not only the size of an effect but also whether what was observed could be distinguished from meaningless random variation, and assuming that it works well – do you really think many people have such a well-working intuition without having at least computed some tests (or something similar) on data they knew, to build it up?

Richard: We will agree to disagree.

I work in a regulatory agency; that’s where lines belong, but they can be argued away if the situation dictates.

(For the record I like the idea of placing more emphasis on graphs and descriptive statistics – but I’m worried that it is really quite difficult to interpret these without reference to error. Indeed standard errors and confidence intervals are most useful to my mind when viewed as descriptive rather than inferential statistics.)

The journal is saying that there are some ways of justifying claims that will not fly in the journal. This is true of *every* scientific journal. The journal is not censoring particular scientific claims, but is rather regulating the way in which those claims are argued for. Pickiness about methods is actually the hallmark of science, not the death of it.

Presumably we would not see an outcry if a journal made it explicit that they would not accept divine revelation as a justification for scientific claims; of course we wouldn’t, because part of science is making these calls. If you want to argue against the policy, argue that they’ve drawn the wrong line. Arguing that these lines can’t or shouldn’t be drawn is absurd.

]]>+1

PS And because descriptive semantics are coarser in nature, you might just have to do more torturing than with p-values…

I am not necessarily defending p-values, just saying that the supposed remedy may only be treating the symptoms, and there is a good chance it is making things worse.

It would be better to do like PlosOne: publish on the basis of the research design and question, not of the results.

@Anon

If the problem is with editors wanting to publish “findings” or effects, you get p-hacking.

Now suppose you abolish p-values. We are only allowed descriptive stats and visual contrasts. However, you retain the need for “findings”.

Well, you know what you are going to get. You torture the descriptive stats, tables, and displays until you “see” a “finding”.

This does not make any sense…

I think of a journal much like I think of a park, or a nice museum or a shopping mall. We’ve all heard that we shouldn’t litter in parks, or touch the artwork, or be a nuisance at the mall. The social rules exist, but probably some people care about them more than others. It’s almost romantic to think that scientists are this magical group of people who will abide by these social rules just on the basis of their own virtue, but here is a journal that is abandoning that assumption, and is choosing to enforce a littering fine, and will expel people who run around the mall naked or who scratch the paintings to smell them. No one needs to go to this particular park, but it’s nice to know that it’s there if I ever want to be somewhere with no litter.

> It’s a bank shot either way.

Now it’s sinking in…

To replicate, criticize a comment made on this blog.

Brilliant (and until you find group B, group A probably really can’t get it.)

Um, speaking for myself, I’m not worried at all about wider bans on p-values and NHSTs. I care about this because I find it interesting/entertaining to read and think about some really off-the-wall opinion on a topic about which I know something.

For example, if your first p-value is 0.051 you might just tweak a little to get it down to 0.05. But if all you have, say, is a box plot of outcomes in treatment and control, and some implicit ocular cutoff by the reviewer, you might tweak a lot to get those boxes to tell a story.

Moreover, in doing this you will discover that besides statistical degrees of freedom you now have visualization degrees of freedom. You know, play around with scale, axes ranges, etc… until you realize: “The p-value is dead, long live the junk chart!”

The moral of the story is that you don’t cure a cold by blowing your nose.

“The editors of this particular journal can do whatever they want.”

No one is talking about whether they have a right to do it.

I reiterate as simply as I can, *any trend toward banning methods will do considerable harm to Bayesian Statistics in the long run*. Bayesians shouldn’t do it and shouldn’t advocate for it.

Richard: The below is what I am concerned about. Simply put, you can argue for your methodology, claims, and logic, but only if it does not involve what appears to be NHST (of course authors will find a way around this)?

(From wiki)

In his “F.R.L.” [First Rule of Logic] (1899), Peirce states that the first, and “in one sense, the sole”, rule of reason is that, to learn, one needs to desire to learn and desire it without resting satisfied with that which one is inclined to think.[112] So, the first rule is, to wonder. Peirce proceeds to a critical theme in research practices and the shaping of theories:

…there follows one corollary which itself deserves to be inscribed upon every wall of the city of philosophy:

Do not block the way of inquiry.

Peirce adds, that method and economy are best in research but no outright sin inheres in trying any theory in the sense that the investigation via its trial adoption can proceed unimpeded and undiscouraged, and that “the one unpardonable offence” is a philosophical barricade against truth’s advance, an offense to which “metaphysicians [journal editors] in all ages have shown themselves the most addicted”.

Oops. What you say makes sense, George. Perhaps I’ve been applying the wrong test all along. :-)

I first saw the IOT test attributed to or described by Tukey, but I don’t really have a source (I have Edwards et al. somewhere but not with me; I haven’t seen Bookstein–thanks, Martha). A quick Google verbatim search on “intraocular trauma test” and “interocular trauma test” seems to turn up a significant number of both alternatives. Perhaps the intraocular folk are looking at the ocular system (two eyes), and the interocular folk are looking at eyeballs.

Anon:

You recommend “persuading researchers of the errors of their ways.” I’ve found it difficult (but not always impossible) to persuade researchers of the errors of their ways. But it’s often not so hard to persuade researchers of the errors of *other* researchers’ ways. So sometimes we can proceed in crab-like fashion, first criticizing group A for some flaw, then group B sees the problem and tries to avoid these errors in their work; then we criticize group B for *their* errors, etc.

And this is all ok. There should be no embarrassment in having made a mistake; that’s how science works. What’s embarrassing is the people who don’t admit their mistakes, who don’t admit that their prized statistically significant finding might not represent any such pattern in the larger population.

The editors of this particular journal can do whatever they want. Maybe the well-publicized policy change will have some positive larger effect in research practice, maybe not. It’s a bank shot either way.

Nobody cares about this journal. They care because of the possibility of wider bans on things like p-values and NHST’s.

We should start a “you might be an academic if …” series. I’ll kick it off.

You might be an academic if you think a reviewer pointing out logical inconsistencies is the same as blanket bans on methods.

You might be an academic if you think wholesale bans of methods isn’t censoring them and isn’t censoring the people who wish to use them.

You might be an academic if you think scientific progress depends on choosing just the right journal policies rather than on persuading researchers of the errors of their ways.

This is not about censorship. This is merely a (debatable) journal policy issue.

I read this a couple of days ago and did not email Andrew, figuring that he would certainly get it from various sources.

I like the basic idea underpinning the ban, but I do think this is too blunt an instrument. As others have noted, the problems with NHST have much more to do with the training of the people (mis)using and (mis)interpreting the statistics than with the statistics themselves. The approach BASP has taken is a crude approach to try to force people to do better, but I doubt it will be effective. The problems with research in the social sciences, particularly in social psychology, run much deeper than p-values; correcting those problems is likely to be a long and slow process, not one that can be shortcut by instructing authors to scrub p-values from their manuscripts. I believe that observation has been made in the comments on this blog before — force people misusing frequentist statistics to use Bayesian, and they’ll just misuse Bayesian instead. Perhaps I’ll be wrong and this will spark a revolution, but I’m not holding my breath.

Les Hayduk also brought up an interesting argument on the SEMNET listserv in which he made a case that the SEM chi-square should not be treated the same as other NHSTs. In his view, the BASP ban does not apply to the SEM chi-square. I’m not sure that the editors will appreciate the distinction.

Intraocular trauma would have you hit in the eye.

The *inter*ocular trauma test is positive when the result hits you *between* the eyes.

Edwards, Lindman and Savage (1963, “Bayesian Statistical Inference for Psychological Research”) attribute it to Berkson (1958).

Isn’t it a paradox that so many readers would email Andrew this story knowing that, with high probability:

1. Other readers will also email it,

2. Other readers had already read the news by the time you did.

Perhaps in an optimal world he would have received 0 or 1 email.

What does this tell us about reader preferences and beliefs?

Yes — see Bookstein, Measuring and Reasoning: Numerical Inference in the Sciences, 2014

…and yet, he was a racist, and the other things mentioned. He may have been one of the top ten for what he accomplished as president, but that doesn’t excuse his personal and moral failings.

]]>