Josh:

I updated the link.

]]>http://biorxiv.org/ was launched not that long ago and is operated by Cold Spring Harbor Laboratory (which has a good rep). Unsure how popular it’s been though.

]]>Hjk:

Yes.

]]>Is the point here the ambiguity of whether a ‘model’ or ‘theory’ here is a parameter in a statistical model or the model structure itself?

I.e. consider the schema:

S: parameter -> model -> output

That is: integral [p(y|theta)p(theta)]dtheta = p(y)

In Andrew’s example he is saying that T1 and T2 correspond to two *different* instances of S, right? He can compute the probability of the parameters for each case, conditional on the model structure in each case, but these parameters and the model structure are not (yet) directly comparable between models.

If you do want to compare them directly then you need to embed them within a bigger model – e.g. with a continuous parameter for the effect size of ovulation. In this case one may compare different theories formulated in terms of effect size (e.g. is it ‘big’ or ‘small’ or whatever). However, you would still want to check whether your ‘super’ model is reasonable before you believe the effect size estimates. In many cases this might not be true.

Now it may also be possible to make the model structure so general, and push all the assumptions into background parameters to be estimated as well (e.g. heading towards nonparametric stuff or hierarchical stuff), that you think you can always fit the data adequately and so don’t ‘need’ to model check.

However, you still have to think about the correspondence between your general model structure and the substantive theory. E.g. is the generality of your statistical model based on scientific understanding or just additional dof introduced to make sure you don’t need to model check? There’s a sort of bias (substantive theory) variance (statistical model flexibility) tradeoff.

In general most of our scientific theories (esp. for ‘noisy’ fields) will be inadequate to fully capture the data – i.e. there is no model structure + parameter combination that fully captures everything in a given data set.

Here you could look to see the reasons it fails and what that tells you, or you could go the other extreme and make models that ‘predict everything and hence nothing’ – i.e. machine learning (which might be a perfectly reasonable option if you don’t care about ‘understanding’).

Relating back to Anon’s I and K comment – I think Andrew is saying he doesn’t care to compute expressions like P(K|…), he’d rather reason along the lines of “Expressions like P(I|data, K) tell you what K implies” (!) when checking background assumptions.

]]>Anon:

Yes, I agree there is no bright line. The point is that when a model is being tested, or two models are being compared, in this way, it is the *statistical model* being tested or compared, not the *scientific hypothesis*.

To compute the Bayes factor, say, is to compute the relative probabilities of two statistical models in a very narrowly defined statistical framework. And it turns out that, for lots of the sorts of statistics problems where Bayes factors are promoted, the Bayes factor is highly sensitive to arbitrary and essentially uncheckable aspects of the model, aspects of the prior distribution that don’t really affect posterior inference *conditional* on either of the models but which have a huge effect on the Bayes factor. So, for this technical reason, I think that Bayes factors typically don’t work.
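
To make this sensitivity concrete, here is a minimal sketch under assumptions that are not from the thread: a point null H0: mu = 0 against H1: mu ~ N(0, tau^2), with a single summary statistic ybar ~ N(mu, se^2). The function name `bf01_and_posterior` and the numbers (ybar = 2, se = 1, tau = 10 or 100) are purely illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bf01_and_posterior(ybar, se, tau):
    """Hypothetical helper for the toy setup described above.

    H0: mu = 0
    H1: mu ~ N(0, tau^2)
    Data model: ybar | mu ~ N(mu, se^2)

    Returns (BF01, posterior mean of mu under H1, posterior sd of mu under H1).
    """
    # Marginal likelihood of ybar under each model.
    m0 = normal_pdf(ybar, 0.0, se)
    m1 = normal_pdf(ybar, 0.0, math.sqrt(tau ** 2 + se ** 2))

    # Conjugate normal-normal update for mu, conditional on H1.
    shrink = tau ** 2 / (tau ** 2 + se ** 2)
    post_mean = shrink * ybar
    post_sd = math.sqrt(shrink) * se
    return m0 / m1, post_mean, post_sd


for tau in (10.0, 100.0):
    bf01, post_mean, post_sd = bf01_and_posterior(ybar=2.0, se=1.0, tau=tau)
    print(f"tau={tau:6.1f}  BF01={bf01:7.3f}  "
          f"posterior mean={post_mean:.4f}  posterior sd={post_sd:.4f}")
```

Under these assumptions the posterior for mu conditional on H1 barely moves when tau goes from 10 to 100, while the Bayes factor in favour of H0 grows roughly tenfold.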

(different anon)

@christian

I have no problem with frequentism as a calculation. After all calculations on frequencies are just a useful mathematical approximation.

I do have a problem with frequentism as a language for the scientific method. In particular hypothesis testing is such that you never pose a model of the hypothesis in question. If you only ever falsify null hypotheses, you end up learning very little beyond that null hypotheses are wrong.

Once you stop testing the theory in question, in my opinion, you’ve left the scientific method behind.

Hypothesis testing _never_ explicitly tests the theory in question. It only implicitly tests the theory in question under these so-called “severity” conditions, which happen to line up with the Bayesian condition in which the posterior converges sharply around the hypothesis in question if and only if the null hypothesis is false.

]]>(different anon)

@andrew I understand you don’t want to get into the habit of taking statements like “model is true” literally, but honestly it seems to be blurry to me. Effect/parameter estimation just seems like a stand-in for a very simple model.

In other words, isn’t “probability effect is between X and Y” operationally equivalent to “probability of a model which includes all possible effect sizes being between X and Y”?

]]>To follow up on artkqtarks’ comment:

The open-access part can be really important in reaching people who don’t have cheap access to peer-reviewed journals. One example I encountered recently: Some biologists I know are interested in improving practices in making inferences from paleontological data. But many paleontologists work for natural history museums that do not have the library resources/access that universities have. So these folks weighed the advantages and disadvantages of publishing in a regular journal vs a PLoS journal, decided on the latter, and ended up glad they did: The paper got an unusually large number of downloads. (I think they also gave short talks on it at paleontology meetings, which helped publicize it)

I suspect that Ioannidis’ reasoning in publishing his 2005 paper in a PLoS journal was similar: It would enable many more medical professionals and interested lay persons to have access to the entire paper, not just the abstract or a popular press summary.

So pay-to-publish can be done for altruistic reasons.

]]>Biomedical fields are certainly not very “enlightened.” Varmus writes, “Status conferred by the acceptance of papers in journals like Science, Cell, and Nature, or even in subsidiary journals of these “flagship” periodicals (e.g., Molecular Cell or Nature Biotechnology) has an indisputable effect on the process of recruitment and promotion of faculty.” This is a good description of the reality, even though this kind of attitude is silly, given that papers of questionable qualities often appear in these journals.

Besides the predictable opposition from the publishers, I guess the researchers were worried whether peer-reviewed journals would accept a report that had been posted on E-biomed (What’s the point of posting the preprint if I lose the chance to publish it in Cell by doing so?) and whether the reports would go through peer review before being posted on E-biomed (How will I know if the study is any good if it hasn’t been peer-reviewed?).

I think founding PLoS journals was an attempt to establish respectable peer-reviewed journals that are also open access. An open access alternative to Cell, if you will. They adopted the author-pay model to make it financially viable.

]]>Rahul says:

March 26, 2015 at 7:21 am

It is this myth “if only a statistician was involved” (aka if only the king knew) that bothers me.

My first hand experience (which ain’t a representative random sample) is that more often than not they don’t make things better and even often worse (though no doubt the _right_ statistician could have made things better).

As Christian pointed out, there are “requirements” of self-marketing but my sense is also lack of training/experience and mentor-ship.

What’s the primary motivation for statistical society’s Professional Development courses?

Raising money for the meeting (e.g. get well known speakers who will draw a crowd).

What’s the primary requirement for publication in a statistical journal?

Adequate technical development (as evidenced by difficult math) but well known authors get an exception as they have already been accredited as “real” statisticians (i.e. journals are not about communicating important ideas but rather providing input to academic hiring and promotion.)

Lots of good ideas here! I like pretty much everything that was suggested after my last posting. Pity that this is just the comments section of a blog.

]]>Anon says:

March 25, 2015 at 4:53 pm

The money and governance or ability to regulate. Regulatory agencies (e.g. FDA) get close to your ideal (barring regulatory capture to effectively block it).

That is required to purposefully manage science – given the politics, economics, psychology, sociology, etc.

(Sometimes in some periods/places hands off management works, but few think that’s today’s situation).

@Keith / Christian:

I was thinking about what you guys described as (a) “statisticians who promised that they could justify stronger interpretations of the data” or (b) “statisticians ….. tend to suggest they can provide much more than is reasonable from the data in hand”

Would it help if “house” statisticians were employed by the Journal? Or at least a panel of statisticians from which the journal assigned one to a submitted article? Somewhat like a reviewer.

I’m thinking there’s a big conflict of interest otherwise. Might it help for the technical-authors to justify their conclusions on their own and the statisticians to act as an independent analyst / watchdog?

I get the feeling that having the statistician as an integral part of the author’s team during the post-study analysis increases the pressures / likelihood of pushing conclusions stronger than justified.

]]>This is in reply to Anon’s suggestions 1-3. For some reason I can’t reply directly to that comment.

Maybe this is naive, but how about this:

4) Open peer review so that reviewers can receive credit/blame for their contributions in some form, and so the work contributed during the review is made public or at least accessible. To encourage this, how about a reviewer metric that evaluates the reviewers’ contribution over time — like an author citation metric. In cases where the reviewers choose to remain anonymous, a private reviewer ID can be used by the journal to forward to a service like Publons so that they still get some credit for the work.

5) An incentive system for post-publication peer review so that post-pub reviewers are not penalized for taking the risk. One possibility is to use the reviewer metric. See (4) for the anonymous reviewer option to apply here.

6) The possibility of pre-registration of the design and analysis protocol to (potentially) de-couple the statement of the hypotheses and analysis scheme from the data-dependent choices made during analysis. The reasons for this would be to discourage HARKing and explicitly acknowledge the contributions made during design. Maybe this could be a part of (3). This could include simulations and what is now described as “power analysis”.

7) A journal/archive section for simulation and/or methods development. This could include documented and reviewed source code. Code contributions and reviews could contribute to one’s author/reviewer metric.

]]>I think there’s an approximate “enlightenment” spectrum in academic publishing. At one end I count the Math / Comp Sci guys where openness seems high: People post manuscripts online, people used to circulate pre-prints, there are thriving lists / groups where top notch people blog & critique other papers and ask questions (e.g. Terry Tao is on MathOverflow), code gets posted online, most of the work is done with open source tools etc.

At the other end of the spectrum seem to be the social sciences. Won’t publish letters critiquing a past paper. Won’t let you publish if you’ve posted online. Replication is resisted. Getting access to raw data is like pulling teeth. etc.

The rest of us seem somewhere in between.

]]>I don’t claim to know what the opposition was like and I’m not sure if the following documents give you a clear enough picture. But they provide some history.

http://www.ncbi.nlm.nih.gov/books/NBK190606/

https://scholarworks.iu.edu/dspace/bitstream/handle/2022/170/wp01-03B.html

http://www.nih.gov/about/director/pubmedcentral/ebiomedarch.htm#Addendum

Keith O’Rourke says:

March 25, 2015 at 4:08 pm

My ideal way forward is to have three types of research.

1) Experimental studies where full datasets are published along with a detailed methodology and some prose discussing anything that may be important noticed during data collection. Some exploratory analysis and parameter estimation may, but not necessarily, also be included.

2) Meta-analytic studies that contain a *detailed* review of the literature. It should be comprehensive or cite a previous comprehensive meta-analysis for anything left out. There would be tables consisting of the primary methodological differences and similarities along with some summarized data. The presence or lack of any “exact” replications would be noted. Obviously there would be discussion and analysis of the stability of the measurements, parameter estimates (distributional, not just mean +/- error), etc.

3) Theory development. Here a mechanistic model is presented, possibly modified from a previously falsified one. This can be deduced from some well defined definitions and postulates (i.e. a full theory), or be phenomenological (in the sense of MOND). It is shown how the model is consistent with previous data and predictions are made regarding future data.

Why isn’t it like this? What am I missing?

]]>Keith: You’re probably right. I’ve seen statisticians hitting the brakes in some situations but I’ve also seen reviewers using honest modesty against me as an author, and potential project partners lured away by other statisticians who promised that they could justify stronger interpretations of the data.

Statisticians as a whole are probably as affected by the “requirements” of self-marketing as other scientists.

Oh, I wasn’t aware of the history. Is this stuff described anywhere? I’d love to read more.

]]>Anon says:

March 25, 2015 at 3:05 pm

“I would prefer that people recognized that collecting data and describing the methodology in detail is a valid scientific enterprise on its own. That data could then be used by “theorists” to come up with precise predictions, at which point a conclusion can be drawn.”

That was argued for here Greenland S, O’Rourke K. In: Modern Epidemiology (Rothman KJ, Greenland S, Lash TL, eds). 3rd ed. Philadelphia: Lippincott Williams, 652–682; 2008. Meta-analysis. [change “theorists” for meta-analysts]

]]>I agree. But I cannot fault founders of PLoS like Michael Eisen, Patrick Brown, and Harold Varmus. They wanted arXiv for biology, but got rejected by the biomedical community. PLoS was a compromise.

https://twitter.com/mbeisen/status/551395415510118400

“as arXiv published millionth paper, i’ll remind biologists that scientific societies killed arXiv for biology in 1999”

Keith,

They are pretty careful not to “accept substantive hypothesis B” with regard to survival:

“The proportion of cases with 30-day survival was higher than that of the controls with 30-day survival (67% vs. 34%, respectively; P = .02).”

But then drop the statistical ball when it comes to the lab tests:

“IVIG therapy enhanced the ability of patient plasma to neutralize bacterial mitogenicity… There was no difference in the mitogen-neutralizing capacity at baseline for cases and controls (P = .20, Wilcoxon rank-sum test)”

Medical research is hard, real hard, to do right. Observational data is very difficult to conclude anything from without the guidance of strong theory (as in astronomy). In medicine in particular, systematic errors due to varying diagnostic methods lurk about that can cause 90%+ differences between different times/locations. Even in perfect (blinded, no drop-outs, exactly the same baseline) RCTs, the problem of construct validity often remains.

I think Meehl is too optimistic on this front due to lack of direct experience when he writes (in the OP paper):

“If I refute a directional null hypothesis… in a biochemical medical treatment, I thereby prove…the counter null; and the counter null is essentially equivalent to the substantive theory of interest, namely, that…tetracycline [makes a difference] to strep throats.”

Christian wrote:

“Also, personally I think that people tend to jump to too strong conclusions far too quickly. Much of what we do is “exploratory” in one sense or another and the result is not a stab at a “best explanation”, but rather something that gives us somewhat better ideas and knowledge to be used in the next step.”

I would prefer that people recognized that collecting data and describing the methodology in detail is a valid scientific enterprise on its own. That data could then be used by “theorists” to come up with precise predictions, at which point a conclusion can be drawn. Then the forced statements of the form “these results suggestively implied the indication that the treatment may help” could be avoided. The statistician’s role IS important in providing methods by which to perform parameter estimation.

]]>Christian Hennig says: March 25, 2015 at 10:30 am

I agree with a lot you say here but my first hand experiences with other statisticians working with researchers is that they do tend to suggest they can provide much more than is reasonable from the data in hand (perhaps in an attempt to be popular). In fact, it’s often quite challenging to convince them and the researchers that they are expecting/trying to get too much out of the data.

One example where I was unable to do this but fortunately the journal reviewers were was http://www.ncbi.nlm.nih.gov/pubmed/10825042

In the original submission, I was excluded from authorship, as the other statisticians suggested they could identify the one best model (for removing confounding) and, along with the primary investigator (young at the time), argued that practically speaking they had ruled out confounding as a possible explanation.

(Now some of the authors were expecting the journal to reject and likely went along just to let others learn.)

]]>Anon: The problem is, if I understood you correctly, that what a statistician has to say is often quite dependent on the specific application. To make sense of effect sizes is a task that should be addressed by the subject matter expert and the statistician together. Same applies to whether and when NHST is “valid”. It depends strongly on what you want to know and what you want to do, and on a number of things that I as a statistician would want to know about the background, study design, etc. So I don’t think the right thing to do for us as statisticians is to make general “authoritative statements”, but neither do I think we have nothing to say on these issues.

Also, personally I think that people tend to jump to too strong conclusions far too quickly. Much of what we do is “exploratory” in one sense or another and the result is not a stab at a “best explanation”, but rather something that gives us somewhat better ideas and knowledge to be used in the next step.

At the end of the day it would be sad if researchers “would not bother” with Stats if they knew how slow the process is and how far from well justified strong statements, “best explanations” etc. we still are. Because they have data to analyse in any case and I doubt that they could make better sense of them without the statisticians. The statistician may often be the one who could stop a researcher from jumping to a too strong conclusion too quickly (much work criticised by Meehl did not involve statisticians although it involved statistics, I’d believe), but you may be right that this often doesn’t make the statistician very popular.

]]>Thanks for the link! Amazing lectures.

]]>Upon reading:

People went from experimenting on static cling, to the iPhone in two centuries and change.

About the same length of time it took to go from Laplace to the current slop in statistics.

I laughed so hard that I began to worry if I would ever breathe normally again. Funny, but a little sad, like many of the sharper observations on this blog.

Bob

]]>Thank you. Makes sense.

]]>“Frequentists get all of their methods from intuitive ad-hoc inventions. Such methods suffer from any limitations their intuitions have. In practice while their intuition may be superb, it still butts up against human limits and every one of their methods has severe failings because of it.”

Reminds me of this quote:

“…the anxious precision of modern mathematics is necessary for accuracy. In the second place it is necessary for research. It makes for clearness of thought, and thence for boldness of thought and for fertility in trying new combinations of ideas. When the initial statements are vague and slipshod, at every subsequent stage of thought common sense has to step in to limit applications and to explain meanings. Now in creative thought common sense is a bad master. Its sole criterion for judgment is that the new ideas shall look like the old ones. In other words it can only act by suppressing originality.”

An Introduction to Mathematics. Cambridge: Cambridge University Press, 1911. http://www.gutenberg.org/ebooks/41568

]]>sorry for the delay in responding, Rahul, but health problems, as well as time zone problems, keep me from being very dependable.

The journal has page charges for publication. That is not so unusual now, but in psychology paying to have a paper published has always seemed suspect. The pressure on the journal editors was to fill up issues, and so it published some fairly marginal stuff. I suspect the rejection rate was quite low. Papers published in the journal (and I had at least one) were sent out for peer review, but that review was usually pretty cursory and easy for authors to deal with. On the other hand, some reviewers were careful (as I tried to be), and some authors took the recommendations pretty seriously.

So, yes, papers were peer reviewed, but review was not always rigorous.

Lee

Christian,

Sorry for any possible confusion. I am not the same person posting under the name “Anonymous” in this thread. I agree with Andrew when he wrote: “I don’t like Bayesian versions of null hypothesis significance testing either.”

My posts here are focusing on what Meehl refers to as the “corroboration problem”. To use another Meehl term, imagine “Omniscient Jones” told us the exact effect size. Then what? If statisticians really have nothing to say on this, they really need to make it clear. It is not that anyone claimed to solve that problem, but I suspect many researchers would not bother with stats if they truly understood this to be the case. They have much bigger problems to deal with first.

]]>Anon: Your thoughts are appreciated. In many situations you can neither have 1) nor 2); “very expensive, possibly impossible”, as you say. If we can have it, we should, I agree. If we can’t have it, we need to be modest. Frequentists and Bayesians alike. But people want to make strong claims to get their grants, media coverage, industry support, so modesty is not fashionable. That’s not a frequentist vs. Bayes problem.

(In my original post in this thread I emphasized that only a quite small part of Meehl’s “obfuscating factors” has to do with the frequentist vs. Bayes issue. I don’t think we do the paper justice if we focus all too much on this here.)

]]>The Gelman and Shalizi one that I think spells this out most clearly and fully is:

http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf

]]>“If we have a significant (in this sense) clustering, of course we may be interested in explaining this more precisely”

This is what non-statisticians want an algorithm to achieve; in the absence of this they create myths about what the statistical methods they are taught can do for them. The latter is observational fact. I would guess they do this because they cannot understand why they were taught a method, and why everyone uses a method, that cannot provide this information.

Statisticians really need to make a clear, authoritative statement (maybe the ASA) on what they can agree to define as valid forms of NHST. This should be done both in prose and using more rigorous notations (e.g. math, logical). It should be in the first and last sentences that NHST cannot explain the reason why the null hypothesis is “false”. They should also enumerate precisely what it is that the valid forms of NHST can achieve and compare its drawbacks and merits to other approaches. Under what conditions is NHST thought to be the optimal method by its statistically-trained proponents?

I propose that to explain an effect (here, the clustering) requires:

1) An a priori prediction deduced from some theory consistent with new data.

2) Ruling out any plausible alternative explanations that people come up with.

Vague a priori predictions (“no effect”, “positive/negative effect”, “some clusters”) are not invalid. However, in practice then step 2 will be very expensive, possibly impossible, to achieve with any amount of rigor. It will require many controls and assumptions regarding construct validity to be checked. For that reason Meehl (1990) concluded we should strongly prefer, even require, precise predictions:

“In the strong use of a significance test, the more precise the experiment, the more dangerous for the theory. Whereas the social scientist’s use… where H0 is that “These things are not related,” I call the weak use. Here, getting a significant result depends solely on the statistical power function, because the null hypothesis is always literally false.”
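
A rough illustration of the “weak use” point in that quote, assuming a one-sample two-sided z-test with known unit variance and a tiny true standardized effect of 0.02. Both the setup and the function name `power_two_sided_z` are assumptions chosen for illustration, not something from Meehl’s paper:

```python
import math
from statistics import NormalDist


def power_two_sided_z(effect, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (known sd = 1).

    `effect` is the true standardized mean (true mean / sd) and `n` is
    the sample size. Hypothetical helper for this sketch only.
    """
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * math.sqrt(n)  # mean of the test statistic under the true effect
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# The null is "literally false" here, but only by a tiny amount.
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,d}  power={power_two_sided_z(0.02, n):.3f}")
```

With an effect this small, the chance of a “significant” result sits close to alpha at n = 100 and approaches 1 at n = 1,000,000, so in this toy setup rejection is driven by sample size rather than by anything the substantive theory predicted.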

The difficulty in deducing any precise prediction is the main problem with using what Hull (1935) called “isolated and vagrant hypotheses”, contrasted with “experiments which are directed by systematic and integrated theory…[which] in addition to yielding facts of intrinsic importance, has the great virtue of indicating the truth or falsity of the theoretical system from which the phenomena were originally deduced”.

If statisticians have an alternative method of determining the best explanation for an observed “effect” (be it clustering, difference between means, or something else) then what is it? I have not been able to find it. If they do not have this, a statement to that effect should be included in the clear, authoritative statement mentioned above.

Sorry for the long post, but I don’t see how it can be made any more concise.

Meehl, P (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141

http://www.tc.umn.edu/~pemeehl/147AppraisingAmending.pdf

Hull, C. L. Nov 1935. “The conflicting psychologies of learning—a way out”. Psychological Review, Vol 42(6), 491-516. http://psychclassics.yorku.ca/Hull/Conflict/

]]>I think I got what I needed from the various exchanges. I think the underlying difficulty with communication on this matter is that you don’t really believe in NHST so Pr (data|H0) is not the fundamental concept that most statisticians think it is. A particular example: H0 is the mean of a normal (mu, 1) so Pr (H0|data) is meaningful, but there is an assumption also of basing the estimation on the assumption of the normal which Pr (H0|data) doesn’t really test since H0 consists of not only the estimate of mu but also of the assumption of normality, and so Pr[H0|data] is misleading at best (and potentially completely wrong) as a test of the model. Is this your thinking in a nutshell, or is it more complicated than that?

]]>“If that were the gaol . . .” All too true!

]]>I actually agree with you that frequentism is unfalsifiable in a certain sense. According to my interpretation, frequentism is *not* a description of reality which in this sense could be true or false, but rather a way of thinking about certain phenomena; as are the different flavours of Bayesian statistics.

You can say that people did too many stupid things with frequentism, so you don’t like it, which is fair enough. Nowhere will you find me talking about Bayesian statistics in the same way you talk about frequentism and the frequentists; I think that it’s legitimate and often fine not to do things in a frequentist but in a Bayesian manner.

But while you’re talking, people do stupid things with Bayes, too, as you know and have already conceded (and Bayes isn’t exactly only around for a year or so). So I just don’t think that this works as an argument against frequentism and in favour of Bayes.

]]>Anon: No disagreement. My paper says explicitly that in this case non-rejection is more informative, because it means we found a model that can explain the data without clustering, and therefore we know that the data can’t be evidence in favour of a clustering. I believe that this is useful, and in this case the test delivers exactly what we should be interested in.

If we have a significant (in this sense) clustering, of course we may be interested in explaining this more precisely, and you’re right that the test alone doesn’t give an explanation. But nowhere I said that the test is the only thing we should do and will tell you everything.

You may think that a proper full Bayesian analysis is the only thing that should be done and can address all possible questions. But this is really only the case if you’re able to start, before seeing the data, with a prior over absolutely everything that is conceivable, and I have yet to see a single Bayesian analysis that manages to do this. (OK, perhaps if you model a single toss of a coin.)

]]>Andrew: OK, I kind of suspected that this was the meaning you gave the term “NHST” when you gave the first response to me, and in this sense I agree. But talking about “null hypothesis significance testing” in this way is misleading because people may think that you mean all kinds of significance tests of hypotheses, including the useful ones.

]]>Andrew,

The goal of statistical analysis isn’t to get to the truth. That’s impossible for us mortals in general. If that were the gaol though then perhaps posteriors for models don’t “make sense” and aren’t “useful”. Frequentists seem to be guided by that kind of mindset, and they have no use for posteriors.

The real goal is subtly weaker, but actually achievable by us humans. The actual goal of a statistical analysis is to get as close to the truth *as the evidence allows*. In that case, posteriors for models make perfect sense and are useful.

If I=“a meteor wiped out the dinosaurs”, then in practice we deal with things like P(data |I,K) where “K” represents lots of other hypotheses or information. So if you use Bayes theorem you’re really getting a P(I | data, K). Usually, we don’t explicitly write the “K” but it’s always there.

So saying P(I |data, K) has a dependence on K isn’t any kind of limitation to computing or using P(I |data, K). Since such K’s are always present in truth, you’re basically denying that Bayes theorem ever “makes sense”.

If K is true, then you’ve learned something about meteors. If K is questionable, then you need to evaluate it. In order to evaluate a claim it’s “useful” to know what that claim implies. Expressions like P(I|data, K) tell you what K implies.

So it both “makes sense” and is “useful” to compute P(I|data, K). What doesn’t make sense and what isn’t useful is to pretend that you can get at the truth of I, without dealing with K.

Your point (3) is another whole ball of wax. If you’re using probabilities to model uncertainty rather than frequencies, that problem will NEVER occur. If it’s “arbitrary” whether N(0,sig=10) or N(0,sig=100) is used, then that can only happen if you have no evidence for or against values like 50. It’s up to you to use distributions which faithfully reflect the uncertainty implied by the evidence you have. If you don’t, that’s your limitation, not Bayes’s.

]]>Andrew,

Here is the big picture of what’s going on. Frequentists get all of their methods from intuitive ad-hoc inventions. Such methods suffer from any limitations their intuitions have. In practice while their intuition may be superb, it still butts up against human limits and every one of their methods has severe failings because of it.

Bayesians have an incomparable advantage. They’re basing everything off the sum and product rule. While these often agree with our naive intuitions, they often improve on anything we can see naively. So Bayesians can use this to both improve our intuitions and to get things right when intuition fails.

Consequently, when faced with some intuitive idea in statistics which seems to have a grain of truth, Bayesians should work to fit it into the Bayesian framework broadly defined. Jaynes did this constantly and, from my limited reading of Rubin, he did it at least sometimes. Maybe he did it all the time. Not doing that with your model checking stuff has at least the following consequences:

(1) You’re teaching a new generation of Bayesians to engage in the same intuitive ad-hocaries which hobbled classical statistics.

(2) You’re not teaching a new generation the real power of the sum and product rules, which is easy for new students to miss because they have to be the most innocuous looking equations ever.

(3) There are special cases where the Bayesian version is significantly better than your intuitive model checking. If you claim that’s not true, then you’re basically saying the sum and product rules of probabilities are sometimes false. Good luck with that.

(4) You’re giving ammunition to charlatans like Mayo, who don’t know 1/1,000,000 the math needed to check any of these technical facts, but seize on your words to proclaim “even most Bayesians reject Bayes these days”.

(5) You’re perpetuating the Statistician’s fallacy which, for some reason I don’t understand, statisticians commit at a rate a million times greater than everyone else. Namely the belief that “If I don’t see how to do it, it must be impossible”. Just because you don’t see how something fits in a Bayesian framework, doesn’t give you the right to claim it doesn’t fit.

]]>Jeff:

I don’t completely agree with the “selling shoes” comment. I think it’s more than that. As I noted in the above post, lots of statisticians who are highly technically competent (including me, for many years!) were generally aware of problems with selection bias but did not realize how central it is to much of statistics-as-it-is-practiced. Being aware of such problems did not suddenly put me out of a job; rather, it allowed me to do my job more effectively!

I agree that there is some number of practical researchers who can’t do much more than turn the crank, and for them there is a positive value in methods such as null hypothesis significance testing that allow them to turn raw data into published papers, to perform “uncertainty laundering,” as I put it in one recent paper. But the question I wanted to raise in the above post was not what *their* problem was; rather, I was asking what was *my* problem, and the problem with the statistics profession, that we did not realize the scale of this issue, that we naively thought that problems with hyp tests could be solved using conf intervals, etc.

Anon:

I appreciate you and others pushing me to explain this more carefully but I’d appreciate a bit of clarity on your part. I can’t be sure but I think you’re being sarcastic when you say “Good thing there isn’t one,” and you’re referring to Bayes’ theorem.

My problem with applying Bayes’ theorem in this way is the usual GIGO: in the sorts of examples I’ve worked on, the marginal posterior probabilities of the data under different models are not so meaningful because these probabilities depend crucially on aspects of the model that are set arbitrarily.

Finally, you can call my attitude “silly” all you want, but it’s my experience. I have not found this sort of thing helpful in my applied work. I fully accept that others have found these methods helpful, indeed I said as much in my 1995 paper with Rubin. Lots of methods depend on assumptions that don’t make complete sense, but can still be useful. Maybe not useful to me, but useful to other practitioners who have mastered the approach and can understand the numbers that come out.

]]>Anon:

The trouble is that an event such as “a meteor wiped out the dinosaurs” does not imply a single model for data. And you can’t compute a meaningful posterior probability for such an event without a full probability model.

Think of it this way.

Suppose you have general scientific theories T1 and T2 (for example, T1 is the theory that ovulation is related to voting, and T2 is the theory that there is no relation), and corresponding statistical models M1 and M2 (that is, probability models with unknown parameters, priors on the parameters, probability distributions for observed data, measurement errors, sampling, the whole deal).

Here are my problems with using Bayesian inference to get the posterior probabilities of T1 and T2:

1. In the sort of applications I’ve worked on, it just doesn’t make sense to talk about the probability that T1 or T2 is true (in the example I’ve just given, everything is related to everything; there is certainly *some* connection between ovulation and voting).

2. The connection between T and M is typically speculative and weak. That’s a key problem with null hypothesis significance testing (Bayesian or otherwise), that the rejection of M2 is taken as evidence in favor of T1. But this is highly dependent on how M1 and M2 are formulated (for example, assumptions about measurement errors).

3. In the Bayesian setting in particular, the posterior probabilities of T1 and T2 are typically highly dependent on aspects of the prior distribution that are set arbitrarily. For example, if you change a weak prior on some parameter from N(0,10^2) to N(0,100^2), you change the marginal likelihood ratio by roughly a factor of 10. I’ve written about this in various places, notably my 1995 article with Rubin and chapter 7 of BDA3.
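
To make the arithmetic in point 3 concrete, here is a minimal sketch under assumptions that are mine, not part of the original argument: a single observation y with known noise sd 1, M2 fixing the parameter at 0, and M1 putting a N(0, prior_sd^2) prior on it. The function names and the value y = 2 are hypothetical, chosen only for illustration.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_factor_m1_vs_m2(y, prior_sd, noise_sd=1.0):
    """Marginal likelihood of y under M1 divided by that under M2.

    Assumed toy setup (not from the original post):
      M1: theta ~ N(0, prior_sd^2), y | theta ~ N(theta, noise_sd^2),
          so marginally y ~ N(0, prior_sd^2 + noise_sd^2).
      M2: theta = 0 exactly, so y ~ N(0, noise_sd^2).
    """
    marginal_m1 = normal_pdf(y, 0.0, math.sqrt(prior_sd ** 2 + noise_sd ** 2))
    marginal_m2 = normal_pdf(y, 0.0, noise_sd)
    return marginal_m1 / marginal_m2


def posterior_mean_m1(y, prior_sd, noise_sd=1.0):
    """Posterior mean of theta under M1 (standard conjugate normal result)."""
    shrink = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    return shrink * y


y_obs = 2.0  # hypothetical observation
bf_sd10 = bayes_factor_m1_vs_m2(y_obs, prior_sd=10.0)
bf_sd100 = bayes_factor_m1_vs_m2(y_obs, prior_sd=100.0)
ratio = bf_sd10 / bf_sd100

print(f"Bayes factor with N(0,10^2) prior:  {bf_sd10:.4f}")
print(f"Bayes factor with N(0,100^2) prior: {bf_sd100:.4f}")
print(f"Ratio between the two:              {ratio:.2f}")
print(f"Posterior mean of theta, sd=10:     {posterior_mean_m1(y_obs, 10.0):.4f}")
print(f"Posterior mean of theta, sd=100:    {posterior_mean_m1(y_obs, 100.0):.4f}")
```

In this toy case the within-model inference for the parameter barely moves when the prior sd goes from 10 to 100, while the ratio comparing the two models shifts by close to a factor of 10.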

]]>+1

]]>You see some fundamental distinction between events and models that I don’t see. If I’m a biologist and I’m building a statistical model using the hypothesis “a meteor wiped out the dinosaurs”, is that an event or a model?

]]>If there were a theorem relating P(Model | actual_data) and P(actual_data |Model) that would make comments like:

“I have not found this sort of thing helpful…it does not make sense”

seem pretty silly. Good thing there isn’t one.

]]>Anon:

You can feel free to compute posterior probabilities of models. I have not found this sort of thing helpful, for reasons I’ve discussed in many papers, including my 1995 paper with Rubin, my 2012 paper with Shalizi, chapter 7 of BDA3 (this was material that was in chapter 6 of the earlier editions), etc.

There are special cases of well-defined problems where Pr(model) makes sense to me; we give an example in chapter 1 of BDA. But in most of the cases I’ve seen this idea applied, it does not make sense.

]]>