Haynes Goddard writes:

I thought to do some reading in psychology on why Bayesian probability seems so counterintuitive, making it difficult for many to learn and apply. Indeed, that is the finding of considerable research in psychology. It turns out that it is counterintuitive because of the way it is presented, following no doubt the way the textbooks are written. The theorem is usually expressed first with probabilities instead of frequencies, or “natural numbers” – counts, in the binomial case.

The literature is considerable, starting at least with a seminal piece by David Eddy (1982), “Probabilistic reasoning in clinical medicine: problems and opportunities,” in Judgment under Uncertainty: Heuristics and Biases, eds. D. Kahneman, P. Slovic, and A. Tversky. Also much cited are Gigerenzer and Hoffrage (1995), “How to improve Bayesian reasoning without instruction: frequency formats,” Psychological Review, and Cosmides and Tooby (1996), “Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty,” Cognition.

This literature has amply demonstrated that people actually can readily and accurately reason in Bayesian terms if the data are presented in frequency form, but have difficulty if the data are given as percentages or probabilities. Cosmides and Tooby argue that this is so for evolutionary reasons, and their argument seems compelling.

So taking a look at my several texts (not a random sample, of course), including Andrew’s well-written text, I wanted to know how many authors introduce the widely used Bayesian example of determining the posterior probability of breast cancer after a positive mammography in frequency terms or counts first, before shifting to probabilities. None do, although some do provide an example in frequency terms later.

Assuming that my little convenience sample is somewhat representative, it raises the question of why the psychologists’ recommendations have not been adopted.

This is a missed opportunity: the psychological findings indicate that the frequency approach makes Bayesian logic instantly clear, which in turn makes the theorem easier to comprehend in probability terms.

Since those little medical inference problems are very compelling, starting with frequencies would make the lives of a lot of students a lot easier and increase acceptance of the approach. One can only imagine how much sooner the sometimes acrimonious debates between frequentists and Bayesians would have diminished, if not ended. So there is a clear lesson here for instructors and textbook writers.

Here is an uncommonly clear presentation of the breast cancer example: http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/. And there are numerous comments from beginning statistics students noting this clarity.

My response:

I agree, and in a recent introductory course I prepared, I did what you recommend and started right away with frequencies, Gigerenzer-style.

Why has it taken us so long to do this? I dunno, force of habit, I guess? I am actually pretty proud of chapter 1 of BDA (especially in the 3rd edition with its new spell-checking example, but even all the way back to the 1st edition in 1995) in that we treat probability as a quantity that can be measured empirically, and we avoid what I see as the flaw of seeking a single foundational justification for probability. Probability is a mathematical model with many different applications, including frequencies, prediction, betting, etc. There’s no reason to think of any one of these applications as uniquely fundamental.

But, yeah, I agree it would be better to start with the frequency calculations: instead of “1% probability,” talk about 10 cases out of 1000, etc.
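To make that concrete, here is a minimal sketch of the mammography example both ways. The specific rates (1% prevalence, 80% sensitivity, 9.6% false-positive rate) are the figures commonly quoted in this literature; any particular textbook’s numbers may differ.

```python
# Bayes via natural frequencies: the classic mammography example.
# The rates here (1% prevalence, 80% sensitivity, 9.6% false-positive
# rate) are the ones commonly quoted; individual textbooks vary.

population = 1000
with_cancer = round(0.01 * population)           # 10 women have cancer
without_cancer = population - with_cancer        # 990 do not

true_positives = round(0.80 * with_cancer)       # 8 of the 10 test positive
false_positives = round(0.096 * without_cancer)  # 95 of the 990 also test positive

# Of all women who test positive, how many actually have cancer?
posterior_freq = true_positives / (true_positives + false_positives)

# The same answer from Bayes' theorem stated with probabilities:
posterior_prob = (0.80 * 0.01) / (0.80 * 0.01 + 0.096 * 0.99)

print(f"counts: {true_positives} of {true_positives + false_positives} positives have cancer")
print(f"frequency form:   {posterior_freq:.3f}")
print(f"probability form: {posterior_prob:.3f}")
```

Both forms land on roughly 8%, but the frequency version makes the reason visible at a glance: the 95 healthy false positives swamp the 8 true positives.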

**P.S.** It’s funny that Goddard cited a paper by Cosmides and Tooby, as they’re coauthors on that notorious fat-arms-and-political-attitudes paper, a recent gem in the garden-of-forking-paths, power=.06 genre. Nobody’s perfect, I guess. In particular, it’s certainly possible for people to do good research on the teaching and understanding of statistics, even while being confused about some key statistical principles themselves. And even the legendary Kahneman has been known, on occasion, to overstate the strength of statistical evidence.

“Probability is a mathematical model with many different applications, including frequencies, prediction, betting, etc. There’s no reason to think of any one of these applications as uniquely fundamental.”

If one of the applications was general enough to include many or all of the others as special cases, that would be a reason to think that it’s (more) fundamental, no?

Corey:

Any of these models is general enough to include all the others: you can start with relative frequencies and derive prediction, betting, etc.; you can start with prediction and from that get relative frequencies, betting, etc.; you can start with betting; etc.

But the different models are more or less applicable depending on context. Relative frequency doesn’t make so much sense when absolute frequencies are low. Prediction doesn’t make much sense if you’re not actually making predictions. Betting doesn’t make sense if you consider selection of what bets you’re offered. But that’s ok. Probability is a mathematical model. Mathematical models aren’t perfect.

In all likelihood (ha ha) you already know where I’m going with this, but I still feel compelled to air my disagreement, for the sake of the readership if no one else:

Probability as a model of observable relative frequencies can’t get you to probability as a calculus of plausible inference. Probability as a calculus of plausible inference gives you relative frequencies via de Finetti’s exchangeability theorem.
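A small numerical illustration of that bridge, in a toy Beta-Bernoulli setup of my own choosing (the model and numbers are illustrative assumptions, not anything from the discussion): for an exchangeable binary sequence, the posterior predictive probability that the next trial is a success equals the expected long-run relative frequency of successes.

```python
# Toy Beta-Bernoulli illustration of the exchangeability bridge:
# the predictive probability for the next trial equals the expected
# long-run relative frequency.
import random

random.seed(1)
a, b = 1, 1                      # uniform Beta(1, 1) prior
data = [1, 1, 0, 1, 0, 1, 1, 0]  # observed exchangeable trials

a_post = a + sum(data)                    # 6
b_post = b + len(data) - sum(data)        # 4
predictive = a_post / (a_post + b_post)   # P(next trial = 1) = 0.6

# Monte Carlo check: draw the latent frequency from the posterior,
# extend each sequence far into the future, and average the resulting
# relative frequencies.
n_sims, horizon = 5000, 500
freqs = []
for _ in range(n_sims):
    theta = random.betavariate(a_post, b_post)
    successes = sum(random.random() < theta for _ in range(horizon))
    freqs.append(successes / horizon)

expected_freq = sum(freqs) / n_sims
print(predictive, expected_freq)  # agree up to simulation noise
```

The agreement is the finite-horizon shadow of de Finetti’s theorem: the Bayesian agent’s one-step-ahead prediction and its expectation of the long-run frequency are the same number.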

I actually find motivating applications of probability via ideas like exchangeability and symmetry a nicer starting point than “plausible reasoning” (which carries too much baggage); it also sticks closer to the “real,” application-agnostic foundation – the axioms defining the mathematical model.

Of course, it’s also nice that the (assumed) requirements of many applications lead to the same mathematical model. But I like the idea of moving back and forward between the physical or application concept and the corresponding mathematical model, while keeping these separate.

“Probability as a model of observable relative frequencies can’t get you to probability as a calculus of plausible inference.”

I think of probability-as-calculus-of-plausible-inference as derived from an analogy with finite populations (or, say, finite areas, as on a dart board), a la Bernoulli: “probability is degree of certainty and differs from absolute certainty as the part differs from the whole”.

The derivation I have in mind is a theorem (Cox’s theorem), not an analogy.

Cox’s theorem shows that a bunch of axioms are sufficient to establish probability theory. But all this says to me is that these axioms get you to the same theory that the finite mass/finite population analogy does. They support probability’s “specialness” but not its uniqueness or normativity, unless you think Cox’s axioms are immutable.

But weakening the axioms or adding others can lead to different, not-so-crazy calculi. See e.g. Norton, J.D. (2007). Probability Disassembled. Br J Philos Sci 58, 141–171. (Available here.) The paper suggests an analogy with Euclidean geometry:

Thanks for the link to the Norton paper. My thinking on the grounds for Cox’s axioms can be found here.

Andrew you’re right. There’s no reason at all to think there’s a single foundation for estimating frequencies under uncertainty, estimating non-freq physical parameters under uncertainty, betting under uncertainty, predicting under uncertainty.

All those people who see probabilities as a calculus of plausible inference in the face of uncertainty just aren’t smart enough to see how inherently flawed this is.

Anon:

This sarcasm is unhelpful. I refer you to chapter 1 of BDA where we have several different examples of empirically determined probabilities.

Life is full of little ironies: it occurs to me that I’m arguing to the author of possibly the most well-written and popular advanced text on Bayesian statistics that the Bayesian approach is more awesome than he thinks it is.

I think it was always one of Andrew’s selling points that he has a – let’s say – pluralistic Bayesian approach. A careful and benevolent criticism of the traditional reasoning, one which allows applied researchers and statisticians who’ve been taught the frequentist approach all along to slowly get accustomed to the heretical ideas of Bayesian statistics, is far more useful than the aggressive and sometimes even proselytizing tone some missionaries of Bayesianism embrace. ;-)

Proselytizing Bayesians seem to very often come from the Church of Jaynes.

That’s just a question of timing. Before Jaynes came the subjective school of Savage and de Finetti, with followers (e.g., Lindley, Kadane) who are the equal of the most strident Jaynesian.

True.

Some men induce a cult following: e.g. Jaynes, Judea Pearl, I J Good, C S Pierce.

“There’s only one” (K O’Rourke) ;-)

Daniel:

Amy argues Peirce (note correct spelling) is pluralistic not me!

http://muse.jhu.edu/journals/csp/summary/v045/45.3.mclaughlin.html

I would agree, but given that encourages inducement of a cult – I will decline!

Hey, in the long run we are all run – in the short run we can pretend not to know that.

In the long run we are all wrong – sheesh.

Jaynes did for Bayesians what Ayn Rand did for libertarians.

@rahul you make so many assertions, yet provide so little information. I’ve never heard you say exactly what it is about Jaynes’s work you disagree with.

That’s all well and good, but it doesn’t really touch on the question at issue (the one at issue for me, anyway), which is a mathematical one. The Bayesian and frequentist perspectives on probability are often portrayed as antithetical – typically: Bayes = belief, frequentist = empirical relative frequencies – so it’s important to be clear that the actual relationship is that Bayesian probability calculations are a superset of frequentist probability calculations, and exchangeability is the bridge. (This is a bare mathematical fact that leaves open the question of which calculations are the correct ones for learning from data, and that is the question on which the two schools are truly at odds.)

+1

And as above, that’s why I find starting from probability calculations and using exchangeability as a bridge to Bayes far more convincing than most typical presentations. And just a nicer way of presenting probability applications, full stop.

I’m curious [without wanting to spark another tiresome debate – external links rather than a long comment thread are fine ;-)] – are there any frequentist criticisms that directly address this approach? I have seen frequentists invoke exchangeability and then drop the priors…

To answer my own question – see Gillies, D. (2000) Philosophical Theories of Probability, Routledge, for a start.

> frequentists invoke exchangeability and then drop the priors…

See comment by David Cox on the Lindley and Smith paper – Bayes estimates for the linear model. 1972.

The argument seems to be that the “common physical random mechanism” is what _needs_ to be modeled, and that this can be done without a prior (albeit accepting poor repeated-sampling properties when data are sparse, until better higher-order asymptotics are somehow developed). So drop the prior.

There does seem to be a taboo against using a random model for anything other than physical random mechanisms, for any purpose (even non-literally)…

One of Andrew’s and my colleagues suggested this is due to over-deference to R. A. Fisher’s dismissal of any one method of induction. That is hard to disagree with – induction cannot be deductive, and Lindley’s recollection of the sacrosanct LP (likelihood principle) providing such an axiom-based (deductive) inference probably painted Bayesianism as largely being induction made deduction.

There is also Fisher’s dismissal of work by Cochran (1937) and Yates (1938) on the analysis of repeated experiments using the Normal-Normal random effects model and variations on it.

Thanks!

re ‘taboo to using any random model for anything other than physical random mechanisms’

– not even if interpreted as an ensemble of deterministic models?

> – not even if interpreted as an ensemble of deterministic models?

In the 8 schools example, the effects from schools are a physical ensemble so those can be considered as random but not the hyper-parameter!

see – http://statmodeling.stat.columbia.edu/2014/01/21/everything-need-know-bayesian-statistics-learned-eight-schools/#comment-153456

Hmm, I personally didn’t mean that it had to be a physical ensemble, but I suppose you are pointing out that there are those who do insist? I guess it matters most if you want to do things like interpret averages over the ensemble physically, e.g. thermodynamics, rather than as a summary of… an ensemble of model runs…

I’ll have to reread exactly to what extent it is a (strict) superset, but I’m not even sure anyone actually denied it. The question here wasn’t a mathematical one, was it? That Bayesian probability is a superset doesn’t make it a better or more fundamental *application*. A more restrictive approach can actually be “more correct,” as it may be more closely aligned with what you want.

It’s not really a question of someone coming out with an explicit denial; it’s more of a default assumption people make on account of the actual dispute between the two schools of thought.

Often you can get clarity on a philosophical question by cashing it out as a mathematical one. We can ask: since Bayes is often presented as an account of subjective or personal probability, what does it have to say about the kinds of scenarios in which frequentist techniques have had the most success, i.e., repeatable trials that are notionally random? The Bayesian perspective on such scenarios is that nothing distinguishes any one “random” trial from any other, so trials are exchangeable. If they are infinitely exchangeable (i.e., the number of exchangeable trials can grow without bound), it follows as a mathematical theorem that a Bayesian agent expects stable relative frequencies, and that the expected outcome relative frequencies are equal to the predictive probabilities of outcomes on the next trial. If it wasn’t already obvious that relative frequencies satisfy the usual probability axioms, we’d be forced to recognize the fact at that point anyway.

I’m instinctively skeptical of the claim that people can reason significantly better in Bayesian terms when presented data one way than another.

How robust is the underlying empirical work, anyone know? After all the iffy Psych studies Andrew covers on the blog on a daily basis, I’m very leery of results that emanate from Psych Depts.

http://bit.ly/1CQqCDC

lmgtfy only works as sarcasm if the search phrase is simple and obvious.

I have been working on using Galton’s two-stage quincunx to get a frequency-like view of statistics with continuous outcomes and multiple samples.

There is an animation, and R code to generate better animations, here: https://phaneron0.wordpress.com/2012/11/23/two-stage-quincunx-2/

What I have learned from some webinars and first-hand tutoring is that there appear to be some conceptual challenges, mainly around grasping the abstract modeling required to represent empirical reality. For instance, in the animations there is one machine that represents how nature generated the observations, and then a second machine that represents how an analyst would model that process and then work with the representation to, say, get an interval for an unknown. Some statisticians get the need for two machines right away, but others do not.

Given that what I am doing is just inefficient simulation from the posterior, most problems could be done this way until they get too complex (e.g., for hierarchical models, just a subset of 5 groups rather than all 30).
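For readers who would rather see the idea than the animations, here is a rough Python rendering of the two-machines setup. The linked post has the actual R code; this normal-normal example and all its numbers are my own illustration, and the matching step makes it exactly the kind of inefficient posterior simulation described above.

```python
# "Two machines" sketch (illustrative normal-normal setup, not the
# linked R code). Machine 1 plays nature; machine 2 is the analyst's
# representation of that process, run many times and filtered down to
# the runs that reproduce the data (inefficient posterior simulation).
import random

random.seed(2)

# Machine 1: nature generates one observation from an unknown mean.
true_mu = 1.3
y_obs = random.gauss(true_mu, 1.0)

# Machine 2: the analyst's representation of the same two-stage process.
kept = []
for _ in range(200_000):
    mu = random.gauss(0.0, 2.0)    # first stage: draw a candidate mean
    y_sim = random.gauss(mu, 1.0)  # second stage: drop it through the quincunx
    if abs(y_sim - y_obs) < 0.05:  # keep runs that match the observed data
        kept.append(mu)

# Interval for the unknown mean, read off the kept draws:
kept.sort()
lo, hi = kept[int(0.025 * len(kept))], kept[int(0.975 * len(kept))]
print(f"approx 95% interval for mu: ({lo:.2f}, {hi:.2f})")
```

The “keep only the matching runs” step is what makes the simulation inefficient: most of the 200,000 runs of machine 2 are thrown away, but the survivors are (approximately) draws from the posterior for the unknown mean.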

Might there be a subconscious aversion to frequencies amongst Bayesians as it blurs the distinction with frequentist methods?

I have taught a freshman/sophomore honors college seminar at the University of Texas and at the University of Vermont. I taught it first at UT in 2000, teaching it five times there and after moving to Vermont, four times at the University of Vermont. Another professor here has taught basically the same course now that I’ve retired, and Jim Berger taught a version at Duke at least once.

Since the students were, except for membership in the honors college, from the general population, there could not be an assumption about mathematical ability beyond what being in the honors college implied, nor an assumption about major. In fact, most of the students were not majoring in STEM subjects. I had a significant number of pre-med and pre-law students, even a dance major once. Of course, there were some math/physics types as well (and three of them out of about 150 students that took it during those 9 offerings actually became professional statisticians, though that wasn’t the goal of the course).

The seminar used only finite state spaces (so no calculus) and was organized primarily as a course in Bayesian decision theory. I taught the probability using Gigerenzer’s ideas and his book, “Calculated Risks,” which I highly recommend for a context like this. I also usually used Hammond, Keeney and Raiffa’s “Smart Choices” and sometimes recommended other books.

I really think that Gigerenzer’s ideas about natural frequencies as an approach to teaching probability work well, and I highly recommend them. Incidentally, I have often run across folks who, when they learn that I’m a statistician, tell me how much they hated their statistics course. I then tell them that that’s not the kind of statistics I teach, and give them a simple example (usually false positives of mammograms, since it’s an easy example and can be done without even writing anything down), to which the response is something like “Wow, that’s so simple. I wish my course had been taught like that!”

The link to the course webpage the last time it was taught is here:

http://bayesrules.net/hcol196.html

There’s a link there to my course blog, which includes shots of the whiteboards taken with my iPhone. (There are other courses there as well; this offering was taught in the Spring of 2011, which should make it easy to find the relevant entries.)

It’s hard to distinguish “real” difficulty from having to unlearn all those named tests from intro stats.

High time to apply for funding to repeat the study on a cohort from a remote tribe in the Amazon rainforest.

1. Andrew quoted Haynes Goddard:

“This literature has amply demonstrated that people actually can readily and accurately reason in Bayesian terms if the data are presented in frequency form, but have difficulty if the data are given as percentages or probabilities.”

I suspect that part (but not all) of the problem is that many people have problems with percentages and proportions, hence also with probabilities. Starting with frequencies can help them get over this hurdle.

2. Andrew said:

“Probability is a mathematical model with many different applications, including frequencies, prediction, betting, etc. There’s no reason to think of any one of these applications as uniquely fundamental.”

See http://www.ma.utexas.edu/users/mks/statmistakes/probability.html for an example of (one aspect of) how I usually handle the multi-faceted aspect of probability in teaching (undergraduate and graduate).

Somewhat along these lines, the first article in the current issue of The American Statistician, by Samsa, shows how to break apart the “most published medical research is wrong” argument along frequency lines to make it clearer.

http://www.tandfonline.com/doi/full/10.1080/00031305.2014.951127#abstract

I’m not sure about the ‘grain of salt’ sentence in the abstract, but the tables in the article are clearly laid out.

+1

This NY Times piece by Steven Strogatz deserves a link here. He raises lots of the relevant pedagogical questions, for a general audience (if NY Times readers count as a general audience).

http://opinionator.blogs.nytimes.com/2010/04/25/chances-are/

As a math person, the intuitive explanation feels wrong to me because it’s noisy: it drags in information about population size that makes the calculations harder. I think that may be why the hard-to-learn method was chosen – the intuitions of those who are good at a field and know it well do not line up with the intuitions of those still learning it.

Learning about probabilities seems easier if you’re mathematically inclined and intend to learn advanced statistics. I doubt a one-size-fits-all approach is the way to go.