I’m not sure how the New York Times defines a blog versus an article, so perhaps this post should be called “Bayes in the blogs.” Whatever. A recent NY Times article/blog post discusses a classic Bayes’ Theorem application — probability that the patient has cancer, given a “positive” mammogram — and purports to give a solution that is easy for students to understand because it doesn’t require Bayes’ Theorem, which is of course complicated and confusing. You can see my comment (#17) here.

Of course it is Bayes' theorem, and mathematically equivalent.

But I have been teaching the concepts using Gigerenzer's method for well over ten years, and I find that the approach is much easier for people to understand when they are naive about probability.

For example, using natural frequencies on a population of 1000 patients, I can do the breast cancer example (and have done it many times) for someone who has just told me how much she hated her statistics course 30 years ago, or for someone who has never taken statistics, entirely with words, no pencil and paper, and she will understand it. Furthermore, in my classes the students are able to apply the idea to solve other problems. Only after they get some practice with this do I use the idea to introduce the "formal" form of Bayes' theorem.
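As a sketch of the natural-frequency reasoning on 1000 patients (the specific rates here are the textbook values often used for this example, not figures taken from the comment: 1% prevalence, 90% sensitivity, roughly 9% false-positive rate):

```python
# Natural-frequency version of the mammogram problem on 1000 patients.
# Assumed rates (standard textbook values, not from the comment above):
# 1% prevalence, 90% sensitivity, ~9% false-positive rate.
population = 1000
with_cancer = 10                          # 1% of 1000 have cancer
true_positives = 9                        # 9 of those 10 test positive
without_cancer = population - with_cancer # 990 do not have cancer
false_positives = 89                      # roughly 9% of the 990 test positive

# Of all women with a positive test, what fraction actually has cancer?
p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(round(p_cancer_given_positive, 3))  # → 0.092
```

The whole argument is counting: 9 true positives out of 98 positives total, so under one in ten. No formula needs to be invoked until afterward.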

I would not try to solve this problem with no pencil and paper and a naive person listening using Bayes' theorem in its usual form. It would be like talking to a tree.

My hope while reading the blog – before I read your post here – was that he meant people do this process rather than take out their cheat sheet on Bayes. In other words, I think he misspoke or glided past Bayes because people do that.

You might want to check out comment 28 and the link. The labels are hideously confusing but the graph is a nice idea.

Basically what he's saying is that people *psychologically* have an easier time understanding Bayes' Theorem if you multiply the whole thing by a power of ten such that no intermediate result is smaller than 1.

It's a mathematically trivial result that A * 10000 / 10000 = A, but it's not psychologically trivial.

I wish he'd said it like that, but other than that I think Strogatz is doing a favor by bringing this kind of stuff up in the context he's writing in.

Basically having real numbers/real world examples is easier to understand than formulas alone.

Sometimes you want a different multiplier than a power of 10. For example, Monty Hall (if you use that example early on, as I do).
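For Monty Hall, multiples of 3 keep the frequencies whole: in 300 games your first pick is wrong in 200 of them, and switching wins exactly those. A quick simulation check of that switching argument (my own sketch, not from the comment):

```python
import random

# Monty Hall, always switching. The natural-frequency argument:
# switching wins exactly when the first pick was wrong (2/3 of games).
def play_switch(rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a door that is neither the pick nor the car.
    opened = next(d for d in doors if d != pick and d != car)
    # Switch to the remaining closed door.
    final = next(d for d in doors if d != pick and d != opened)
    return final == car

rng = random.Random(0)
games = 30_000
wins = sum(play_switch(rng) for _ in range(games))
print(wins / games)  # close to 2/3
```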

Bill: As we have discussed before, I think you have it the least wrong

"Only after they get some practice in this" move on to equivalent formulas

I even do Simpson's paradox this way now

4/5 > 7/10

3/10 > 1/5

(3 + 4)/(10 + 5) ?> (7 + 1)/(10 + 5) (no: 7/15 < 8/15, so the inequality reverses)
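The reversal is easy to verify with exact arithmetic (a quick check, not part of the original comment):

```python
from fractions import Fraction as F

# Simpson's paradox with the small fractions above:
# one group wins both subgroup comparisons yet loses in aggregate.
a1, a2 = F(4, 5), F(3, 10)   # group A's rates in the two subgroups
b1, b2 = F(7, 10), F(1, 5)   # group B's rates in the two subgroups

assert a1 > b1 and a2 > b2   # A wins both subgroups

# Pool the raw counts (add numerators and denominators).
a_pooled = F(4 + 3, 5 + 10)  # 7/15
b_pooled = F(7 + 1, 10 + 5)  # 8/15
assert a_pooled < b_pooled   # ...but A loses overall
print(a_pooled, "<", b_pooled)
```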

In Bayes, the joint data table with round enough numbers is exact and a much less error-prone calculation device

100,000-patient table:

              Total   Negative   Positive
No disease   99,200     92,256      6,944
Disease         800         80        720

720/(720 + 6944) = .0939
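The counts in that table correspond to a prevalence of 0.8%, sensitivity of 90%, and specificity of 93% (rates inferred from the counts, not stated in the comment):

```python
# Rebuild the 100,000-patient joint table from the implied rates.
N = 100_000
prevalence = 0.008    # 800 of 100,000 have the disease
sensitivity = 0.90    # 720 of the 800 test positive
specificity = 0.93    # 92,256 of the 99,200 test negative

disease = round(N * prevalence)                # 800
no_disease = N - disease                       # 99,200
true_pos = round(disease * sensitivity)        # 720
false_neg = disease - true_pos                 # 80
true_neg = round(no_disease * specificity)     # 92,256
false_pos = no_disease - true_neg              # 6,944

p_disease_given_pos = true_pos / (true_pos + false_pos)
print(round(p_disease_given_pos, 4))  # → 0.0939
```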

getting an approximate joint data table from simulation provides a bridge to MCMC, but that is much harder to grasp
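A sketch of that simulation idea, using the same rates implied by the exact table above (0.8% prevalence, 90% sensitivity, 7% false-positive rate; the two-stage structure is my own illustration of the approach):

```python
import random

# Approximate the joint disease/test table by two-stage simulation:
# first draw disease status, then draw the test result given status.
rng = random.Random(42)
counts = {("disease", "pos"): 0, ("disease", "neg"): 0,
          ("healthy", "pos"): 0, ("healthy", "neg"): 0}

for _ in range(100_000):
    has_disease = rng.random() < 0.008             # 0.8% prevalence (assumed)
    if has_disease:
        positive = rng.random() < 0.90             # 90% sensitivity (assumed)
    else:
        positive = rng.random() < 0.07             # 7% false-positive rate (assumed)
    counts[("disease" if has_disease else "healthy",
            "pos" if positive else "neg")] += 1

# Condition on a positive test, just as in the exact table.
pos_total = counts[("disease", "pos")] + counts[("healthy", "pos")]
print(counts[("disease", "pos")] / pos_total)  # close to .094
```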

K?

Maybe you misunderstood me. I use the Gigerenzer approach long enough for the students to get the idea and use it on simple problems.

Then I unpack it and introduce other tools, which we use for the rest of the semester. (Natural frequencies are inadequate for a number of the problems I pose over the semester.)

It is a freshman/sophomore honors course that I have taught at Texas and Vermont eight times. The students in the course have all sorts of majors, pre-med, journalism, math, engineering, even one dance major. Thus, I have to stop short of calculus. I gave a paper on this at the recent Jim Berger Festschrift. See

http://bergerconference2010.utsa.edu/Speakers.htm

Or maybe I misunderstood your point.

Bill – I'll have a closer look at your slides but I think we agree where to start

Natural frequency or table presentation of a diagnostic problem to emphasize the joint model, marginalizing and conditioning.

Once that's _driven home_, move on to more challenging things

I like to make a distinction between the (always somewhat wrong) model, which is the table, and the marginalizing and conditioning, which are the (always correct) calculations from the model/table that get a more relevant before-and-after-test-results model for _your probability_ of disease

Not sure if the simple table makes the model-versus-calculations distinction easiest to grasp

When I move on, it's to get credible intervals, ideally after seeing plots of log-prior + log-likelihood for a relevant parameter of interest

And I avoid MCMC or any other math – as long as possible – ideally until after they have grasped what Bayes is _really about_

K?

What I don't get is why this is called a "Gigerenzer approach". It's strange for Strogatz to credit Gigerenzer in that way.

What is also surprising is how few textbooks cover this type of example, P(A|B) versus P(B|A) in a screening test of any kind. While preparing a recent talk, I searched many textbooks and ended up finding it in Gigerenzer's book too. I couldn't believe how many do not have it.

I have other thoughts on what's missing from Strogatz's column. Will post when I get to write them down.

K?

I think we do agree. I misread your first comment.

The course I talked about at the Berger Festschrift never gets to MCMC. I use "spreadsheet" calculations on the whiteboard with perhaps 10 states of nature to simulate the continuous case (students whip out their calculators and compute the likelihoods, etc.). I point out that a computer spreadsheet can deal with perhaps 100 states of nature and get better accuracy. I do point out for students who have had calculus (not all have) the connection between the spreadsheet calculation and the Riemann integral. For those students, it is an "aha moment." But such spreadsheets are the most anyone does with computers.
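A sketch of that whiteboard "spreadsheet" (the beta-binomial setup here is my own illustrative choice, not necessarily the course's actual example): discretize the parameter into 10 states of nature, compute prior times likelihood in each, and normalize.

```python
from math import comb

# Grid ("spreadsheet") Bayes with 10 states of nature, approximating a
# continuous parameter. Illustrative setup: unknown success probability
# theta, with observed data of 7 successes in 10 trials.
thetas = [(i + 0.5) / 10 for i in range(10)]   # 10 states of nature
prior = [1 / 10] * 10                          # flat prior over the grid

k, n = 7, 10                                   # observed data
likelihood = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]

joint = [p * l for p, l in zip(prior, likelihood)]
posterior = [j / sum(joint) for j in joint]    # normalize, per Bayes' theorem

for t, p in zip(thetas, posterior):
    print(f"theta = {t:.2f}  posterior = {p:.3f}")
```

A finer grid (say 100 states) approaches the Riemann-integral answer, which is exactly the connection mentioned for students who have had calculus.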

This does limit what we can do, but it is surprising how far one can go. And by keeping it simple I can work harder on "understanding the principles."

The course is half decision theory and sticks strictly to finite state spaces. The whole idea is to get people thinking in a Bayesian way, because (as noted in the NYT blog) probability language is ubiquitous in the world out there, and most people don't know how to deal with it or understand it. And I believe that ultimately the Bayesian approach is more intuitive and more easily understood than the standard approach.

BTW, the "Happy Course" that Xiao-Li Meng got going at Harvard has a similar target audience and similar goals to my course. They do have a somewhat different selection of topics, but again the goals are understanding principles rather than developing sophisticated techniques, and useful for a general audience. Xiao-Li attended my talk and we had a very nice discussion afterwards. In fact, he followed me on a very different subject, but took a few minutes out to comment on my talk, which was very nice of him.

Here's a URL for a short description of Xiao-Li's course. There's a nicer pdf version but I don't see it offhand:

http://www.stat.harvard.edu/Academics/invitation_…

I learned this routine in primary school: you have to translate every sentence into a symbolic formula and finally write down the quantity you are supposed to get in symbolic form.

All the remaining work is to recall the relevant/appropriate formula from the textbook and use it. After that, the entire process is purely mechanical: make sure you replace each symbol with the right number.

The point is, we do not have to re-invent Bayes' theorem every time we have to use it.

Wei,

What you describe is exactly the opposite of what I'm trying to do. Physicists call this way of doing things "plug and chug," and it is a very ineffective way to teach physics, and I would wager, probability and statistics.

Eric Mazur, a physicist at Harvard, thought he was a pretty good teacher. He could give a problem where there was a circuit diagram, resistors, batteries, etc., and ask the students to figure out the voltages and currents in each leg of the circuit. The students were very good at that because they'd learned Ohm's law etc. and could "plug and chug." But then he decided to give them a different kind of test. So he designed a number of simple circuits with batteries, resistors (of unknown resistance), switches, and light bulbs (of unknown resistance) and asked different questions, like "What will happen when I close this switch? Will the light bulb brighten, dim, go off, or stay the same?"

He was shocked to find that the students were completely unable to solve these problems.

He figured out that the methods that these very smart Harvard students were being taught did not give them the intuitive "feel" for the underlying physics that good physicists really need.

So he developed an entirely different method of teaching, Peer Instruction, that is quite effective. This method has been applied in many technical areas with good success. I have used a version of it myself. See:

http://www.physics.umd.edu/perg/role/PIProbs/

Bill:

I'm a big fan of Mazur's ideas (as I believe I've mentioned on occasion on this blog). But I do think it's a good idea, when discussing these methods, to emphasize that ultimately what we're trying to teach students is not specific skills or even general understanding, but rather the ability to find the resources needed to solve problems on their own in the future. That is, we want students to have enough skills to recognize and navigate basic problems and enough understanding to know where to go when they're stuck.

Not knowing what others need to grasp to ensure that they can solve problems on their own in the future is perhaps the real challenge.

Mostly the guide we have for this is our past selves, and as Ken Iverson, the APL developer, nicely pointed out, that's not often a good guide for what others need to grasp.

So, for instance, my two-stage sampling (a.k.a. nearest neighbors) recasting of Bayes' theorem would have been helpful to a past me – http://www.stat.columbia.edu/~cook/movabletype/ar… – but not to everyone

It seemed to catch Bill by surprise, and examples of continuous parameter inference are quite doable as a bridge to MCMC
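One way to read the two-stage sampling recast (a rejection-style sketch; the beta-binomial setup is my own illustration, not taken from the linked post): draw a parameter from the prior, simulate data given it, and keep only the draws whose simulated data match the observed data. The kept draws approximate the posterior.

```python
import random

# Two-stage sampling view of Bayes' theorem (rejection-style sketch).
# Stage 1: draw theta from the prior. Stage 2: simulate data given theta.
# Keep draws whose simulated data match the observed data exactly.
rng = random.Random(1)
observed = 7          # illustrative data: 7 successes in 10 trials
kept = []

for _ in range(200_000):
    theta = rng.random()                                  # stage 1: flat prior draw
    sims = sum(rng.random() < theta for _ in range(10))   # stage 2: simulate trials
    if sims == observed:                                  # keep exact matches
        kept.append(theta)

# For a flat prior, the kept draws approximate Beta(8, 4),
# whose mean is 8/12 = 2/3.
print(sum(kept) / len(kept))  # close to 0.667
```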

K?