Bill Harris writes:

I’m not a professional statistician, but I do use statistics in my work, and I’m increasingly attracted to Bayesian approaches.

Several colleagues have asked me to describe the difference between Bayesian analysis and classical statistics. I think I’ve not yet succeeded well, and so I was about to start a blog entry to clear that up. Then I decided to look around.

Your “‘Bayesian inference’ represents statistical estimation as the conditional distribution of parameters and unobserved data, given observed data” from “Objections to Bayesian statistics” is certainly concise, but it may be a bit too concise for managers and analysts who have some understanding of statistics. Your “Why we (usually) don’t have to worry about multiple comparisons” sounds promising, but it’s a tad long to hand to someone with a simple question.

Any ideas? Fodder for a blog posting?

I started down the path of dividing statistical analysis into three parts: setting up the problem, calculations, and communicating the results.

It’s easy to find Web pages about the first, but many dwell on the notion of subjective priors.

The second involves comparing the selection of the proper classical method (Tom Loredo has some articles pointing out those challenges, as I recall) vs. “simply” applying probability theory while often letting a computer grind through the integration. There’s more power, as your “Why we (usually) …” article points out.

The third offers the choice of focus from how one should really interpret confidence intervals and what an hypothesis test is to probabilities of events. That makes sense, but I’m still looking for ways to tighten it all up.

Bill was also pointed to this article by Kevin Murphy, which looks interesting but has almost no resemblance to Bayesian statistics as I know it.

One problem with finding statistical resources on the web, I think, is that a webpage on a technical issue is likely to have been written by a computer scientist. And what computer scientists do with data and models is often much different from what we do.

My current favorite online summary of Bayesian statistics is the article by Spiegelhalter and Rice.

Or, if you want a (somewhat) flashy demo showing what you can get out of Bayes, see these maps of public opinion cross-classified by demographics and state. (Nicer version here.)

As for some points from Bill:

“conditional distribution of parameters and unobserved data, given observed data” don’t forget the _AND_ the specification of a joint probability model i.e. prior for parameters and data model/likelihood for the observed data

“: it states the probability of a particular event we care about happening” yes but unless the prior information is especially credible that is just a formal probability of little value or real interest to anyone (I believe Andrew has said something similar before)

When pressed in an interview for an _elevator response_ I once defined classical statistics as trying [not necessarily succeeding but maybe sufficing] to get by without a prior.

Recently, Brad Efron in his OB09 talk suggested that “Very roughly speaking, the difference between direct and indirect statistical evidence marks the boundary between frequentist and Bayesian thinking” and seemed to suggest that whereas Classical tries to use no indirect evidence at all, Bayesian tries to use all the world’s indirect evidence …

(I guess quantified in the prior, so then Bayesian is to use log posterior = log prior + log likelihood, up to an additive constant, whereas Classical uses just the log-likelihood)
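Written out, the relation in the parenthesis above is just Bayes' theorem on the log scale. The symbols θ (unknown parameters) and y (observed data) are notation assumed here for illustration, not taken from the comment:

```latex
\log p(\theta \mid y) = \log p(\theta) + \log p(y \mid \theta) - \log p(y)
```

The last term does not depend on θ, so it acts as a constant. Dropping the log-prior term as well leaves only the log-likelihood, which is the quantity the classical approach works with.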

Great to be writing up _elevator talks_ if anything to encourage those who believe they understand the differences to better explicate them for the rest of us!

Keith

E. T. Jaynes's book [1] is a classic reference for discussion from first principles (i.e., the Cox Axioms). David MacKay [2] also has some excellent references.

1. http://www-biba.inrialpes.fr/Jaynes/prob.html

2. http://www.inference.phy.cam.ac.uk/mackay/

Recalling the pragmatic explanation of a computer as something that adds, subtracts and multiplies very quickly and accurately

The Bayesian approach is something that often directly generates intervals for unknowns of interest

that have almost as good or even better properties in repeated use than attempts to obtain classical confidence intervals

which can be considerably more difficult and indirect and in some cases even currently infeasible

(small print – if the prior is not botched)

Keith

Marcus:

The Jaynes and MacKay books are excellent, but from a statistical perspective, I prefer chapter 1 of Bayesian Data Analysis.

To you, the Cox axioms are first principles; to me, the empirical estimation of probabilities (that is, "frequentist statistics") are the first principles. And, as Keith says, I like Bayesian methods because they do such a good job of estimating empirical probabilities.

I've always regarded the main difference between Bayesian and classical statistics to be the fact that Bayesians treat the state of nature (e.g., the value of a parameter) as a random variable, whereas the classical way of looking at it is that it's a fixed but unknown number, and that putting a probability distribution on it doesn't make sense.

This leads to the distinction between a Bayesian credible interval (which is a distribution on the parameter, viewed as a random variable, the distribution conditioned on the particular data observed), and a classical confidence interval (where the confidence interval itself is regarded as the random variable over the ensemble of possible observed data, but instantiated by the observed data). This in turn leads to the difference between the interpretation of a credible interval and the confidence interval; the latter requires the notion of "coverage" to interpret as a probability, but at the cost of losing the conditioning on the data that were observed in favor of a statement about the ensemble of data, nearly all of which were not observed.
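To make the notion of "coverage" concrete, here is a minimal simulation sketch that is not part of the original comment. It assumes a normal model with known standard deviation; the function name `coverage` and every numeric setting are illustrative assumptions.

```python
import random
import statistics


def coverage(true_mean, sigma, n, repetitions, seed=0):
    """Fraction of repeated 95% intervals that contain the fixed true mean."""
    rng = random.Random(seed)
    # Half-width of the usual 95% interval when sigma is known.
    half_width = 1.96 * sigma / n ** 0.5
    hits = 0
    for _ in range(repetitions):
        # Each repetition is one data set from the ensemble of possible data.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # The interval is the random object; the true mean never changes.
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / repetitions


rate = coverage(true_mean=3.0, sigma=2.0, n=25, repetitions=2000)
print(rate)  # should land close to 0.95
```

In this particular setting, a flat prior on the mean gives a 95% posterior interval that is numerically the same as the confidence interval, which may be one reason the two are so easily confused. The difference lies in what the 95% refers to: the long-run behavior over data sets simulated above, or the uncertainty about the mean given the one data set actually observed.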

Since my background and training are in the physical sciences, I've noticed that all but the most sophisticated of my colleagues (that is, those that have learned enough statistics to be dangerous :0), think that a confidence interval is a credible interval. Which is natural, if mistaken.

I can't really do much better than the first and second sentences of

BDA:"By Bayesian data analysis, we mean practical methods for making inferences from data using probability models for quantities we observe and for quantities about which we wish to learn."

and

"The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis."

To the uninitiated, this just sounds like a description of statistics. To a mathematician or computer scientist, as soon as you lay out measure theory, Bayesian inferences are derivable as theorems using simple calculus.

To understand why Bayesian statistics is different from frequentist approaches, you need to understand the frequentist notion of hypothesis testing, which seems to require even more work than teaching someone Bayesian stats.

Thanks for all the insights. I'm learning from all of them. Yet I wonder: if we told these to the average college-educated non-statistician (e.g., a manager or other professional in business), what would they hear? Would they begin to get an idea of the times they should ask for a classical statistician and the times they should ask for a Bayesian? Okay, perhaps the measure of the first set, given this audience, is rather small. :-) Would they become curious enough to want to learn more?

Not sure this is what you need, but SAS has published a 48 page, resolutely pragmatic "Introduction to Bayesian Analysis Procedures". It appears geared toward dyed-in-the-wool frequentists, and I'd be curious to know what folks around here think of it.

http://support.sas.com/rnd/app/da/focusbayesian.h…

"Bill was also pointed to this article by Kevin Murphy, which looks interesting but has almost no resemblance to Bayesian statistics as I know it."

That's because the link is about constructing graphical models/Bayesian networks, which use Bayes theorem to update the network based on a stream of data. The mathematical underpinnings of Bayesian statistics and Bayesian networks have some overlap (presumably via Bayes) but the day to day language/techniques are from different worlds (statistics vs machine learning).

I have the same opinion as Bill Jefferys. I also think that the main difference between Bayesian and classical statistics is the fact that Bayesians treat the state of nature as a random variable.

Here's an experiment that I use in class (in fact, will do this tomorrow). I bring a 50 cent piece to class. I say that I am going to flip it, and ask the probability that it will come up heads. Everyone agrees it is 0.5.

Then I flip it (onto the floor) and immediately put my foot on it. No one has seen it at this point. I ask again, what's the probability that it is heads? Most people will say 0.5, but some (particularly those that were paying attention in AP statistics) will say that it is either heads or tails, but they can't quantify it as a probability. Those that say 0.5 are thinking as Bayesians; the others are thinking as frequentists.

I then look at the coin without letting anyone else see it. I say "I now know whether it is heads or tails. What is the probability that it's heads?" Most will still say that it's 0.5. They are still thinking as Bayesians (their background information is different from mine, and they are, perhaps unconsciously, conditioning on the data they have).

I then announce what I saw and ask them, what's the probability that it's heads (suppose I saw heads). This poses something of a conundrum, since many of the students will tumble to the fact that I might not be telling the truth; so many of them will offer a higher number, 0.8 or 0.9, but not 1.0! Very sophisticated, since those students realize intuitively (without using the language) that conditioning on "professor says it is heads" isn't the same as "I saw that it is heads."

I then invite a student to look at the coin and announce what she saw. Usually the student will report the same thing I did (I always tell the truth, BTW). Same question, and the probability will go up. On one occasion the student (one of the best I ever had in 40 years of teaching) decided to report oppositely to what I said. That was fun.

Finally I let everyone take a look for themselves.

This is an exercise in Bayesian thinking (it is legitimate to quantify your uncertainty about a state of nature by putting a probability distribution on it) and conditioning on data (e.g., the professor says this, the student says that, I saw the coin with my own lying eyes).

From my point of view, the distinctive feature of Bayesian thinking, as illustrated with this simple experiment, is just this: using probability distributions to quantify uncertainty about states of nature, and changing those distributions as data arrive by conditioning on the new data.
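The updating in the coin demonstration can be sketched numerically. This is only an illustration, not part of the classroom exercise: the function name `update_heads` and the 90% chance that a reporter states what they actually saw are assumed values, and the two reports are assumed to be independent given the true state of the coin.

```python
def update_heads(prior_heads, says_heads, truth_rate):
    """Posterior P(heads) after one report, by Bayes' theorem.

    truth_rate is the assumed probability that the reporter states
    what they actually saw (a hypothetical number, not from the post).
    """
    # Likelihood of the report under each possible state of the coin.
    like_heads = truth_rate if says_heads else 1.0 - truth_rate
    like_tails = 1.0 - truth_rate if says_heads else truth_rate
    numerator = like_heads * prior_heads
    return numerator / (numerator + like_tails * (1.0 - prior_heads))


p0 = 0.5                                   # before anyone reports anything
p1 = update_heads(p0, True, 0.9)           # professor says "heads"
p2 = update_heads(p1, True, 0.9)           # student also says "heads"
p_conflict = update_heads(p1, False, 0.9)  # student says "tails" instead
print(p0, round(p1, 3), round(p2, 3), round(p_conflict, 3))
```

Under these assumptions the probability of heads moves from 0.5 to 0.9 after the first report and to about 0.988 after a second, agreeing report. A contradicting report from an equally reliable student brings it back to 0.5.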

Maybe you could adapt this idea to your audience.

Another example I use early on is this one: I ask, about mammograms (the numbers are about right), suppose a woman has a mammogram. These are 90% accurate, that is, if a woman has breast cancer, there's about a 90% probability that it will be detected, and if a woman does not have cancer, there's a 90% probability that the mammogram will report that she doesn't have cancer (and a 10% probability of a false positive). I ask, what's the probability that she has cancer if the mammogram is positive? (Actually at this point the problem isn't well-posed and has no answer). Physicians asked this question will with discouraging frequency answer "90%".

I then remark that a piece of information is missing, to wit, the proportion of women in the general population that at any given time has an undetected cancer. (I point out that the general population does not include women known to have higher risk, such as those with relatives that have had the cancer, or those with one of the BRCA genes). This is about 1% (the prior probability). We then use Gerd Gigerenzer's device of "Natural Frequencies" to calculate as follows: Of 1000 women getting mammograms, 1%, or 10, will have undetected cancer and 990 will be cancer free. Of those 10 that have it, 9 will be detected (90%). Of the 990 that do not have it, 99 (10%) will get a false positive. Therefore, the probability that she actually has cancer is 0.083=9/(9+99).
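The natural-frequencies calculation above can be reproduced in a few lines. This is a minimal sketch using the numbers quoted in the comment (1000 women, 1% prevalence, 90% detection rate, 10% false-positive rate); the function name `natural_frequencies` and its parameter names are illustrative choices.

```python
def natural_frequencies(population, prevalence, sensitivity, specificity):
    """Return (true positives, false positives, P(cancer | positive test))."""
    with_cancer = population * prevalence
    without_cancer = population - with_cancer
    # Women with cancer whose mammogram detects it.
    true_positives = with_cancer * sensitivity
    # Women without cancer who nevertheless test positive.
    false_positives = without_cancer * (1.0 - specificity)
    posterior = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, posterior


tp, fp, posterior = natural_frequencies(1000, 0.01, 0.90, 0.90)
print(round(tp), round(fp), round(posterior, 3))  # 9 99 0.083
```

Changing `prevalence` shows how strongly the answer depends on the prior: the same test applied to a higher-risk group gives a much larger probability of cancer after a positive result.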

I recommend Gerd Gigerenzer's book, "Calculated Risks", as a good introduction for lay people to the basic ideas behind Bayesian thinking. Gigerenzer discusses many situations of interest, including crime (the O.J. Simpson case), DNA fingerprinting, and medical examples. Gigerenzer avoids using Bayes' theorem altogether, relying instead on the (mathematically equivalent) "Natural Frequencies" as illustrated in the example above. It can be read by anyone; I use it in my Freshman/Sophomore honors classes on Bayesian inference and decision theory. Parenthetically, these classes are not aimed at mathematically sophisticated students, but at the general population. I've had journalism majors, pre-law and quite a few pre-med students, and even one dance major in that class.

My course is quite similar to the one developed at Harvard, which Andrew mentioned some time ago:

http://www.stat.harvard.edu/Academics/invitation_…

http://www.stat.columbia.edu/~cook/movabletype/ar…

Nicely put, Bill, but perhaps, similar to what CJ Gardin used to say,

"you can't rule out an hypothesis by the way it was generated"

"you can't really determine the usefulness of a method by the way it was motivated"

Using the motivation of conditioning on the observed data in a joint probability model for both unknowns and observed data, the credibility of the joint probability model becomes paramount with the prior part of the model usually being the less credible.

Perhaps the predictable responses to this are a refusal to question the prior at all (it just needs to be someone's _anyone's_ prior) or to check it in any way, or even a search for arguments that in the very, very long run it does not matter.

On the other hand, progress in applications is being seen by making priors more wrong (weakly informative) rather than less wrong …

Keith

Bill: I've actually done this demo myself (complete with peeking at the coin and asking the students: "NOW what is the probability?"). I wonder if we both read about it in the same place, or if it's just such an obvious idea that we (and, presumably others) thought of it independently.

Another (vaguely related) issue, is that, strictly speaking, the laws of conditional probability are false. Consider the two-slit experiment.

Can you explain the comment on conditional probability technically being wrong based on the two slit experiment?

Andrew: I'm pretty sure I thought this demo up independently when I was first teaching Bayesian things (even before the honors class I described). Certainly since the early '90s. It seems pretty obvious, and I'll bet you did the same.

As to the two slit experiment, it all depends on how you look at it. Leslie Ballentine wrote an article a number of years ago in The American Journal of Physics, in which he showed that conditional probability can indeed be used to analyze the two slit experiment. You just have to do it the right way. I'll see if I can find the citation for you.

Andrew: I'm pretty sure that this is the online citation for the Ballentine article I mentioned:

http://dx.doi.org/10.1119/1.14783

I'm at home and my online access to the journal isn't working at the moment, but the abstract and even the year seem right.

Andrew: That is the right citation (for the Ballentine paper).

Rich: Read the Ballentine paper, which discusses not only the comment but also its resolution.

The laws of conditional probability can yield nonsense when the prior information includes false premises like "electrons are particles".

Funny that, I like Gigerenzer's books too, especially "Calculated Risks", and recommend it to anybody who wants to learn about statistical thinking and how to do calculations without using a formal probability approach. In that particular book, Gigerenzer states on page 29, "In this book, I will focus on risks that can be quantified on the basis of frequency data".

Based on this, other comments in the book and other writings of Gigerenzer, it is my strong impression that he is a Frequentist and there is little about Bayesian thinking in his writing. In fact, I do not remember Gigerenzer ever mentioning, much less specifying, a prior distribution.

B. A. Turlach: I don't know if Gigerenzer is a Bayesian or a frequentist, but his "Calculated Risks" book definitely uses prior distributions, even though he doesn't explicitly use the term. For example, on p. 45, the right hand part of the figure calls out p(disease)=0.008 explicitly. Whether he calls it a prior or not, that's what it is.

Similarly, in Chapter 8, which discusses the O. J. Simpson trial, he's adapting a Bayesian calculation that Jack Good originally published in Science magazine (two letters to the editor). These are cited on p. 286 of the book.

As far as I can see, there isn't a significant calculation in the book that isn't just recasting a Bayesian calculation in the form of his "natural frequencies" device.

Gigerenzer is very aware that his calculations are mathematically equivalent to Bayesian ones (p. 46); he just thinks (with justification) that they are easier to explain to people who are unfamiliar with the concepts. So, for example, when someone asks me what I teach, I often use his calculation and method to show that a positive mammogram is very far from a death sentence, and I don't need a blackboard to do this. It can all be done in words. I don't have to explain conditional probability, nor Bayes' theorem, to get the idea across.

All of his priors derive either from the logical statement of the problem (e.g., Chapter 13) or from observational data (e.g., the rate of undetected breast cancers in the general population that receives mammograms). But they are still priors, even though more advanced calculations often use other principles not used in the book to choose priors.

Anyway, I have used the book successfully for many years as a starting point to teach Bayesian concepts to statistically naive students. Then I show how to go from the natural frequencies calculation (presented as a graphical tree) to an equivalent probability tree by dividing by the base population. Then I can explicitly introduce the definition of conditional probability, joint probability and prior probability. And from the tree we can read off Bayes' theorem. So, whether or not Gigerenzer himself is a Bayesian, his book is for me a great pedagogical device for teaching Bayesian statistics.
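As a sketch of that step, here is the mammogram example from earlier in the thread with the counts divided by the base population of 1000. The notation is assumed here for illustration: C stands for "has cancer", \bar{C} for "does not", and + for a positive mammogram.

```latex
\begin{aligned}
P(C \mid +) &= \frac{9}{9 + 99}
  = \frac{9/1000}{9/1000 + 99/1000} \\
  &= \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid \bar{C})\,P(\bar{C})}
  = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.10 \times 0.99} \approx 0.083
\end{aligned}
```

Each count on the natural-frequencies tree becomes a joint probability once divided by 1000, and the final ratio is Bayes' theorem.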

A bit more on Gigerenzer:

I first learned about the "natural frequencies" approach through two short items in Science:

http://www.sciencemag.org/cgi/content/full/290/55…

and a supportive letter:

http://www.sciencemag.org/cgi/content/full/292/55…

As I reread these short pieces, I see one important theme. Lots of people in important positions, physicians, law professionals, others, don't understand probability very well, and so can't explain things accurately to their even less-sophisticated clients very well. "Natural frequencies" are an effective way to circumvent bad ways of thinking about probability that produce bad answers, with good ways that give good answers. I don't see anything here that is particularly "frequentist." I see only Bayesian calculations cast into a more easily understood framework.

Re Bill Jefferys class experiment, I have posted on what I see as serious flaws in his reasoning at my statistics blog http://blogs.mbs.edu/fishing-in-the-bay/?p=227. At least if the point of the experiment is to show that students are naturally Bayesian, the whole exercise is a sham. If it is just to get students thinking like Bayesians, it is fine.

I will be following up with other posts on why the ridiculous claims of Bayesian superiority are unjustified. I would welcome your input, folks!

I've always thought Tony O'Hagan's concise article, "Dicing with the Unknown," was the best description of the difference between Bayesian and "classical" approaches to statistics.

I was actually introduced to the article on this blog.

Bill: Guess I will answer your two comments in one go.

If you define any calculation that uses Bayes' theorem as using "Bayesian thinking", then we are all Bayesians and there are no frequentists. Thus, not surprisingly, I do not subscribe to the idea that using Bayes' theorem makes you Bayesian. :)

I learned about "natural frequencies" from Gigerenzer's book and realised that this is the approach I take when a rough back-of-the-envelope calculation is sufficient and I do not have access to a calculator. It is also a great way of teaching people who are a bit math-phobic how to do these calculations. But I do not see anything in this approach that is particularly related to "Bayesian thinking".

What you claim are prior distributions are observed frequencies, something a frequentist would be happy to use as estimates for the unknown population parameters in further calculations. For a Bayesian approach, I would require some prior to be put on those population parameters, presumably what you refer to as "more advanced calculations often use other principles not used in the book to choose priors".

The article by Tony O'Hagan that S. McKay Curtis gave us is very nice; but it seems to have a typo in the second column on the first page (just above the section heading), where Tony writes, "One characterisation of the difference between these two schools of statistical theory is that frequentists do not accept that aleatory uncertainty can be described or measured by probabilities, while Bayesians are happy to use probabilities to quantify any kind of uncertainty." Unless I am missing something, I think he meant to say that frequentists do not accept that epistemic uncertainty can be described or measured by probabilities.

A slight twist on "natural probabilities" did not work so well for me.

I would construct a fake data set of 10,000 with two variables, D+ and T+, with 30 D+ having 15 T+ and 15 T- and 9970 having D- with 9670 T- and 300 T+.

The exercise was to erase all in the data set that were not T+ and look at the percentage of D+ amongst those left. (The next exercise would be to generate a random data set using prior and data model, and just keep T+s and next step MCMC generated data set that always had T+)
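For concreteness, the first exercise can be sketched as follows. The counts are the ones given above; the variable names and the representation of each person as a (D, T) pair are illustrative assumptions.

```python
# Build the fake data set described above as (disease, test) pairs.
records = (
    [("D+", "T+")] * 15 + [("D+", "T-")] * 15
    + [("D-", "T+")] * 300 + [("D-", "T-")] * 9670
)

# "Erase" everyone who did not test positive: this is the conditioning step.
positives = [disease for disease, test in records if test == "T+"]

# Share of D+ among those left, i.e. the probability of D+ given T+.
share_diseased = positives.count("D+") / len(positives)
print(len(records), len(positives), round(share_diseased, 4))  # 10000 315 0.0476
```

No formula is plugged in anywhere: throwing away the records inconsistent with the observation and counting what remains is the whole calculation.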

Students had a great deal of difficulty with it and many felt compelled to use the data set to estimate probabilities to plug and chug through Bayes' theorem …

But my intent was to demonstrate the logic of Bayes' theorem …

I think we need much more scientific research on how people learn/think about statistics.

Keith

I have responded to Chris Lloyd on his blog. I believe that he is greatly over-interpreting my little experiment.