Konrad Scheffler writes: I was interested by your paper “Induction and deduction in Bayesian data analysis” and was wondering if you would entertain a few questions:


## Standardized writing styles and standardized graphing styles

Back in the 1700s—JennyD can correct me if I’m wrong here—there was no standard style for writing. You could be discursive, you could be descriptive, flowery, or terse. Direct or indirect, serious or funny. You could construct a novel out of letters or write a philosophical treatise in the form of a novel. Nowadays there […]

## Kind of Bayesian

Astrophysicist Andrew Jaffe pointed me to this and discussion of my philosophy of statistics (which is, in turn, my rational reconstruction of the statistical practice of Bayesians such as Rubin and Jaynes). Jaffe’s summary is fair enough and I only disagree in a few points:

## Early stopping and penalized likelihood

Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don't perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data.

This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.
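As a minimal sketch of iterative proportional fitting (the counts and margin totals below are made up for illustration, not taken from Deming and Stephan), the algorithm alternately rescales the rows and columns of a table until both margins match the known totals:

```python
import numpy as np

def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting: rescale a contingency table
    so its row and column sums match known margin totals."""
    fitted = table.astype(float).copy()
    for _ in range(max_iter):
        # Scale each row to match its target row margin.
        fitted *= (row_targets / fitted.sum(axis=1))[:, None]
        # Scale each column to match its target column margin.
        fitted *= col_targets / fitted.sum(axis=0)
        # After the column step, check whether the rows still agree.
        if np.allclose(fitted.sum(axis=1), row_targets, atol=tol):
            break
    return fitted

# Start from observed cell counts; force margins to known totals.
obs = np.array([[40., 10.], [20., 30.]])
fit = ipf(obs, row_targets=np.array([60., 40.]),
          col_targets=np.array([55., 45.]))
```

Stopping the alternation early, as the post notes, leaves the fit partway between the raw data and the margin-constrained model.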

In any case, the trick is to stop at the right point: not so soon that you're ignoring the data, but not so late that you end up with something too noisy. Here's an example of what you might want:

The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision.

Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. By stopping early, you’re compromising between the prior and the data estimates.
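To make that equivalence concrete, here is a toy sketch (simulated data; the step size, iteration count, and penalty are all hand-picked for illustration and not from the post): gradient ascent on a least-squares likelihood, started at zero and stopped early, is compared with the ridge (penalized likelihood) estimate, which is the posterior mode under a normal prior centered at the same starting point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + rng.normal(size=n)

# Gradient ascent on the least-squares log likelihood, started at zero
# and stopped early, well before convergence to the MLE.
beta_early = np.zeros(p)
lr = 0.001
for t in range(25):
    beta_early += lr * X.T @ (y - X @ beta_early)

# Penalized likelihood (ridge) estimate: the posterior mode under a
# normal prior centered at zero, with penalty lam chosen by hand.
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Both estimates are shrunk toward the starting point relative to the unpenalized maximum likelihood fit; the number of iterations plays roughly the role of the inverse penalty.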

Early stopping is a simple implementation but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.

A key question, though, is where to stop? Different rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. This would stop some of the more extreme varieties of overfitting. Such a procedure, however, is *not* the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter.

I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.

The key ideas of the article appear in section 5 on page 433.

**Non-statistical content**

## The greatest works of statistics never published

The other day I came across a paper that referred to Charlie Geyer’s 1991 paper, “Estimating Normalizing Constants and Reweighting Mixtures in Markov Chain Monte Carlo.” I expect that part or all of this influential article was included in some published paper, but I only know it as a technical report–which at the time of […]

## Philosophy and the practice of Bayesian statistics

Here’s an article that I believe is flat-out entertaining to read. It’s about philosophy, so it’s supposed to be entertaining, in any case.

Here’s the abstract:

A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science.

Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.

Here’s the background:

## “Standard textbooks on chemistry do not discuss subjectivity in their introductions, and so statistical textbooks need not to do that either”

Kevin Spacey famously said that the greatest trick the Devil ever pulled was convincing the world he didn't exist. When it comes to The Search for Certainty, a new book on the philosophy of statistics by mathematician Krzysztof Burdzy, the greatest trick involved was getting a copy into the hands of Christian Robert, who trashed it on his blog and then passed it on to me.

The flavor of the book is given by this quotation from the back cover: "Similarly, the 'Bayesian statistics' shares nothing in common with the 'subjective philosophy of probability.'" We actually go on and on in our book about how Bayesian data analysis does not rely on subjective probability, but . . . "nothing in common," huh? That's pretty strong.

Rather than attempt to address the book’s arguments in general, I will simply do two things. First, I will do a “Washington read” (as Yair calls it) and see what Burdzy says about my own writings. Second, I will address the question of whether Burdzy’s arguments will have any effect on statistical practice. If the answer to the latter question is no, we can safely leave the book under review to the mathematicians and philosophers, secure in the belief that it will do little mischief.

## Influential statisticians

Seth lists the statisticians who’ve had the biggest effect on how he analyzes data: 1. John Tukey. From Exploratory Data Analysis I [Seth] learned to plot my data and to transform it. A Berkeley statistics professor once told me this book wasn’t important! 2. John Chambers. Main person behind S. I [Seth] use R (open-source […]

## New book on Bayesian nonparametrics

Nils Hjort, Chris Holmes, Peter Muller, and Stephen Walker have come out with a new book on Bayesian Nonparametrics. It’s great stuff, makes me realize how ignorant I am of this important area of statistics. Here are the chapters:

0. An invitation to Bayesian nonparametrics (Hjort, Holmes, Muller, and Walker)

1. Bayesian nonparametric methods: motivation and ideas (Walker)

2. The Dirichlet process, related priors and posterior asymptotics (Subhashis Ghosal)

3. Models beyond the Dirichlet process (Antonio Lijoi and Igor Prunster)

4. Further models and applications (Hjort)

5. Hierarchical Bayesian nonparametric models with applications (Yee Whye Teh and Michael I. Jordan)

6. Computational issues arising in Bayesian nonparametric hierarchical models (Jim Griffin and Chris Holmes)

7. Nonparametric Bayes applications to biostatistics (David Dunson)

8. More nonparametric Bayesian models for biostatistics (Muller and Fernando Quintana)

I have a bunch of comments, mostly addressed at some offhand remarks about Bayesian analysis made in chapters 0 and 1. But first I’ll talk a little bit about what’s in the book.

## The difference between complete ignorance (p=1/2) and certainty that p=1/2

Ryan Richt writes: I wondered if you have a quick moment to dig up an old post of your own that I cannot find by searching. I read an entry where you discussed if there really was a difference between a prior of 1/2 meaning that we have no knowledge of a coin flip, or […]

## Bayes, Jeffreys, prior distributions, and the philosophy of statistics

Christian Robert, Nicolas Chopin, and Judith Rousseau wrote this article that will appear in Statistical Science with various discussions, including mine.

I hope those of you who are interested in the foundations of statistics will read this. Sometimes I feel like banging my head against a wall, in my frustration in trying to communicate with Bayesians who insist on framing problems in terms of the probability that theta=0 or other point hypotheses. I really feel that these people are trapped in a bad paradigm and, if they would just think things through based on first principles, they could make some progress. Anyway, here’s what I wrote:

I actually own a copy of Harold Jeffreys's Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng, and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin, and Rousseau as a platform for further discussion of foundational issues.

In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys's principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys's preference for simplicity; and (3) a key generalization of Jeffreys's ideas is to explicitly include model checking in the process of data analysis.

## More Bayes rants

John Skilling wrote this response to my discussion and rejoinder on objections to Bayesian statistics. John’s claim is that Bayesian inference is not only a good idea but also is necessary. He justifies Bayesian inference using the logical reasoning of Richard Cox (1946), which I believe is equivalent to the von Neumann and Morgenstern (1948) […]

## Bayes and risk

Someone writes in with the following question:

## Bayesian inference of the median

The median often feels like an ad hoc calculation, not like an aspect of a statistical model. But in fact the median corresponds to a model. Last week, Risi and Pannagadatta at the Columbia machine learning journal club reminded me that the L1 distance from the data (the sum of absolute deviations) is minimized at the median. But the […]
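The claim in the excerpt is easy to check numerically; it is also the sense in which the median "corresponds to a model," since minimizing the sum of absolute deviations is maximum likelihood under a Laplace (double-exponential) error model. A small sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.5, 7.0, 10.0])

def l1_loss(m):
    # Sum of absolute deviations from a candidate center m:
    # the objective that the median minimizes.
    return np.abs(x - m).sum()

# Minimize by brute force over a fine grid and compare to the median.
grid = np.linspace(0.0, 12.0, 1201)
best = grid[np.argmin([l1_loss(m) for m in grid])]
```

Here `best` recovers the sample median (3.5) up to the grid resolution; the same construction with squared deviations would recover the mean instead.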

## Statistical learning theory vs Bayesian statistics

I have come across Vapnik vs Bayesian Machine Learning – a set of notes by the philosopher of science David Corfield. I agree with his notes, and find them quite balanced, although they are not necessarily easy reading. My personal view is that SLT derives from attempts to mathematically characterize the properties of a model, […]

## Richard Berk’s book on regression analysis

I just finished reading Dick Berk’s book, “Regression analysis: a constructive critique” (2004). It was a pleasure to read, and I’m glad to be able to refer to it in our forthcoming book. Berk’s book has a conversational format and talks about the various assumptions required for statistical and causal inference from regression models. I was disappointed that the book used fake data–Berk discussed a lot of interesting examples but then didn’t follow up with the details. For example, Section 2.1.1 brought up the Donohue and Levitt (2001) example of abortion and crime, and I was looking forward to Berk’s more detailed analysis of the problem–but he never returned to the example later in the book. I would have learned more about Berk’s perspective on regression and causal inference if he were to apply it in detail to some real-data examples. (Perhaps in the second edition?)

I also had some miscellaneous comments:

## One more time on Bayes, Popper, and Kuhn

There was a lot of fascinating discussion on this entry from a few days ago. I feel privileged to be able to get feedback from scientists with different perspectives than my own. Anyway, I’d like to comment on some things that Dan Navarro wrote in this discussion. Not to pick on Dan but because I think his comments, and my responses, may highlight some different views about what is meant by “Bayesian inference” (or, as I would prefer to say, “Bayesian data analysis,” to include model building and model checking as well as inference).

So here goes . . .

## Bayes and Popper

Is statistical inference inductive or deductive reasoning? What is the connection between statistics and the philosophy of science? Why do we care? The usual story: schools of statistical inference are sometimes linked to philosophical approaches. "Classical" statistics, as exemplified by Fisher's p-values and Neyman's hypothesis tests, is associated with a deductive, or Popperian, view of science: a […]