One of the new examples for the third edition of Bayesian Data Analysis is a spell-checking story. Here it is (just start at 2/3 down on the first page, with “Spelling correction”).

I like this example—it demonstrates the Bayesian algebra and also gives a sense of how probability models (both the “likelihood” and the “prior”) are constructed from existing assumptions and data. The models aren’t just specified as a mathematical exercise; they represent some statement about reality. And the problem is close enough to our everyday experience that we can consider ways in which the model can be criticized and improved, all in a simple example with only three possibilities.

If you’re interested in the problem, you might also check out Peter Norvig’s short essay on spelling correction. The model’s a simple example of Shannon’s 1949 noisy channel model.

You can find a slightly more linguistically nuanced description in the LingPipe spell-checking tutorial I wrote, along with reasonably scalable and configurable Java code. It also includes a discussion and implementation of auto-completion.

I also very much like this Toutanova and Moore paper which incorporates information about pronunciation into the channel model. And my favorite source model for this kind of thing is Frank Wood et al.’s sequence memoizer.

Bob: This is really interesting stuff. Do you know of any applications of this sort of approach to historical text documents? I ask out of self-interest, as I’ve been working with some historical documents in which there are both misspellings and alternative spellings and lots of dialect variation. This seems like it could be approached like a “spellcheck” problem.

I know very little about historical document text analysis. But these are bread-and-butter techniques in natural language processing, so I’d be surprised if they hadn’t been applied.

What problem are you trying to solve? If it’s just matching words, you can often do that pretty effectively with bags of n-grams, without having any frequency information to base decisions on, especially if you can put in some human coding to do it in a semi-supervised way.
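Here’s a minimal sketch of the kind of n-gram matching described above. The choices of character bigrams, edge padding, and Jaccard similarity are my own for illustration, not anything specified in the thread; the names are made up.

```python
def char_ngrams(word, n=2):
    """Bag of character n-grams, with a '#' pad marking word edges."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def jaccard(a, b):
    """Jaccard similarity between two n-gram bags, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Variant spellings of the same name overlap more than unrelated names,
# even with no frequency information at all.
print(jaccard(char_ngrams("Joensen"), char_ngrams("Johannesen")))
print(jaccard(char_ngrams("Joensen"), char_ngrams("Petersen")))
```

A semi-supervised workflow might threshold these scores, hand-verify the borderline pairs, and feed the confirmed matches back in as training data.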

Spell checkers, like speech recognizers, rely very heavily on having a good source model, i.e., a probability model for what’s expected in the language. One reason is that you can train it in a pretty unsupervised way from a pile of data. The channel model, i.e., the model of the errors made in spelling (or of the sounds produced in speech), is usually much weaker and less reliable on its own. Not to mention much harder to compute.
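The source/channel decomposition above is just Bayes’ rule: pick the intended word w maximizing P(w) · P(observed | w). A toy sketch, with word frequencies and a crude character-overlap channel model that are entirely invented for illustration (a real channel model would score specific insertions, deletions, and substitutions):

```python
# Toy language model P(w): invented word frequencies.
source = {"radio": 0.6, "radon": 0.3, "rodeo": 0.1}

def channel(observed, w):
    """Invented channel model P(observed | w): fraction of aligned
    characters that match. A stand-in for a real edit model."""
    matches = sum(a == b for a, b in zip(observed, w))
    return matches / max(len(observed), len(w))

def correct(observed):
    """Noisy-channel decoding: argmax over w of P(w) * P(observed | w)."""
    return max(source, key=lambda w: source[w] * channel(observed, w))

print(correct("radom"))
```

Note that with these numbers the decoder picks “radio” even though “radon” is a closer edit, which is exactly the point being made: the source model can do most of the heavy lifting.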

The structure of my problem is maybe unusual, but I’m hoping it’s tractable, since it’s focused on personal names. Here’s way more info than you asked for.

I want to estimate historical migration rates among towns in the Faroe Islands, using a few hundred years of parish records of births/baptisms, marriages, and deaths. You can see a similar document on the Wikipedia page about parish registers (http://en.wikipedia.org/wiki/Parish_register). In principle, there’s a lot of information about population movement, if I can assign identities to names that span different parishes.

But the names are sometimes spelled in different ways in different places and by different clerks (and handwriting quality varies). And there aren’t that many personal names, so sometimes cross-referencing a birth to a marriage is guesswork. Sometimes the clerk was Danish, so wrote the register in Danish name style, but another clerk might have written a phonetic Faroese name. Add on top of all this that there is substantial dialect variation in the Faroes, so if a person moves to a new town, it’s not unusual for their name at marriage to be spelled slightly differently than their name at birth.

I was thinking frequencies of names and knowledge of dialect shifts, in addition to knowledge of the spatial structure of the society, could go a long way toward quantifying uncertainty in identity assignments.
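One way that combination could look: a posterior over candidate identities for an observed record, with the prior carrying name frequency and parish proximity, and the likelihood carrying the spelling/dialect channel. Every name and number below is an invented placeholder, just to show the shape of the calculation.

```python
# Prior P(candidate): invented, standing in for name frequency and
# spatial proximity of parishes.
prior = {
    "Jogvan Joensen (Torshavn)": 0.5,
    "Jogvan Johannesen (Klaksvik)": 0.3,
    "Jakup Joensen (Torshavn)": 0.2,
}

# Likelihood P(observed spelling | candidate): invented, standing in for
# a spelling-variation channel model informed by dialect shifts.
likelihood = {
    "Jogvan Joensen (Torshavn)": 0.7,
    "Jogvan Johannesen (Klaksvik)": 0.1,
    "Jakup Joensen (Torshavn)": 0.05,
}

unnorm = {c: prior[c] * likelihood[c] for c in prior}
total = sum(unnorm.values())
posterior = {c: p / total for c, p in unnorm.items()}

for c, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2f}  {c}")
```

The payoff of keeping the whole posterior, rather than just the best match, is that downstream migration estimates can propagate the identity uncertainty instead of committing to possibly wrong links.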

Fascinating problem. I think you’re right that these kinds of techniques will work. They’ll work even better if you can provide some kind of prior characterization of the changes you expect to see.

I’ve never worked on something like this myself, but can give you some leads.

1. Peter Bol and Stu Shieber (Harvard East Asian languages and comp sci, respectively) worked on a similar problem for Chinese civil service historical records. They pulled out place names and linked them to locations to trace migration due to civil service over hundreds of years in China.

2. I was chatting with Geoff Nichols (Oxford Stats) at MCMski and he was telling me about a project he was working on to reconstruct language change using phylogenetic techniques that involved tracking spelling variations historically.

You could certainly do worse than asking these guys.

There is a huge amount of work on transliteration, most of which as far as I can tell is funded by DARPA to track Chinese and Arabic names when transliterated into English. Most of it uses these same kinds of techniques.

These applications of Bayes’ rule to assess and update relative frequencies of events are also available to a “non-Bayesian”. So there seems no disagreement there. As Wasserman emphasizes, there’s a difference between using Bayes’ rule and being a Bayesian about statistical inference.

Mayo:

Who’s talking about disagreement? Every method is available to everybody. If somebody wants to read our book and use our methods and not call him or herself a “Bayesian,” that’s fine with me. Whatevs.

In fact, most inference for these kinds of problems is not Bayesian, at least in the sense of treating parameters probabilistically and using posterior uncertainty in inference. They estimate with posterior modes, or what a “non-Bayesian” might call penalized maximum likelihood. But that’s mainly for computational reasons, not matters of principle.

I also find it deeply ironic that “naive Bayes” is almost never done with Bayesian inference, leading to all sorts of confusion in the machine learning literature and very confusing names like “Bayesian spam filtering”.

Yes, just to be clear: what I like about this example is not that it “sells” Bayes—at this point, I have enough examples of this sort, where Bayesian methods solve problems that were not easily solved in other ways—but rather because, as I wrote above, it demonstrates how Bayesian data analysis works in the context of probabilities (both the “likelihood” and the “prior”) that were constructed from data. I like having an intro example that goes beyond simple data models such as the binomial and simple prior information such as “p has to be somewhere between 0 and 1.”

Andrew, thanks for posting this excerpt (and great business decision ;) I’ve had your book on my Amazon wish list since publication, but was frustrated by CRC’s decision not to provide “Look Inside” views. (I almost never buy books completely on spec.) I liked the writing style & exposition, & immediately 1-clicked it!