Jeff pointed me to this article by Nick Berry. It’s kind of fun, but of course if you know your opponent will be following this strategy, you can figure out how to outwit it. Also, Berry writes that ETAOIN SHRDLU CMFWYP VBGKQJ XZ is the “ordering of letter frequency in English language.” Indeed this is the conventional ordering, but nobody thinks it’s right anymore. See here (with further discussion here). I wonder what corpus he’s using.

P.S. Klutz was my personal standby.

Somewhat relevant: Marek Hlavac wrote up a version of Hangman for Stata. You just download the .do files at https://sites.google.com/site/marekhlavac/stata_games

It’s not the frequency of a letter that matters most for hangman, but rather how easy it will be to guess the word once you find out it includes/excludes a given letter. That’s how I play, at any rate.

“Rhythm” is my favorite.

“Queueing” is good too.

“Coffee”

I agree with MAYO: even if you ignore that the opponent is trying to pose words that will defeat this type of strategy, and even if you decide that a greedy algorithm (which is almost certainly suboptimal) is practical from a computational point of view, you should take into account how much information will be obtained if your guess is wrong – an approach based on decision trees might be more convincing. And the best strategy is also a function of the number of guesses remaining.

Agreeing on a legal dictionary ahead of time is trickier than you might think, even in English. New words can be produced by adding suffixes and prefixes to existing words. For instance, check out a blog entry where I list dozens of words found in mainstream English newswire derived from the stem “author”, including words like “multiauthored” and “authoritativeness”. This is even trickier in languages like Russian (highly inflected) or Turkish (highly agglutinative).

Next, suppose you know the probability distribution of the words making up puzzles. Again, a really big step.

Then you have an obvious near-optimal greedy strategy. At each guess, look at the set of legal remaining words and find the letter that most nearly divides the probability mass in two. Guess that letter and continue.
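The greedy split can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the original article: the function name `best_split_letter` and the tiny word list are my own, and a uniform distribution over the candidates is assumed in place of real word probabilities:

```python
def best_split_letter(words):
    """Pick the letter whose presence/absence most nearly halves the
    candidate set (uniform weights stand in for word probabilities)."""
    letters = sorted(set("".join(words)))  # sorted for a deterministic tie-break

    def imbalance(letter):
        with_letter = sum(1 for w in words if letter in w)
        # 0 would be a perfect 50/50 split of the candidate set
        return abs(2 * with_letter - len(words))

    return min(letters, key=imbalance)

candidates = ["coffee", "toffee", "tomato", "potato", "rhythm"]
print(best_split_letter(candidates))  # splits the 5 words into groups of 2 and 3
```

A weighted version would sum each word’s probability instead of counting words, so that the guess bisects probability mass rather than the raw candidate list.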

Even this isn’t optimal. You need to overcome the greediness bias and unfold the entire game tree to choose an optimal set of guesses leading to a guaranteed minimal number of steps. All very finite, but combinatorially prohibitive for longer words (though there are probably heuristics, just like for Sudoku).


@Bob: That’s more or less what I had in mind, except remember that what matters is not the number of guesses but the number of _wrong_ guesses. So instead of roughly halving the probability mass, you want to make guesses that are very likely correct and provide lots of information when wrong. The amount of information gained when guessing right doesn’t matter much, because there is no cost. The original article is closer to this than to halving the probability mass.

What may be a tricky detail (making the greedy approach less suitable) is that one should not try to minimize the number of wrong guesses but rather maximize the probability of completing the word without exceeding N wrong guesses. Seems to me this encourages building the tree out to N levels (remembering that only wrong guesses increase the level – this makes the branching factor huge) rather than being greedy – the greedy approach may often be bad by risking a wrong guess in order to gain information which ends up being unimportant (e.g. strongly correlated with information you have to take a risk to gain later anyway).
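The “maximize the probability of finishing within N wrong guesses” objective can be made concrete with a toy exhaustive search. Everything here is my own sketch, not anyone’s production solver: `win_prob` is a hypothetical name, a uniform prior over a handful of candidate words is assumed, and the recursion is only tractable for tiny word lists:

```python
def win_prob(words, lives, guessed=frozenset()):
    """Best achievable probability of completing the hidden word within
    `lives` wrong guesses, given a uniform prior over `words`."""
    if len(words) == 1:
        return 1.0  # the revealed pattern pins down the word; no more risk
    best = 0.0
    for letter in set("".join(words)) - guessed:
        # A guess reveals the positions of the letter; candidates sharing a
        # pattern stay indistinguishable, so partition the words by pattern.
        groups = {}
        for w in words:
            positions = tuple(i for i, ch in enumerate(w) if ch == letter)
            groups.setdefault(positions, []).append(w)
        total = 0.0
        for positions, group in groups.items():
            if positions == ():        # letter absent: costs a life
                if lives == 0:
                    continue           # this branch loses outright
                p = win_prob(tuple(group), lives - 1, guessed | {letter})
            else:                      # letter present: free information
                p = win_prob(tuple(group), lives, guessed | {letter})
            total += len(group) / len(words) * p
        best = max(best, total)
    return best

# With one life to spare, this tiny puzzle is always winnable;
# with zero, the best first guess only saves a third of the cases.
print(win_prob(("cat", "cot", "dog"), 1))
print(win_prob(("cat", "cot", "dog"), 0))
```

Note that only the miss branch decrements `lives`, matching the point above that correct guesses are free – which is exactly why the branching factor of the N-level tree blows up.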

Where the problem becomes interesting is when N is large enough for full tree exploration to be intractable. It is not obvious what evaluation function to use when terminating the search at a non-leaf node. Number of possible words remaining is one option, but not necessarily best.

I’d say that we should connect this thread to our research discussions on the death penalty, but that would be in poor taste.