Of forecasts and graph theory and characterizing a statistical method by the information it uses

Wayne Folta points me to “EigenBracket 2012: Using Graph Theory to Predict NCAA March Madness Basketball” and writes, “I [Folta] have got to believe that he’s simply re-invented a statistical method in a graph-ish context, but don’t know enough to judge.”

I have not looked in detail at the method being presented here (I’m not much of a college basketball fan), but I’d like to use this as an excuse to make one of my favorite general point, which is that a good way to characterize any statistical method is by what information it uses.

The basketball ranking method here uses score differentials between teams in the past season. On the plus side, that is better than simply using one-loss records (which (a) discards score differentials and (b) discards information on who played whom). On the minus side, the method appears to be discretizing the scores (thus throwing away information on the exact score differential) and doesn’t use any external information such as external ratings.

Anyway, my point is that the writeup of the method focuses on statistical operations (forming a matrix of a graph, computing eigensomethingorothers), and, sure, something like that is necessary, but to me, what’s interesting is to know what information went into the rankings.

P.S. If I wanted to use the information that this guy was using, I’d probably just fit a simple normal linear model with a latent parameter for each team.
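
A minimal sketch of the kind of model the P.S. describes, not the author's actual code: each game's score differential is treated as the difference of two team parameters plus normal noise, and the parameters are fit here by ordinary least squares with NumPy. The team names, the scores, and the function name `fit_team_strengths` are made up for illustration. A fuller version would probably put a prior on the team parameters and add a home-court term.

```python
import numpy as np


def fit_team_strengths(games, teams):
    """Fit diff = strength[a] - strength[b] + noise by least squares.

    games: list of (team_a, team_b, score_a_minus_score_b) tuples.
    teams: list of team names.
    Returns a dict mapping team name to estimated strength (mean zero).
    """
    index = {team: k for k, team in enumerate(teams)}
    X = np.zeros((len(games), len(teams)))
    y = np.zeros(len(games))
    for row, (a, b, diff) in enumerate(games):
        X[row, index[a]] = 1.0   # team a's strength adds to the differential
        X[row, index[b]] = -1.0  # team b's strength subtracts from it
        y[row] = diff
    # Only differences of strengths are identified, so X is rank deficient;
    # lstsq returns the minimum-norm least-squares solution.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta = beta - beta.mean()  # report strengths centered at zero
    return dict(zip(teams, beta))


# Hypothetical data: three teams, three games, exact score differentials.
teams = ["A", "B", "C"]
games = [("A", "B", 10.0), ("B", "C", 5.0), ("A", "C", 15.0)]
strengths = fit_team_strengths(games, teams)

for team in sorted(strengths, key=strengths.get, reverse=True):
    print(team, round(strengths[team], 2))
```

The predicted differential for any pairing is then just the difference of the two fitted strengths, which uses the exact margins and the who-played-whom information but nothing external.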

1. Hyokun Yun says:

If you use a normal linear model with a latent parameter, wouldn’t that be an Elo rating? http://en.wikipedia.org/wiki/Elo_rating_system

• Andrew says:

Hyokun:

Nope. The Elo rating (that is, the Bradley-Terry model) is a logistic regression for discrete outcomes. Here I think it’s better to use the (essentially) continuous data of exact score differentials, hence a normal or some other continuous model.
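
To make the contrast concrete, here is one way the two models could be written side by side. The symbols are assumptions introduced for illustration: a_i is the latent strength of team i, y_{ij} is the score differential when team i plays team j, and sigma is the game-level standard deviation.

```latex
% Bradley-Terry: models only who won (a discrete outcome)
\Pr(i \text{ beats } j) = \operatorname{logit}^{-1}(a_i - a_j)
                        = \frac{1}{1 + e^{-(a_i - a_j)}}

% Normal linear model: models the score differential itself
y_{ij} \sim \mathrm{N}(a_i - a_j,\ \sigma^2)
```

Both use the same difference of latent parameters; they differ in which piece of the game result enters the likelihood.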

2. Jonathan (another one) says:

The use of score differentials in rating systems used by the NCAA is controversial because it is felt that teams will ‘run up the score’ to improve their ratings. Thus, score differentials are barred in any of the computer systems used in the BCS polls. Of course, this doesn’t stop anyone from using them to get more accurate ratings than the NCAA ratings, and lots of people do.

3. Phil says:

Andrew, your comment about not looking closely at the method because you’re not very interested in college basketball is almost a non sequitur given your interest in statistical methods. Lots of kids read your blog; don’t encourage them to learn about statistics only if they’re interested in the specific application!

• Andrew says:

Hey, kids. Be cool—stay in school.

4. Guy Srinivasan says:

“one of my favorite general point”

Can we get a post on quick summaries of your current favorite general points?

5. I happen to know that the method you suggest works well for the ponies.

6. bob says:

Does “one-loss records” refer to a particular performance statistic that I’m not familiar with (and which would not apply, presumably, only to teams with either no losses or more than one loss), or is this a simple homonym-based “thinko”?

7. kjetil halvorsen says:

About graph theory and statistics: I don’t know if this is relevant to the original question here, but Jan de Leeuw has written a lot about graph-theoretical models for correspondence analysis.

This is interesting because it makes connections between apparently unrelated things. Some computer folks are interested in drawing graphs; see http://www.graphdrawing.org/

Correspondence analysis can be seen as a graph-drawing problem, using least squares to get an effective algorithm. Lots of computer people instead use the L_1 norm and get a difficult problem, which they try to show is NP or whatever.

Statisticians are more practical, changing the norm to get a solvable problem!