**Chess ratings** are all about change. Did your rating go up, did it go down, have you reached 2000, who’s hot, who’s not, and so on. If nobody’s abilities were changing, chess ratings would be boring: they’d be nothing but a noisy measure, and watching your rating change would be as exciting as watching a graph of mildly integrated white noise.

Ratings changes are interesting because the *signal* is interesting: players are getting better or worse.

But the standard (Elo) theory of ratings is, implicitly, based on the assumption that individual abilities are constant.

So, there you have it: a method whose main purpose is to study change is based on a static model.

This problem has been known for a long time; indeed, my grad-school friend Mark Glickman worked on it (PhD thesis: “Paired Comparison Models with Time Varying Parameters”; follow-up papers, “A Comprehensive Guide to Chess Ratings,” “A State-Space Model for National Football League Scores,” and “Dynamic paired comparison models with stochastic variances”). Mark is a chess master, a magician, and a master musician, and he generalized the paired-comparison model that underlies chess ratings to allow for changes in player abilities over time.

**Economic theory** has a similar story. Economic transactions represent local disequilibria: Person A sells object X to person B at price Y because A and B have different resources and preferences; once the object is sold, under the usual theory it will not be sold back. In that sense, economic transactions go “downhill,” and the economy would grind to a halt if new “energy” were not added into the system in the form of individuals moving, growing, being born and dying, and so forth.

This point is not new—I’m not claiming any special insight into economics here, nor am I claiming this is some sort of bold criticism of economic theory. It’s well known that classical economics is an equilibrium theory and is thus only approximate, partly because (of course) the world is never in equilibrium, but partly because if the world ever *could* be in equilibrium, economics would become largely irrelevant. Like the Elo rating system, classical economics is an equilibrium model that is of interest because the world is not in equilibrium.

And, as with the chess ratings, people realize this. Again, I’m claiming no special insight here.

I just wanted to point out this interesting feature of methods that are used to study change but are based on static models. In a sense, it’s impressive how effective a static model can be in such settings, even while it’s clear that we should be able to do better with models that explicitly incorporate nonstationarity.

I find this post confusing. I think you are confusing statics/dynamics and equilibrium/disequilibrium. Nothing in economic theory requires that things be static. There can be an equilibrium and it will adjust as circumstances change. Thus, we move from one equilibrium to another and the usual tools of economics try to trace and disentangle these paths – such as identifying the supply and demand curves from historical equilibrium points.

There are other schools of thought (notably the work of Kornai) that critique the emphasis on equilibrium models. Equilibrium models generally say little about the adjustment to equilibrium (dynamics and disequilibrium). Arguably, these may be more interesting and more important than the static equilibria that are more typically analyzed by economists. But both types of work exist and it is debatable which is more important or relevant.

I think it is your comments about the economy grinding to a halt in equilibrium or becoming irrelevant if the economy were ever in equilibrium that confuse these matters. In equilibrium things only grind to a halt until something changes – and it always does. This need not undermine the usefulness of equilibrium theories. However, it might, as Kornai and others have argued. The issue is whether the adjustments from one equilibrium to another are more or less important than the properties of the equilibrium points themselves. And, that may differ depending on circumstances. In markets where information is symmetric and readily available, analyzing equilibria may be appropriate. The more imperfect and asymmetric information becomes, the more the dynamics and transition are really of interest and not the equilibrium points themselves.

How this affects empirical work is beyond my limited capacities at this point in my career, but I think that is what you should be focusing on. It may well be that some types of empirical analyses are more or less appropriate depending on whether you are studying disequilibrium paths or using equilibrium points to estimate static functions that have been perturbed by exogenous factors. But I think the focus on “methods that are used to study change but are based on static models” largely misses the point.

Dale:

I can well believe I’m mixing up some terms and ideas here. I think the concepts discussed in the above post are related to each other, but I’m neither an expert nor well-read in these areas (except for the part about statistical models for time-varying parameters), so I’m just trying to emit some thoughts. I don’t see this post as any kind of definitive statement. Comments like yours are helpful.

“Person A sells object X to person B at price Y because A and B have different resources and preferences; once the object is sold, under the usual theory it will not be sold back. In that sense, economic transactions go “downhill,” and the economy would grind to a halt if new “energy” were not added into the system in the form of individuals moving, growing, being born and dying, and so forth.”

Maybe I’m missing something here. But for many, maybe even most objects, X will either be consumed (e.g. food) or will eventually wear out (e.g. clothing, cars, appliances), necessitating new transactions to replace or repair X. So I don’t see the economy grinding to a halt as described.

Yes, objects wear out and food is consumed, so unless new stuff is built and new food is grown — “energy is added to the system” — the system will stop. I don’t think Andrew is claiming any profound insights here.

Poland <3

https://pl.wikipedia.org/wiki/Glicko

Reminds me of a little physics problem from my chemical engineering education: a large cylindrical tub with a small drain pipe at the bottom is filled with water to a given height, and the instantaneous flow rate of water out of the drain pipe is to be calculated. A technically correct calculation would model height, pressure at the exit, and flow rate as time-varying; the static approximation is that the height of water is constant, and this gives a highly accurate answer if the height of water is barely changing.
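The tub example can be made concrete with a small sketch. It assumes an ideal fluid and Torricelli's law (exit speed `sqrt(2*g*h)`), which the comment does not state explicitly; the function names and the tank dimensions are made up for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def outflow_rate(height, pipe_area):
    """Volumetric flow (m^3/s) out of the drain under Torricelli's law."""
    return pipe_area * math.sqrt(2.0 * G * height)

def height_at(t, h0, tank_area, pipe_area):
    """Water height (m) after t seconds for an ideal draining cylinder.

    Solves dh/dt = -(pipe_area / tank_area) * sqrt(2 * G * h) exactly;
    sqrt(h) falls linearly in time until the tank is empty.
    """
    root = math.sqrt(h0) - (pipe_area / (2.0 * tank_area)) * math.sqrt(2.0 * G) * t
    return max(root, 0.0) ** 2

if __name__ == "__main__":
    h0, tank_area, pipe_area = 1.0, 1.0, 1e-4  # illustrative values only
    static = outflow_rate(h0, pipe_area)  # pretends the height never changes
    dynamic = outflow_rate(height_at(60.0, h0, tank_area, pipe_area), pipe_area)
    print(f"relative error of static answer after 60 s: {(static - dynamic) / static:.4f}")
```

With a drain this small relative to the tub, the static answer stays within a couple of percent of the time-varying one over the first minute, which is the sense in which the static approximation is "highly accurate."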

A more complex structure in a model does not necessarily lead to a more useful model, right? i.e. including non-equilibrium effects in a chess rating model is not guaranteed to make it any more accurate?

Rahul:

In the immortal words of Radford Neal,

Sometimes when a simple model outperforms a complex one, it may be time to give the complex model a shave with Occam’s razor?

Do we dwell too much on the philosophical structure of a model when we ought to focus on its empirical performance instead?

e.g. In the context of a post mentioning ELO ratings, what’s the baseline performance metric? What was the predictive performance of ELO in (say) predicting all competition games last year? What is it exactly that we seek to improve and how bad is the current model?

Rahul:

1. In the chess ratings example, the complex model outperforms the simple model. But if a simple model that has evident problems outperforms the complex model, then there is clearly a problem with the complex model and it should be fixed in some way, perhaps via soft constraints on parameters (i.e., informative prior distributions).

2. It’s Elo, not ELO. No big deal but Mr. Elo should get the credit here.

3. I’m no expert on this one. But, as I noted above, the Elo ratings are implicitly based on a model of unchanging abilities. A model that allows abilities to change should do better. This should show up as improved predictions, more sensible estimates, and perhaps less need for fudge factors to correct problems. The Elo system needs a bunch of fudges to keep it working, and some of these difficulties are analogous to the familiar age-period-cohort problem in demography.

To put it another way, the Glicko system was, I believe, motivated by various well-known empirical problems with the Elo system. It was not a case of art for art’s sake.

The core of my disagreement is with your #3. You seem to be judging the quality of a model on the basis of what goes into it & what the model itself looks like.

I would judge a model more on the fidelity of its outputs than the richness of its inputs & structure.

To me it is not at all self-evident that

“A model that allows abilities to change should do better”

Maybe it “should” but does it in practice?

PS. No offense intended Mister Arpad Elo.

PPS. When you say that in Chess

“complex ratings models outperform the simple”

do you have a specific model in mind? If so it might make sense to compare that specific model’s features / performance to Elo ratings.

Rahul:

As I wrote: “improved predictions, more sensible estimates, and perhaps less need for fudge factors to correct problems.” All of these relate to the outputs. Again, the point is not that the model should include various aspects for its own sake, the point is that the previously-existing method had problems, and it makes sense that these problems can be reduced by adding a dynamic component to a model.

To draw a simple analogy, suppose you had a linear model that was behaving poorly at the extremes, and suppose that in addition you had good theoretical reasons for thinking that the underlying relation is nonlinear. Then a natural step would be to move to a nonlinear model.

And, yes, of course Glickman has compared his rating system to Elo. That’s the point. This whole field is much more empirical than you seem to imagine. The improvements in the model are directly motivated by problems with the existing approach. And, of course, it is a tribute to the existing approach that it has been used extensively enough for these flaws to become apparent.

re. #2: Don’t listen to Andrew, “ELO” is just an acronym for “evaluating levels of.” Seriously, though, I wish Bugs [sic] was named after Mr. Bunny the way Stan is named after Mr. Ulam.

Also, there’s a question of what logically is likely to come first. To oversimplify, Elo was a simple and elegant model that worked well enough to get the various warring factions of chess to adopt it. And it had flaws, but these (a) were relatively minor compared with the metrics that existed before, and (b) didn’t all become apparent until the system was used.

If Glickman had come first, would his system have been widely adopted? Or did he need to stand on the shoulders of giants?

Similarly, equilibrium models are simpler and predate dynamic models — but I know too little about the history of economics to comment further.

Zbicyclist:

Agreed. In writing the above post, I’m not intending to say that classical economics or the Elo rating system is useless—far from it! Rather, I find it interesting that these two very useful frameworks have in common that they are static models built for the purpose of understanding and tracking changes.

Interesting post, thanks.

Given the title though I was hoping for more ‘quasi-’ prefixes…

On a more serious note, I’m not sure what Andrew means by saying that Elo is a static model. It seems to me that its major trait in practice is how scores are dynamically updated after matches. Many chess fan sites track the ratings of players over time with the understanding that abilities change.

From a modern (i.e., 1950s) perspective, Elo looks like the Robbins-Monro stochastic update algorithm (1951) applied to a (rescaled) Bradley-Terry model (1952).

I have no idea when Elo invented Elo, but if it was later than 1952, it’s another piece of evidence for Andrew’s claim that every model someone dreams up was already developed by a psychometrician in the 1950s.
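For readers who want to see the update being described, here is a minimal sketch of the textbook Elo rule. The function names and the value `k=20` are illustrative assumptions (real federations use various K-factors and extra rules); the 400-point base-10 logistic scale is the usual convention.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against B on the usual Elo scale.

    Equivalent to a Bradley-Terry model with strength 10 ** (rating / 400).
    """
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=20.0):
    """Return both players' new ratings after one game.

    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    The step k * (score - expected) is proportional to the gradient of the
    Bradley-Terry log-likelihood for this single game, so each update is a
    stochastic-gradient step with a constant step size.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

if __name__ == "__main__":
    print(elo_update(1500.0, 1500.0, 1.0))  # equal players, A wins
    print(expected_score(1900.0, 1500.0))   # a 400-point favorite
```

One design point worth noting: a Robbins-Monro scheme would normally shrink the step size over time so the estimate converges, whereas a constant K keeps the rating moving, which is part of why the procedure can track changing abilities even though the underlying model treats them as fixed.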

Bob:

Elo as applied is a dynamic *procedure* (as you say, ratings change over time), but its estimates can be derived from a static *statistical model* in which underlying abilities are fixed. Glickman’s point is that a dynamic *model* can yield improved inferences in the real dynamic world.

Thanks for the clarification. I understood that it’s described (at least on the Wikipedia) as a static Bradley-Terry model (though not using that terminology; Wikipedia’s not very good at connecting different definitions of the same thing).

But given the Elo procedure, couldn’t I come up with another model that it matched? Off the top of my head, I’d say it’s a way of estimating ability[t+1] given ability[t]. Of course, time isn’t really time, but number of games played. In that way, it looks like any other autoregressive model to me. Or something like how a Kalman filter is typically conceived (an update procedure over time rather than as a static hidden Markov model) or how sequential Monte Carlo attempts to fit general models with its updating procedure.

So how much better is Glickman’s rating system than Elo?

Or, to start at the basics, what’s a good quantitative metric to assess a rating system’s performance with?

There was a Kaggle competition to beat Elo at predicting matches. I don’t think you got a lot of covariate info other than which color pieces each player had. I entered a hierarchical Bradley-Terry model I built in BUGS (I was just learning stats back then, so it was pretty naive), but it didn’t do so well.

There’s a fundamental problem in the Elo model with the basics — white has an advantage and there can be ties. Black should get a bump up for a tie and white a bump down, and the bump up for winning as white should be less than winning as black.
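One way to sketch the fix is a Davidson-style extension of Bradley-Terry with a tie term and a first-move bonus. This is an illustration, not the model entered in the competition: the names `white_adv` and `nu` and their default values are hypothetical and unfitted.

```python
import math

def outcome_probs(r_white, r_black, white_adv=35.0, nu=1.0):
    """Return (P(white wins), P(draw), P(black wins)).

    white_adv is a rating bonus for having the white pieces and nu scales
    how likely draws are; both defaults are made up for illustration.
    """
    s_w = 10.0 ** ((r_white + white_adv) / 400.0)
    s_b = 10.0 ** (r_black / 400.0)
    tie = nu * math.sqrt(s_w * s_b)  # Davidson-style tie weight
    total = s_w + s_b + tie
    return s_w / total, tie / total, s_b / total

def white_expected_score(r_white, r_black, **kwargs):
    """Expected score for white, counting a draw as half a point."""
    p_win, p_draw, _ = outcome_probs(r_white, r_black, **kwargs)
    return p_win + 0.5 * p_draw

if __name__ == "__main__":
    e_white = white_expected_score(1500.0, 1500.0)
    k = 20.0  # illustrative step size
    print("white's change after a draw:", k * (0.5 - e_white))
    print("gain for winning as white:  ", k * (1.0 - e_white))
    print("gain for winning as black:  ", k * e_white)
```

Between equally rated players, white's expected score is above one half, so a draw nudges white down and black up, and a win with black earns more than a win with white, which matches the adjustments described above.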

Here’s the competition:

https://www.kaggle.com/c/chess

I was entry HiBa with a score of 0.72, #65 of 252 on the public leaderboard, which was scored by root mean square error; the winner scored 0.64. In more recent competitions, Kaggle’s moved to using log loss, which is what statistical models typically fit. I wonder what the leaderboard would’ve looked like under that eval.
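To make the two metrics concrete, here is a small sketch of both, computed per game on predicted scores in [0, 1]. It is not the competition's exact scoring rule, which is not spelled out here; the function names and the clipping constant `eps` are assumptions.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual scores."""
    n = len(actual)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n)

def log_loss(predicted, actual, eps=1e-15):
    """Mean cross-entropy; actual scores may be 0, 0.5 (draw), or 1."""
    total = 0.0
    for p, y in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(actual)

if __name__ == "__main__":
    outcomes = [1.0, 0.0, 0.5]
    cautious = [0.6, 0.4, 0.5]
    overconfident = [0.99, 0.99, 0.5]  # badly wrong on the second game
    for name, preds in [("cautious", cautious), ("overconfident", overconfident)]:
        print(name, rmse(preds, outcomes), log_loss(preds, outcomes))
```

The practical difference is that log loss grows without bound as a confident prediction turns out wrong, while squared error is capped at 1 per game, so the two metrics can rank the same entries differently.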

Here’s a description of the winning system:

http://blog.kaggle.com/wp-content/uploads/2011/02/kaggle_win.pdf

I was amused by the comment in the conclusion of the note that L-BFGS tends to overfit. It’s just an optimization procedure and you either find the optimum of the function you’re trying to optimize or you don’t. Overfitting has nothing to do with it. Technically, what’s going on is that early stopping in stochastic updates doesn’t actually fit the model in question, but performs a kind of ad-hoc regularization. This kind of procedure (and confusion with model fitting) is pretty widespread in machine learning, where, unlike Andrew, people are quite happy with procedures if they have good predictive properties.

Here’s the BUGS model I used:

Economics has a terminology problem. “Classical economics” refers to the economics from Adam Smith to Marx, before the idea of marginal utility and marginal product was developed around 1880. Before that, one of the very biggest questions was how to define “value” (e.g., the labor theory of value); after that, the question was seen as vacuous. People use “neoclassical economics” for modern price theory, but that’s an odd name, I think. So I don’t know what to call standard economics.

Similar phenomenon in the intellectual history of demography — the mid-20th century is full of great analyses with a “stable population model” as the starting point. I put that phrase in quotation marks because it’s a term of art referring to equilibrium in fertility and mortality rates. In reality, a stable population can be growing or shrinking. (For the constant-sized population with constant fertility and mortality rates, you get an even more simplified stationary-population model.)

During the 1970s, several demographers worked to extend the stable population model to cases where mortality rates fluctuate across time and age — i.e., the real world. Eventually, Ansley Coale and Sam Preston came up with a new synthesis (published in 1982 http://www.jstor.org/stable/2735961 ). Only cited 184 times, according to Google Scholar. Sometimes the citation gods do not really want a closer approximation to reality.