During the 1970s, several demographers worked to extend the stable population model to cases where mortality rates fluctuate across time and age — i.e., the real world. Eventually, Ansley Coale and Sam Preston came up with a new synthesis (published in 1982 http://www.jstor.org/stable/2735961 ). Only cited 184 times, according to Google Scholar. Sometimes the citation gods do not really want a closer approximation to reality.

There was a Kaggle competition to beat Elo at predicting matches. I don't think you got a lot of covariate info other than which color pieces each player had. I entered a hierarchical Bradley-Terry model I built in BUGS (I was just learning stats back then, so it was pretty naive), but it didn't do so well.

There's a fundamental problem with the basics of the Elo model: white has an advantage, and there can be ties. In a tie, black should get a bump up and white a bump down, and the bump for winning as white should be smaller than the bump for winning as black.

Here’s the competition:

https://www.kaggle.com/c/chess

My entry, HiBa, scored 0.72, which was #65 of 252 on the public leaderboard, whereas the winner scored 0.64; the leaderboard was based on root mean square error. In more recent competitions, Kaggle has moved to log loss, which is what statistical models typically optimize. I wonder what the leaderboard would've looked like under that evaluation.
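To make the metric question concrete, here's a minimal sketch (with made-up numbers, not the actual competition data) of how RMSE and log loss score the same probabilistic predictions. Log loss punishes confident wrong predictions much more harshly than RMSE does, which is why the two evaluations can rank entries differently.

```python
import math

# Hypothetical predicted win probabilities and 0/1 outcomes.
predictions = [0.9, 0.6, 0.2, 0.8]
outcomes    = [1,   0,   0,   1]

# Root mean square error of the probabilities against the outcomes.
rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, outcomes))
                 / len(outcomes))

# Average negative log-likelihood of the outcomes (clipped to avoid log(0)).
eps = 1e-15
log_loss = -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for p, y in zip(predictions, outcomes)) / len(outcomes)
```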

Here’s a description of the winning system:

http://blog.kaggle.com/wp-content/uploads/2011/02/kaggle_win.pdf

I was amused by the comment in the conclusion of the note that L-BFGS tends to overfit. L-BFGS is just an optimization procedure: you either find the optimum of the function you're trying to optimize or you don't. Overfitting has nothing to do with it. Technically, what's going on is that early stopping in stochastic updates doesn't actually fit the model in question but performs a kind of ad hoc regularization. This kind of procedure (and its confusion with model fitting) is pretty widespread in machine learning, where, unlike Andrew, people are quite happy with procedures as long as they have good predictive properties.
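A toy illustration of that point (my own example, nothing to do with the winning entry): early stopping doesn't find the optimum of the stated objective; it leaves the estimate shrunk toward its starting point, which acts like an ad hoc regularizer.

```python
def gradient_descent(steps, lr=0.1, w0=0.0, target=5.0):
    # Minimize (w - target)^2 by plain gradient descent starting from w0.
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

w_early = gradient_descent(steps=5)    # stopped early: shrunk toward w0 = 0
w_full  = gradient_descent(steps=500)  # run to convergence: near the optimum
```

Run to convergence, the procedure fits the stated objective (w near 5); stopped early, it returns something between the initialization and the optimum, i.e., a regularized estimate that no explicit objective was ever written down for.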

Here’s the BUGS model I used:

model {
  delta ~ dnorm(0, 1) I(0, )    # log draw weight (truncated positive)
  gamma ~ dnorm(0, 1) I(0, )    # white's first-move advantage
  tau ~ dnorm(0, 1) I(0, )      # precision of the player abilities
  for (j in 1:J) {
    alpha[j] ~ dnorm(0, tau)    # latent ability of player j
  }
  for (n in 1:N) {
    # Unnormalized weights for white win, black win, and draw.
    qw[n] <- exp((alpha[W[n]] + gamma) - alpha[B[n]])
    qb[n] <- exp(alpha[B[n]] - (alpha[W[n]] + gamma))
    qd[n] <- exp(delta)
    z[n] <- qw[n] + qb[n] + qd[n]
    # Normalize; the min/max clamps guard against rounding error.
    p[n,1] <- min(1.0, max(0.0, qw[n] / z[n]))
    p[n,2] <- min(1.0, max(0.0, qb[n] / z[n]))
    p[n,3] <- min(1.0, max(0.0, qd[n] / z[n]))
    y[n] ~ dcat(p[n,])          # outcome: 1 = white win, 2 = black win, 3 = draw
  }
}
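For readers who don't know BUGS, here's the likelihood part of that model transcribed into plain Python (a sketch of the same three-outcome Bradley-Terry formulation, not a replacement for the fitting):

```python
import math

def outcome_probs(alpha_white, alpha_black, gamma, delta):
    """Win/lose/draw probabilities for one game.

    gamma is white's first-move advantage; delta is the log draw weight.
    """
    qw = math.exp((alpha_white + gamma) - alpha_black)  # white-win weight
    qb = math.exp(alpha_black - (alpha_white + gamma))  # black-win weight
    qd = math.exp(delta)                                # draw weight
    z = qw + qb + qd
    return qw / z, qb / z, qd / z  # P(white wins), P(black wins), P(draw)

# Two equally able players: white's edge shows up only through gamma.
pw, pb, pd = outcome_probs(0.0, 0.0, gamma=0.1, delta=0.0)
```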

Thanks for the clarification. I understood that it's described (at least on Wikipedia) as a static Bradley-Terry model, though not using that terminology; Wikipedia's not very good at connecting different definitions of the same thing.

But given the Elo procedure, couldn’t I come up with another model that it matched? Off the top of my head, I’d say it’s a way of estimating ability[t+1] given ability[t]. Of course, time isn’t really time, but number of games played. In that way, it looks like any other autoregressive model to me. Or something like how a Kalman filter is typically conceived (an update procedure over time rather than as a static hidden Markov model) or how sequential Monte Carlo attempts to fit general models with its updating procedure.

Or, to start at the basics, what's a good quantitative metric for assessing a rating system's performance?

Bob:

Elo as applied is a dynamic *procedure*—as you say, ratings change over time—but its estimates can be derived from a static *statistical model* in which underlying abilities are fixed. Glickman’s point is that a dynamic *model* can yield improved inferences in the real dynamic world.

From a modern (i.e., 1950s) perspective, Elo looks like the Robbins-Monro stochastic update algorithm (1951) applied to a (rescaled) Bradley-Terry model (1952).
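To spell that connection out, here's a minimal sketch of the standard Elo update written as a Robbins-Monro-style stochastic step on the Bradley-Terry (logistic) win probability. The scale of 400 and K-factor of 32 are the conventional Elo constants, not anything from Robbins and Monro.

```python
def expected_score(r_a, r_b, scale=400.0):
    # Bradley-Terry win probability on the Elo rating scale:
    # logistic in the rating difference.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update: a step of size K in the direction of the
    (observed - expected) score, i.e., the gradient of the Bradley-Terry
    log-likelihood up to scaling. score_a is 1 (win), 0.5 (draw), 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two equal players, A wins: A gains what B loses.
ra, rb = elo_update(1500.0, 1500.0, 1.0)
```

The fixed K is exactly what makes it look like a stochastic-approximation scheme with a constant step size rather than a full model fit.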

I have no idea when Elo invented Elo, but if it was later than 1952, it’s another piece of evidence for Andrew’s claim that every model someone dreams up was already developed by a psychometrician in the 1950s.

Given the title, though, I was hoping for more "quasi-" prefixes…

re. #2: Don't listen to Andrew, "ELO" is just an acronym for "evaluating levels of." Seriously, though, I wish Bugs [sic] was named after Mr. Bunny the way Stan is named after Mr. Ulam.

Rahul:

As I wrote: "improved predictions, more sensible estimates, and perhaps less need for fudge factors to correct problems." All of these relate to the outputs. Again, the point is not that the model should include various aspects for their own sake; the point is that the previously existing method had problems, and it makes sense that these problems can be reduced by adding a dynamic component to the model.

To draw a simple analogy, suppose you had a linear model that was behaving poorly at the extremes, and suppose that in addition you had good theoretical reasons for thinking that the underlying relation is nonlinear. Then a natural step would be to move to a nonlinear model.

And, yes, of course Glickman has compared his rating system to Elo. That’s the point. This whole field is much more empirical than you seem to imagine. The improvements in the model are directly motivated by problems with the existing approach. And, of course, it is a tribute to the existing approach that it has been used extensively enough for these flaws to become apparent.

The core of my disagreement is with your #3. You seem to be judging the quality of a model on the basis of what goes into it and what the model itself looks like.

I would judge a model more on the fidelity of its outputs than on the richness of its inputs and structure.

To me it is not at all self-evident that *“A model that allows abilities to change should do better”* Maybe it *“should”* but does it in practice?

PS. No offense intended, Mister Arpad Elo.

PPS. When you say that in Chess *“complex ratings models outperform the simple”* do you have a specific model in mind? If so it might make sense to compare that specific model’s features / performance to Elo ratings.

Rahul:

1. In the chess ratings example, the complex model outperforms the simple model. But if a simple model that has evident problems outperforms the complex model, then there is clearly a problem with the complex model and it should be fixed in some way, perhaps via soft constraints on parameters (i.e., informative prior distributions).

2. It’s Elo, not ELO. No big deal but Mr. Elo should get the credit here.

3. I’m no expert on this one. But, as I noted above, the Elo ratings are implicitly based on a model of unchanging abilities. A model that allows abilities to change should do better. This should show up as improved predictions, more sensible estimates, and perhaps less need for fudge factors to correct problems. The Elo system needs a bunch of fudges to keep it working, and some of these difficulties are analogous to the familiar age-period-cohort problem in demography.

To put it another way, the Glicko system was, I believe, motivated by various well-known empirical problems with the Elo system. It was not a case of art for art’s sake.

Sometimes when a simple model outperforms a complex one, it may be time to give the complex model a shave with Occam's razor?

Do we dwell too much on the philosophical structure of a model when we ought to focus on its empirical performance instead?

e.g. In the context of a post mentioning Elo ratings, what's the baseline performance metric? What was the predictive performance of Elo in (say) predicting all competition games last year? What is it exactly that we seek to improve, and how bad is the current model?

Zbicyclist:

Agreed. In writing the above post, I’m not intending to say that classical economics or the Elo rating system is useless—far from it! Rather, I find it interesting that these two very useful frameworks have in common that they are static models built for the purpose of understanding and tracking changes.

Also, there's a question of what logically is likely to come first. To oversimplify, Elo was a simple and elegant model that worked well enough to get the various warring factions of chess to adopt it. And it had flaws, but these (a) were relatively minor compared to the metrics that existed before, and (b) didn't all become apparent until the system was in use.

If Glickman had come first, would his system have been widely adopted? Or did he need to stand on the shoulders of giants?

Similarly, equilibrium models are simpler and predate dynamic models — but I know too little about the history of economics to comment further.

Rahul:

In the immortal words of Radford Neal,

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

Yes, objects wear out and food is consumed, so unless new stuff is built and new food is grown — “energy is added to the system” — the system will stop. I don’t think Andrew is claiming any profound insights here.

https://pl.wikipedia.org/wiki/Glicko

Dale:

I can well believe I’m mixing up some terms and ideas here. I think the concepts discussed in the above post are related to each other, but I’m neither an expert nor well-read in these areas (except for the part about statistical models for time-varying parameters), so I’m just trying to emit some thoughts. I don’t see this post as any kind of definitive statement. Comments like yours are helpful.

Maybe I'm missing something here. But for many, maybe even most objects, X will either be consumed (e.g. food) or will eventually wear out (e.g. clothing, cars, appliances), necessitating new transactions to replace or repair X. So I don't see the economy grinding to a halt as described.

There are other schools of thought (notably the work of Kornai) that critique the emphasis on equilibrium models. Equilibrium models generally say little about the adjustment to equilibrium (dynamics and disequilibrium). Arguably, these may be more interesting and more important than the static equilibria that are more typically analyzed by economists. But both types of work exist and it is debatable which is more important or relevant.

I think it is your comments about the economy grinding to a halt in equilibrium or becoming irrelevant if the economy were ever in equilibrium that confuse these matters. In equilibrium things only grind to a halt until something changes – and it always does. This need not undermine the usefulness of equilibrium theories. However, it might, as Kornai and others have argued. The issue is whether the adjustments from one equilibrium to another are more or less important than the properties of the equilibrium points themselves. And, that may differ depending on circumstances. In markets where information is symmetric and readily available, analyzing equilibria may be appropriate. The more imperfect and asymmetric information becomes, the more the dynamics and transition are really of interest and not the equilibrium points themselves.

How this affects empirical work is beyond my limited capacities at this point in my career, but I think that is what you should be focusing on. It may well be that some types of empirical analyses are more or less appropriate depending on whether you are studying disequilibrium paths or using equilibrium points to estimate static functions that have been perturbed by exogenous factors. But I think the focus on "methods that are used to study change but are based on static models" largely misses the point.
