Some open questions in chess

– When played optimally, is a pawn race a win for white, for black, or a draw?

– Could I beat a grandmaster if he were down a queen? I tried playing Stockfish with it down a Q and a R, and I won easily. (Yeah!) I suspect I could beat it starting up just a queen. But Stockfish plays assuming I’ll play optimally. Essentially it’s just trying to lose gracefully. In contrast, I’m pretty sure the grandmaster would know I’m not good, so he could just complicate the position and tear me apart. Perhaps someone’s already written a chess program called Patzer that plays kinda like me, and then another program called Hustler that can regularly defeat Patzer.

– If I was playing a top player, and my only goal was to last as many moves as possible, and his goal was to beat me in as few moves as possible, how long could I last?

– In bughouse, which is better, 4 bishops or 4 knights?

– How best to set up maharajah to make it balanced? Maybe the non-maharajah team isn’t allowed to promote and they have to win in some prespecified number of moves?

– And then there are the obvious ones:
1. Under perfect play, is the game a draw?
2. How much worse is the best current computer, compared to perfect play?
3. Could a human ever train to play the best computer to a draw? If not, how much worse, etc.?

– Do you have any good open questions in chess? If so, please share them in the comments.

We were discussing such questions after seeing this amusing article by Tom Murphy on the longest chess game (“There are jillions of possible games that satisfy the description above and reach 17,697 moves; here is one of them”) and also this fun paper comparing various wacky chess engines such as “random_move,” “alphabetical,” and “worstfish,” which begins:
“CCS Concepts: • Evaluation methodologies → Tournaments; • Chess → Being bad at it;
Additional Key Words and Phrases: pawn, horse, bishop, castle, queen, king.” I’ve been told that there’s an accompanying video, but I never have the patience to watch videos.

One thing, though. Murphy writes, “Fiddly bits aside, it is a solved problem to maintain a numeric skill rating of players for some game (for example chess, but also sports, e-sports, probably also z-sports if that’s a thing). Though it has some competition (suggesting the need for a meta-rating system to compare them), the Elo Rating System is a simple and effective way to do it.” I know it’s a jokey paper, but I just want to remind you all that these rating systems are not magic: you can think of them as mathematical algorithms or as fits of statistical models, but in any case they don’t always make sense, as can be seen from a simple hypothetical example in which program A always beats B, B always beats C, and C always beats A. Murphy actually discusses this sort of example in his article, so he’s aware of the imperfections of ratings. I just wanted to bring this one up because chess rating is an example of the fallacy of measurement, by which people think that when something is measured, it must represent some underlying reality.
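
To see the intransitivity problem concretely, here’s a toy simulation (just an illustrative sketch, nothing from Murphy’s paper): if program A always beats B, B always beats C, and C always beats A, an Elo-style updater ends up with all three ratings hovering around their starting value, so the ratings call the three programs equally strong even though every individual matchup is completely one-sided.

```python
import random

def expected_score(r_a, r_b):
    # Standard Elo expected score for the first player
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def winner(a, b):
    # Deterministic, intransitive results: A beats B, B beats C, C beats A
    beats = {("A", "B"), ("B", "C"), ("C", "A")}
    return a if (a, b) in beats else b

ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
K = 16  # usual Elo update step size
random.seed(0)

for _ in range(30000):
    a, b = random.sample(["A", "B", "C"], 2)
    s_a = 1.0 if winner(a, b) == a else 0.0
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))

print(ratings)  # all three stay near 1500; the cycle is invisible in the ratings
```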

P.S. Also don’t forget Tim Krabbé’s chess records page, which unfortunately hasn’t been updated for over three years (at the time of this writing). Chess games can’t be copyrighted, so the youngest professor nationwide could collect some of this material in a book and put his name on it!

38 thoughts on “Some open questions in chess”

    • Phil:

      I guess that losing to Nakamura at queen odds would be kinda fun, just to see how he does it. The next question is: could he beat me at queen odds consistently, or could I settle into some equilibrium where I’d be able to figure out how to preserve the advantage and grind him down?

    • Jaime:

      I’m thinking of the pawn race game, where each side starts out with 8 pawns in standard position and nothing else, then you take turns moving and whoever promotes first wins.

  1. How far from perfect are current programs? The top ones still manage to lose to each other, right? But the question is interesting because brute force always encounters a horizon, and evaluation at that point is always inexact. I guess the question comes down to whether that inexactness can ever be reduced to a level at which there’s no practical significance. For instance, strong players have a sense of the “draw margin”, the degree of inequality that still permits the player with an inferior position to draw the game. (This margin is highly sensitive to the nature of the position on the board — major piece ending, bishops of opposite colors, rook endings down a pawn, etc.) The point is that there are just three outcomes to a game of chess, not a continuum, and perfection can either be practical or absolute. But if all we have are games between superhuman programs to assess move quality, we won’t know the difference between the two.

    • > brute force always encounters a horizon

      I don’t think that’s obviously true. Many games that probably seemed intractable a decade or two ago are now solved (i.e., brute-forced to the last move with some clever hacks), most notably checkers. So maybe solving chess is a possibility if someone figures out how to do search efficiently on a GPU or something similar.

        • The wiki link says that an upper bound on the number of positions is 10^47 and the number of sensible games is around 10^40. That’s a lot, but with some breakthroughs in pruning (best-case alpha-beta means halving the exponent) it could be borderline manageable. The wiki article on solving chess https://en.m.wikipedia.org/wiki/Solving_chess states clearly that solving chess in the near future is considered unlikely, but there are no convincing arguments that it is impossible.

        • Phil:

          Alpha-beta pruning has been around long enough that I think it can be considered brute force, in the same sense that using a computer is brute force even though it uses hi-tech transistors, or solving a math problem using calculus is brute force. Etc.

        • In conventional computer science lingo, “brute force” means an exhaustive search. Here’s what Wikipedia says: “In computer science, brute-force search or exhaustive search, also known as generate and test, is a very general problem-solving technique and algorithmic paradigm that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem’s statement.”

          Using a pruning algorithm is, obviously, not exhaustive: in principle, at least, you have to worry that your algorithm prunes something it shouldn’t.

        • Phil:

          But alpha-beta pruning is exhaustive: it’s not a probabilistic algorithm; it only eliminates options that are strictly dominated according to the defined objective function.

        • But what is the objective function? If you are doing brute force, the objective function has (something like) a value of 1 if you win, 0 if you draw, -1 if you lose. If your objective function isn’t that — if you add 0.3 for a passed pawn and 1.2 for connected passed pawns, and 0.7 if you have the two bishops, and -0.8 if your opponent has a knight which has a potential outpost in your camp, and so on — then you aren’t doing brute force.

          You have to play all the way to checkmate — or to a position from which a forced checkmate is known, or a known draw, or a draw by insufficient material or repeated moves or the 50-move rule — or you aren’t doing brute force.

        • Phil:

          Sure. My point is that alpha-beta pruning does not make anything less of a brute-force algorithm. Alpha-beta pruning and full tree search will both give you the same minimax value for the given objective function. If you want to define it as being brute force only if you use the -1/0/1 objective function, then that’s fine—and an alpha-beta-pruning algorithm that optimizes over the -1/0/1 objective function is also brute force.
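
          Here’s a toy sketch of what I mean (my own illustrative code, nothing to do with any real chess engine): on a random game tree whose leaves are scored -1/0/+1, plain minimax and alpha-beta return exactly the same value, because the pruning only skips branches that provably cannot change the answer.

          ```python
          import random

          def make_tree(depth, branching, rng):
              # Random game tree; each leaf gets a win/draw/loss value in {-1, 0, +1}
              if depth == 0:
                  return rng.choice([-1, 0, 1])
              return [make_tree(depth - 1, branching, rng) for _ in range(branching)]

          def minimax(node, maximizing):
              # Full tree search: visits every node
              if not isinstance(node, list):
                  return node
              values = [minimax(child, not maximizing) for child in node]
              return max(values) if maximizing else min(values)

          def alphabeta(node, maximizing, alpha=-2, beta=2):
              # Same objective, but skips siblings that cannot affect the result
              if not isinstance(node, list):
                  return node
              if maximizing:
                  best = -2
                  for child in node:
                      best = max(best, alphabeta(child, False, alpha, beta))
                      alpha = max(alpha, best)
                      if alpha >= beta:
                          break
                  return best
              best = 2
              for child in node:
                  best = min(best, alphabeta(child, True, alpha, beta))
                  beta = min(beta, best)
                  if alpha >= beta:
                      break
              return best

          rng = random.Random(1)
          tree = make_tree(depth=8, branching=3, rng=rng)
          assert minimax(tree, True) == alphabeta(tree, True)  # identical value at the root
          print(minimax(tree, True))
          ```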

        • I understand alpha-beta pruning.

          If you only prune a branch when you’ve reached a proven win, loss, or draw, then sure, that’s still brute force. But how much does that save you compared to no pruning at all? The endgame tablebase is very large in absolute terms but not big enough to make a substantial difference in reducing the tree search you would have to do if you want to search every possible chess game.

  2. Is the increase in draws in top engine play (as the engines get better) in some sense statistical evidence for the game being a draw with perfect play?

  3. An anecdote from the world of professional Go (pre-AlphaGo): a top Japanese pro was asked how many stones handicap he would need if he were playing God (who always makes optimal moves). He thought three stones. But he said that he would need six stones if playing the Devil, who always makes the trickiest moves.

  4. I am reminded of a fun app I play sometimes called Really Bad Chess (http://www.reallybadchess.com/). It essentially keeps the AI constant and adjusts the difficulty by randomizing the pieces (subject to some constraints) such that easier difficulties get more powerful pieces. On the easiest difficulty you might end up with four rooks and two queens against a bunch of knights and bishops.

  5. Elo is simple. But, there are better rating systems. For example,

    https://www.ratingscentral.com/HowItWorks.php

    http://www.glicko.net/glicko.html

    If you just have a fixed data set, I’d pick a model and find the posterior mode (rather than using a rating system).

    I think Tom Murphy probably shouldn’t have used the Elo update formula. He notes the problems that he had with it in section 3. If you can have the computers play each other enough to get a good estimate of the probability that each beats the other, you can just convert the probabilities to ratings (see the sketch at the end of this comment). Or, use the posterior mode.

    Lack of transitivity isn’t really a problem for rating systems in practice with humans. The simple linear model for playing strength is pretty close to reality. While lack of transitivity does occur, the players involved are probably pretty close in playing strength, and you probably won’t have enough data for the lack of transitivity to affect the ratings. There are bigger problems for rating systems.
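
    To make the probabilities-to-ratings conversion mentioned above concrete, here is a small sketch (an illustration using the standard Elo curve, not the Ratings Central model): under the Elo curve, P(A beats B) = 1/(1 + 10^((R_B - R_A)/400)), so an observed head-to-head win probability maps directly to a rating gap.

    ```python
    import math

    def rating_gap(p_a_beats_b):
        # Rating difference R_A - R_B implied by P(A beats B) under the Elo curve
        return 400.0 * math.log10(p_a_beats_b / (1.0 - p_a_beats_b))

    for p in (0.55, 0.75, 0.95, 0.99):
        print(f"P(A beats B) = {p:.2f}  ->  gap of about {rating_gap(p):.0f} points")
    ```

    For example, a win probability of 0.75 corresponds to a gap of roughly 191 points under this curve, close to the usual rule of thumb that 200 points is a class difference.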

    • David:

      I can’t speak for Murphy, but in general it’s my impression that a lot of people don’t realize that ability rating is a statistical estimation problem. People just think the ratings are there and represent some real thing, as is illustrated by Murphy’s statement that “it is a solved problem to maintain a numeric skill rating.”

      • Indeed. People think ratings are easy to calculate. Most players will believe whatever number they are told. On the other hand, I do have players contact me and tell me the rating system calculated their rating wrong. They may say that they should have gained more points or another player gained more points than them for a similar result.

        • David:

          As discussed here, I think there’s an inherent tension in any rating system, as on one hand it’s usually set up as if ratings represent a true underlying ability (“Hey, I just beat a player rated 2100!”), but on the other hand a key goal of ratings, especially at the personal level, is to track changes (improvement, ideally). Time series models such as Glickman’s address this statistically, but I think this tension is still there in the interpretation of the numbers.

        • I’m not sure why you think this is different from lots of other estimation problems. For example, Kalman filter models for navigation systems will have time-varying quantities that you want to estimate. (My day job involved Kalman filters for twenty-two years.)

          Mark’s model is essentially the same as the model in my paper:

          https://www.ratingscentral.com/Doc/NewTTRS.pdf

          The form of the model is pretty obvious (a normal random walk for the time-varying part), but you do need to figure out the right parameters. The harder problem is how to use it in a rating system. (A generic sketch of this kind of model appears at the end of this comment.)

          In my system and Mark’s system, the rating is an estimate of the player’s current playing strength (no “tension” in the interpretation). So, you can track how players change with time, e.g.,

          https://www.ratingscentral.com/HistoryGraph.php?PlayerID=126516

          One of the complaints knowledgeable people had (have) about the USA Table Tennis rating system (which is sort of like Elo) is that it didn’t (doesn’t) accurately track the improvement of juniors. Fixing this problem was a primary goal when I developed my system (when I was on the USA Table Tennis Rating Committee).

          In 2019, I added a Poisson jump term to the Ratings Central model. This helps track improving players, and also fixes some deflation that was in the system.

          I don’t consider Elo to be a statistical system (although there are papers in the statistics journals on Elo-type systems) because the update rule is just pulled out of the air. As you noted, there is nothing in the model about how ratings change in time. So, you are left to fiddle with the parameters to try to get it to track players as they change without making the ratings too volatile for the players who are not changing.

          The main advantage of models like mine and Mark’s is that the system automatically keeps track of how much information it has on each player. So, new players and rapidly improving players don’t need to take points from established players to get their ratings to increase.
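
          Here is a generic sketch of the kind of model I have in mind (an illustration only, not the actual Ratings Central model, whose details differ): latent playing strength follows a normal random walk across rating periods, and each game result is a noisy win/loss observation of the strength difference. The rating system’s job is then to recover the latent strength from those observations, e.g., with a Kalman-style or Bayesian filter.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          n_periods = 50
          drift_sd = 15.0  # assumed step size of the random walk in true strength
          strength = 1500.0 + np.cumsum(rng.normal(0.0, drift_sd, n_periods))

          def p_win(diff):
              # Elo-style winning probability as a function of the strength difference
              return 1.0 / (1.0 + 10 ** (-diff / 400.0))

          # Each period the player meets a 1500-rated opponent; only the win/loss
          # results are observed, and the rating system must track the drifting strength.
          opponent = 1500.0
          results = rng.binomial(1, p_win(strength - opponent))

          print(results[:10])
          print(strength[:10].round())
          ```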

        • David:

          I agree that this comes up in other estimation problems with time-varying parameters. To me, the special feature of chess ratings is that people seem to have the attitude that your current rating is supposed to represent a summary of who you are. I don’t see this in other sports. For example, when someone’s rated #38 in the world in tennis, it’s well understood that this is their current rating, which in turn is based on their recent record. Kind of like if a baseball team has a record of 39-60, that’s understood to be the team’s record, not its underlying ability. But in chess I have the impression that people take ratings more seriously, and I perceived some of this in Murphy’s remark about skill ratings, that these are not just some sort of adjusted won-lost records but have some deeper meaning.

        • That’s the whole point of ratings instead of rankings! The rating is an estimate of your current playing strength, regardless of who else is in the ranking list.

          If I go to a tournament in another state, I can enter the rating-limited events and expect to play players of a similar level to my own. If all I knew was that I ranked #100 in Massachusetts, this wouldn’t tell me what my ranking in California should be.

          How much better is the #38 tennis player than the #100 player? The numbers don’t tell me. But, if you tell me the #38 player is rated 3100 and the #100 is rated 2900, then I know it is unlikely that the #100 will beat the #38. Actually, they are probably closer in rating than 200 points, e.g.,

          https://www.ratingscentral.com/ProPlayerList.php?PlayerGender=M

          If I go to a tournament in California, and I’m rated 1800, and I beat a 2000 player, I know that I won a match that I was expected to lose.

          Many ranking systems use time windows or crude formulas, so they don’t accurately reflect how good the player is.

          > people seem to have the attitude that your current rating is supposed to represent a summary of who you are

          Indeed. It represents what your current playing strength is. Most players don’t change their level that quickly, and there is a very wide range of levels in recreational tournaments and clubs. So, a player who was 1800 five years ago is probably still around 1800 today.

          In his book, Elo compared chess players through history. I don’t believe his ratings are nearly stable enough to make such a comparison. However, for most players, their playing strength is stable over a period of years, so their rating (if they play enough rated events) will usually accurately reflect their playing strength.

          You mentioned baseball teams. All the teams in a professional league are pretty close in level. This is very different from the chess community or the table tennis community.

          A rough rule of thumb is that 200 rating points is a different level. So, if I’m 2000, then an 1800 player won’t be competitive with me. (The Ratings Central model says that the 1800 player has a 5% chance of winning.) So, the difference between a 2500 player and a 1000 player is huge.

        • David:

          Regarding rankings vs. ratings: sure, I see what you’re saying. But I don’t think it’s just that. The number of high-stakes matches is low enough, and ability changes fast enough, that I think that even if the rankings were transformed into ratings, players and fans would recognize their temporary nature. So a rating of 3100, say, would not be interpreted as meaning that this player is awesome in an absolute sense, but rather that the player is awesome right now, is in the middle of a hot streak, etc.

          I take your point about chess ratings, that for many players they will be stable most of the time.

        • > That’s the whole point of ratings instead of rankings! The rating is an estimate of your current playing strength, regardless of who else is in the ranking list.

          I don’t think this is really true. The absolute numbers of the Elo model are non-identifiable up to an arbitrary anchor point, and if communities are not sufficiently cross-pollinated by the occasional exhibition match, they will have some separation.

        • Andrew,

          I’m sorry, but I’m not following what you are saying. You first said that people interpret ratings as measuring something about themselves rather than just being their latest score. (I explained why they are correct to do this.) Then you said that people would not interpret ratings this way. Are you changing what you said before? Or, do you mean if ratings were introduced into other sports?

          > The number of high-stakes matches is low enough, and ability changes fast
          > enough, that I think that even if the rankings were transformed into
          > ratings, players and fans would recognize their temporary nature.

          What sport are you talking about? Table tennis is pretty similar to tennis. Here is one of the top European table tennis players:

          https://www.ratingscentral.com/HistoryGraph.php?PlayerID=5110

          That graph covers two decades. He’s getting older, but his level has been quite stable over periods of years. He’s probably been more stable than most players, but playing strength does not change quickly. In fact, the Ratings Central model works well, so this tells us how quickly playing strength changes.

          Ratings really do measure something intrinsic to the person. I haven’t played any table tennis for a couple of years, but if I started playing again, I’d probably only be 200 points worse than when I stopped. The rating scale is huge. And, once you learn a skill, it stays with you.

          Feel free to email me directly if you’d rather discuss this off the blog.

        • somebody,

          > The absolute numbers of the Elo model are non-identifiable up to an
          > arbitrary anchor point, and if communities are not sufficiently
          > cross-pollinated by the occasional exhibition match, they will have
          > some separation.

          Sure, the origin of the scale is arbitrary. But, once you pick it, the scale is determined by the probability of winning as a function of the rating difference. The way we specify the origin in practice is we specify the prior, then process the data.

          Sure, it helps if there is mixing, i.e., players playing other players in different groups and regions. But, if they don’t, you can still set the scale by setting the prior correctly.

        • David:

          Yeah, maybe what I’m saying is kind of incoherent. I’ll have to think about all of this more, I guess.

          Also, do you have answers to the 9 questions in the above post?

  6. For fun, people might recall that the originator of UNIX, Ken Thompson, with Joe Condon, spent a few years working on computer chess and his Belle software/hardware was computer chess champion for a few years.
    https://en.wikipedia.org/wiki/Belle_(chess_machine)

    Ken did an oral history on Belle for the Computer History Museum:
    https://archive.computerhistory.org/resources/access/text/2013/05/102657921-05-01-acc.pdf

    Pages 19 onward discuss his computing-based studies of endgames, including the 50-move rule.

  7. In around 1981 I played in the US Junior (chess) Open in Crossville, TN. I had a chance to talk to several players and organizers at the event, and heard the following (I have no firsthand data to substantiate it, but it seems plausible):

    There were scarcely any tournament players in Tennessee, and then an organizer started to develop scholastic programs, which were unexpectedly very successful. So initially there were a few rated adults and a LOT of unrated kids. The only place the kids could really get points was from the adults, and there weren’t enough to go around. So the adults’ ratings all crashed, and the kids’ ratings remained very, very low. They were, presumably, fairly accurate with respect to each other (though USCF imposes a minimum rating of 100 which may have interfered with accuracy) but they were very different from those, say, in AK (where I was from).

    So the playing community consisted mostly of kids, and the kids’ ratings were mostly in the 100-500 range (rather than the expected 1000+). And any adult who ventured into TN would have his rating points drained by a cloud of young mosquitoes, which made the problem worse, because a lot of adults didn’t care for that prospect! (I didn’t have a problem at the Junior Open because I played mainly higher-rated kids from out of state.)

    We see a similar effect in the Pacific Northwest where there is a scholastic-only Elo system as well as the USCF one. A player who avoids USCF tournaments and plays a lot of scholastic can be underrated by many hundreds of points, which adds a lot of noise to the system.

    So I think that community variation in the origin of the rating system scale is a real problem for USCF’s Elo implementation. There is no way to calibrate it across different communities that do not interact sufficiently.

    Personal anecdote on this: I played a pre-teen girl in a WA Women’s Championship who looked at the pairings and said to her coach, in a quavering voice, “She’s rated 1100 points higher than me! What do I do?” Her coach patted her head and said, “Play your best, that’s all you can do.” I barely managed to draw the game. Next time I saw her, just a few months later, I was rated 200 points above her instead of 1100. Admittedly, while a fairly strong player I am not at all a consistent one; but I also think this shows the noise that scholastic ratings can introduce into a situation (young, rapidly improving players) that is already innately noisy.

  8. Some of these issues are difficult for any rating system. But, some are specific to the USCF’s Elo. Elo is not Bayesian, so all the things that a Bayesian system handles automatically must be done using additional rules in an Elo-type system.
