The “Elo” rating system is a method most famous for ranking chess players, but which has now spread to many other sports and games.
How Elo works is like this: when you start out in competitive chess, the federation assigns you an arbitrary rating — either a standard starting rating (which I think is 1200), or one based on an estimate of your skill. Your rating then changes as you play.
What I gather from Wikipedia is that “master” starts at a rating of about 2300, and “grandmaster” around 2500. To get from the original 1200 up to the 2300 level, you just start winning games. Every game you play, your rating is adjusted up or down, depending on whether you win, lose, or draw. The size of the adjustment depends on the difference in skill between you and your opponent: Elo estimates your odds of winning from your rating and your opponent’s rating, and the loser “pays” points to the winner. So, the better your opponents, the more points you get for defeating them.
The rating is an estimate of your skill, a “true talent level” for chess. It’s calibrated so that every 400-point difference between players is an odds ratio of 10. So, when a 1900-rated player, “Ann”, faces a 1500-rated player, “Bob,” her odds of winning are 10:1 (.909). That means that if the underdog, Bob, wins, he’ll get 10 times as many points as Ann will get if she wins.
How many points, exactly? That’s set by the chess federation, in an attempt to get the ratings to converge on talent, consistent with the “400-point rule,” as quickly and accurately as possible. The idea is that the less information you have about the players, the more points you adjust by, because each result carries more weight towards your best estimate of talent.
For players below “expert,” the adjustment is 32 times the difference from expectation. For expert players, the multiplier drops to 24, and, at the master level and above, to 16.
If Bob happens to beat Ann, he won 1.00 games when the expectation was that he’d win only 0.09. So, Bob exceeded expectations by 0.91 wins. Multiply by 32, and you get 29 points. That means Bob’s rating jumps from 1500 to 1529, while Ann drops from 1900 to 1871.
If Ann had won, she’d claim 3 points from Bob, so she’d be at 1903 and Bob would wind up at 1497.
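The whole update rule fits in a few lines. Here’s a minimal Python sketch (the function names are mine, not from any federation’s spec) that reproduces the Ann/Bob numbers above:

```python
def expected_score(rating, opp_rating):
    # The "400-point rule": a 400-point edge means 10:1 odds,
    # which is this logistic curve.
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def updated_rating(rating, opp_rating, score, k=32):
    # Adjustment is K times the difference between the actual
    # result (1 = win, 0.5 = draw, 0 = loss) and the expectation.
    return rating + k * (score - expected_score(rating, opp_rating))

# Ann (1900) vs. Bob (1500): Ann is a .909 favorite.
print(round(expected_score(1900, 1500), 3))    # 0.909
# Bob upsets Ann: he jumps to ~1529, she falls to ~1871.
print(round(updated_rating(1500, 1900, 1.0)))  # 1529
print(round(updated_rating(1900, 1500, 0.0)))  # 1871
# If Ann wins instead, she picks up only ~3 points.
print(round(updated_rating(1900, 1500, 1.0)))  # 1903
```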
FiveThirtyEight recently started using Elo for their NFL and NBA ratings. It’s also used by my Scrabble app, and the world pinball rankings, and other such things. I haven’t looked it up, but I’d be surprised if it weren’t used for other games, too, like Backgammon and Go.
For the record, I’m not an expert on Elo, by any means … I got most of my understanding from Wikipedia, and other internet sources. And, a couple of days ago, Tango posted a link to an excellent article by Adam Dorhauer that explains it very well.
Despite my lack of expertise, it seems to me that these properties of Elo are clearly the case:
1. Elo ratings are only applicable to the particular game they’re calculated from. If you’re an 1800 at chess, and I’m a 1600 at Scrabble, we have no idea which one of us would win at either game.
2. The range of Elo ratings varies between games, depending on the range of talent of the competitors, but also on the amount of luck inherent in the sport. If the best team in the NBA is (say) an 8:1 favorite against the worst team in the league, it must be rated 361 Elo points better. (That’s because 10 to the power of (361/400) equals 8.) But if the best team in MLB is only a 2:1 favorite, it needs to be rated only 120 points better.
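The conversion from odds to rating points is just a base-10 logarithm. A quick sketch of the arithmetic in that paragraph:

```python
import math

def rating_gap(odds):
    # Elo gap implied by given win odds: every factor of 10 is 400 points.
    return 400 * math.log10(odds)

print(round(rating_gap(8)))  # 361 -- the hypothetical NBA case
print(round(rating_gap(2)))  # 120 -- the hypothetical MLB case
```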
Elo is an estimate of odds of winning. It doesn’t follow, then, that an 1800 rating in one sport is comparable to an 1800 rating in another sport. I’m a better pinball player than I am a Scrabble player, but my Scrabble rating is higher than my pinball rating. That’s because underdogs are more likely to win at pinball. I have a chance of beating the best pinball player in the world in a single game, but I’d have no chance at all against a world-class Scrabble player.
In other words: the more luck inherent in the game, the tighter the range (smaller the standard deviation) of Elo ratings.
3. Elo ratings are only applicable within the particular group that they’re applied to.
Last March, before the NCAA basketball tournament, FiveThirtyEight had Villanova with an Elo rating of 2045. Right now, they have the NBA’s Golden State Warriors with a rating of 1761.
Does that mean that Villanova was actually a better basketball team than Golden State? No, of course not. Villanova’s rating is relative to its NCAA competition, and Golden State’s rating is relative to its NBA competition.
If you took the ratings at face value, without realizing that, you’d be projecting Villanova as 5:1 favorites over Golden State. In reality, of course, if they faced each other, Villanova would get annihilated.
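Running the conversion the other way shows where the naive 5:1 number comes from (the ratings are FiveThirtyEight’s; the arithmetic is mine):

```python
def implied_odds(rating_gap):
    # Win odds implied by an Elo gap -- if the ratings were comparable.
    return 10 ** (rating_gap / 400)

# Villanova (2045) vs. Golden State (1761), taken at face value:
print(round(implied_odds(2045 - 1761), 1))  # ~5.1, i.e. roughly 5:1
```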
OK, this brings me to a study I found on the web (hat tip here). It claims that women do worse in chess games that they play against men rather than against women of equal skill. The hypothesis is, women’s play suffers because they find men intimidating and threatening.
(For instance: “Girls just don’t have the brains to play chess,” (male) grandmaster Nigel Short said in 2015.)
In an article about the paper, co-author Maria Cubel writes:
“These results are thus compatible with the theory of stereotype threat, which argues that when a group suffers from a negative stereotype, the anxiety experienced trying to avoid that stereotype, or just being aware of it, increases the probability of confirming the stereotype.
“As indicated above, expert chess is a strongly male-stereotyped environment. … Expert women chess players are highly professional. They have reached a high level of mastery and they have selected themselves into a clearly male-dominated field. If we find gender interaction effects in this very selective sample, it seems reasonable to expect larger gender differences in the whole population.”
Well, “stereotype threat” might be real, but I would argue that you don’t actually have evidence of it in this chess data. I don’t think the results actually mean what the authors claim they mean.
The authors examined a large database of chess results, and selected all players with a rating of at least 2000 (expert level) who played at least one game against an opponent of each of the two sexes.
After their regressions, the authors report,
“These results indicate that players earn, on average and ceteris paribus, about 0.04 fewer points [4 percentage points of win probability] when playing against a man as compared to when their opponent is a woman. Or conversely, men earn 0.04 points more when facing a female opponent than when facing a male opponent. This is a sizable effect, comparable to women playing with a 30 Elo point handicap when facing male opponents.”
The authors did control for Elo rating, of course. That was especially important because the women were, on average, less skilled than the men. The average male player in the study was rated at 2410, while the average female was only 2294. That’s a huge difference: if the average man played the average woman, the 116-point spread suggests the man would have a .661 winning percentage — roughly, 2:1 odds in favor of the man.
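Both the .661 figure and the paper’s “0.04 points” claim come straight from the 400-point rule. A sketch of the check:

```python
def win_prob(rating_gap):
    # Probability the higher-rated player wins, per the 400-point rule.
    return 1 / (1 + 10 ** (-rating_gap / 400))

# The average man (2410) vs. the average woman (2294):
print(round(win_prob(116), 3))        # 0.661, roughly 2:1
# The study's "30 Elo point handicap" in win-probability terms:
print(round(win_prob(30) - 0.5, 3))   # ~0.043, i.e. about 4 pct points
```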
Also, there were many more same-sex matches in the database than mixed-sex matches. There are two reasons for that. First, many tournaments are organized by ranking; since there are proportionally many more men in the higher ranks, they wind up playing each other more often. Second, and probably more important, there are many women-only tournaments and competitions.
So, now we see the obvious problem with the study, why it doesn’t show what the authors think it shows.
It’s the Villanova/Golden State situation, just better hidden.
The men and women have different levels of ability — and, for the most part, their ratings are based on play within their own group.
That means the men’s and women’s Elo ratings aren’t comparable, for exactly the same reason an NCAA Elo rating isn’t comparable to an NBA Elo rating. The women’s ratings are based more on their performance relative to the [less strong] women, and the men’s ratings more on their performance relative to the [stronger] men.
Of course, the bias isn’t as severe in the chess case as in the basketball case, because the women do play matches against men (while Villanova, of course, never plays against NBA teams). Still, both groups played predominantly within their own sex: the women 61 percent against other women, and the men 87 percent against other men.
So, clearly, there’s still substantial bias. The Elo ratings are only perfectly commensurable if the entire pool can be assumed to have faced a roughly equal caliber of competition. A smattering of mixed-sex play isn’t enough.
Villanova and Golden State would still have incompatible Elos even if they played, say, one out of every five games against each other. Because, then, for the rest of their games, Villanova would go play teams that are 1500 against NCAA competition, and Golden State would go play teams that are 1500 against NBA competition, and Villanova would have a much easier time of it.
Having said that … if you have enough mixed-sex games, the ratings should still work themselves out.
Because, the way Elo works, points can neither be created nor destroyed. If women play only women, and men play only men, on average, they’ll keep all the ratings points they started with, as a group. But if the men play even occasional games against the women, they’ll slowly scoop up ratings points from the women’s side to the men’s side. All that matters is *how many* of those games are played, not *what proportion*. The male-male and female-female games don’t make a huge difference, no matter how many there are.
The way Elo works, overrated players “leak” points to underrated players. No matter how wrong the ratings are to start, play enough games, and you’ll have enough “leaks” for all the ratings to converge on accuracy.
Even if 99% of women’s games are against other women, eventually, with enough games played, that 1% can add up to as many points as necessary, transferred from the women to the men, to make things work out.
So, do we have enough games, enough “leaks”, to get rid of the bias?
Suppose both groups, the men and the women, started out at 1200. But the men were better. They should have been 1250, and the women should have been 1150. The woman/woman games and man/man games will keep both averages at 1200, so we can ignore those. But the woman/man games will start “leaking” ratings points to the men’s side.
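That hypothetical is easy to simulate. Here’s a deterministic toy sketch (my own illustration, not from the study): one representative from each side, both starting at 1200, with a true 100-point talent gap. Using the expected result as each game’s score shows where the leak drives the ratings in the long run:

```python
def expected_score(rating, opp_rating):
    # The 400-point rule as a logistic curve.
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

K = 24                        # expert-level adjustment multiplier
man, woman = 1200.0, 1200.0   # both start at the same (wrong) rating
p_true = expected_score(1250, 1150)  # true win probability, ~0.64

for game in range(500):
    # The man scores p_true in expectation; Elo pays him the difference
    # between that and what the current (biased) ratings predict.
    adj = K * (p_true - expected_score(man, woman))
    man += adj
    woman -= adj

# The ratings drift to the true talent levels: the man "scoops up"
# 50 points and the woman leaks 50 away.
print(round(man), round(woman))  # 1250 1150
```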
Are there enough woman/man games in the database that the men could unbias the women’s ratings by capturing enough of their ratings points?
In the sample, there were 5,695 games by those women experts (rating 2000+) who played at least one man. Of those games, 61 percent were woman against woman. That leaves 2,221 games where expert women played (expert or inexpert) men.
By a similar calculation, there were 2,800 games where expert men played (expert or inexpert) women.
There’s probably lots of overlap in those two sets of games, where an expert man played an expert woman. Let’s assume the overlap is 1,500 games, so we’ll reduce the total to 3,500.
How much leakage do we get in 3,500 games?
Suppose the men really are exactly 116 points better in talent than the women, like their ratings indicate — which would be the case if the leakage did, in fact, take care of all the bias.
Now, consider what would have happened if there were no leakage. If the sexes played only each other, the women would be overrated by 116 points (since they’d have equal average ratings, but the men would be 116 points more talented).
Now, introduce mixed-sex games. The first time a woman played a man, she’d be the true underdog by 116 points. Her male opponent would have a .661 true win probability, but be treated by Elo as if he had only .500. So, the male group would gain .161 wins in expectation on that game. At 24 points per win, that’s 3.9 points.
After that game, the sum of ratings on the woman’s side drops by 3.9 points, so now, the women won’t be quite as overrated, and the advantage to the men will drop. But, to be conservative, let’s just keep it at 3.9 points all the way through the set of 3,500 games. Let’s even round it to 4 points.
Four points of leakage, multiplied by 3,500 games, is 14,000 ratings points moving from the women’s side to the men’s side.
There were about 2,000 male players in the study, and 500 female players. Let’s ignore their non-expert opponents, and assume all the leakage came from these 2,500 players.
That means the average female player would have (legitimately) lost 28 points due to leakage (14,000 divided by 500). The average male player would gain 7 points (14,000 divided by 2000).
So, that much leakage would have cut the male/female ratings bias by 35 points.
But, since we started the process with a known 116 points of bias, we’re left with 81 points still remaining! Even with such a large database of games, there aren’t enough male/female games to get rid of more than 30 percent of the Elo bias caused by unbalanced opponents.
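Here’s the back-of-envelope arithmetic collected in one place (all the inputs are the assumptions stated above):

```python
K = 24        # adjustment multiplier at the expert level
bias = 116    # points by which the women start out overrated
games = 3500  # estimated mixed-sex games in the sample

# Expected extra wins per mixed-sex game for the (underrated) men:
true_p = 1 / (1 + 10 ** (-bias / 400))  # ~0.661
leak_per_game = K * (true_p - 0.5)      # ~3.9 points; rounded up to 4

total_leak = 4 * games                  # 14,000 points change sides
women_drop = total_leak / 500           # 28 points per female player
men_gain = total_leak / 2000            # 7 points per male player
bias_removed = women_drop + men_gain    # 35 points of the gap closed

print(round(leak_per_game, 1))          # 3.9
print(bias - bias_removed)              # 81.0 points of bias left
```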
If the true bias should be 81 points, why did the study find only 30?
Because the sample of games in the study isn’t a complete set of all games that went into every player’s rating. For one thing, it’s just the results of major tournaments, the ones significant enough to appear in “The Week in Chess,” the publication from which the authors compiled their data. For another, the authors used only 18 months’ worth of data, but most of these expert players have been playing chess for years.
If we included all the games that all the players ever played, would that be enough to get rid of the bias? We can’t tell, because we don’t know the number of mixed-sex games in the players’ full careers.
We can reason hypothetically, though. If the average expert played three times as many games as logged in this 18-month sample, that still wouldn’t be enough: it would only cover 105 of the 116 points. Actually, it would be a lot less, because once the ratings start to become accurate, the rate of correction decelerates. By the time half the bias is covered, the remaining bias corrects at only 2 points per mixed-sex game, rather than 4.
Maybe we can do this with a geometric argument. The data in the sample reduced the bias from 116 to 81, which is 70 percent of the original. So, a second set of data would reduce the bias to 57 points. A third set would reduce it to 40 points. And a fourth set would reduce it to 28 points, which is about what the study found.
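The geometric argument is easy to check numerically (the 70 percent retention factor comes from 81/116, as above):

```python
bias = 116.0
for batch in range(1, 5):
    bias *= 0.70  # each equal-sized batch of data removes ~30% of what's left
    print(batch, round(bias))
# 1 81
# 2 57
# 3 40
# 4 28  -- four samples' worth still leaves about what the study found
```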
So, if every player in this study actually had four times as many man vs. woman games as were in this database, that would *still* not be enough to reduce the bias below what was found in the study.
And, again, that’s conservative. It assumes the same players in all four samples. In real life, new players come in all the time, and if the new males tend to be better than the new females, that would start the bias all over again.
So, I can’t prove, mathematically, that the 30-point discrepancy the authors found is an expected artifact of the way the rating system works. I can only show why it should be strongly suspected.
It’s a combination of the fact that, for whatever reason, the men are stronger players than the women, and, again for whatever reason, there are many fewer male-female games than you need for the kind of balanced schedule that would make the ratings comparable.
And while we can’t say for sure that this is the cause, we CAN say — almost prove — that this is exactly the kind of bias that happens, mathematically, unless you have enough male-female games to wash it out.
I think the burden is on the authors of the study to show that there’s enough data outside their sample to wash out that inherent bias, before introducing alternative hypotheses. Because, we *know* this specific effect exists, has positive sign, depends on data that’s not given in the study, and could plausibly be exactly the size of the observed effect!
(Assuming I got all this right. As always, I might have screwed up.)
So I think there’s a very strong case that what this study found is just a form of this “NCAA vs. NBA” bias. It’s an effect that must exist — it’s just the size that we don’t know. But intuitive arguments suggest the size is plausibly pretty close to what the study found.
So it’s probably not that women perform worse against men of equal talent. It’s that women perform worse against men of equal ratings.
Peter Backus, Maria Cubel, Matej Guid, Santiago Sanches-Pages, Enrique López Mañas: Gender, Competition and Performance: Evidence from Real Tournaments.