Why all your career WAR leaderboards are wrong.
To be a baseball fan is to be obsessed with numbers. 3000, 60, 500, 300, 3.00, .300: we all know what these numbers represent. (And what is it with the number three that makes it come up so much?) But now we have a whole new set of numbers, less well defined by arbitrary benchmarks: WAR (both the f and b versions), xwOBA, FIP, xFIP, xERA, WPA. The column names of baseball leaderboards look like spoonfuls of alphabet soup drawn at random. But, as fans of the sport, we've grown accustomed to these new names and their meanings.
One such statistic we've now embraced is WAR, or Wins Above Replacement. It is the one reductive number that captures everything: fielding, hitting, pitching, replacement level, positional value, park factors. Everything from trade talks to MVP discussions to Hall of Fame ballots now revolves around WAR. Unless the WAR numbers are close, not much is made of the context behind them, even when comparing across generations. And that's wrong.
Major League Baseball is now about 150 years old. The game has been through so many changes and has so much history that Ken Burns can make a 10-part documentary about it and still hardly scratch the surface. Gloves have changed from things resembling biker gloves and oven mitts to thin leather wrappings for each finger to what we see today. Yet despite these evolutions, it is common practice to rank players separated by two lifetimes by a single number: career WAR.
We know that hits, home runs, strikeouts, and pitcher wins are all shaped by the era in which they came. Strikeouts and homers are easier to come by today, while hits and pitcher wins are harder to find now than in years past. But what about WAR? The various adjustments and normalizations done across the league should make it less prone to changes over time, but how true is that?
I decided to dig into the historical data for fWAR (FanGraphs WAR) for hitters and tackle this question by inspecting the per-season variance in fWAR among qualified players. This represents 15,718 qualified player seasons across 151 seasons. The variance tells us just how much spread there was in per-season WAR. The higher that number, the wider the distribution, meaning more players at the low or high end of the WAR spectrum. I've plotted this variance in the graph below.
This plot clearly shows that fWAR variance is not consistent over time; it also fails formal tests of equal variance, such as Levene's test. Because the variance is unequal, simply adding up season-by-season WAR totals is not a valid way to build an all-time ranking.
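For readers who want to try this check themselves, here is a minimal sketch of the two steps described above: grouping WAR values by season to get a per-season variance, and computing Levene's W statistic across seasons. It is not the code behind the plot. The function names, the `(season, war)` row layout, and the toy numbers are all made up for illustration, and it assumes the sample variance and the median-centered form of the test. It stops at the W statistic; a p-value needs an F distribution, which a stats library such as SciPy (`scipy.stats.levene`) provides.

```python
from collections import defaultdict
from statistics import mean, median, variance


def variance_by_season(rows):
    """Return {season: sample variance of WAR} from (season, war) pairs."""
    by_season = defaultdict(list)
    for season, war in rows:
        by_season[season].append(war)
    return {s: variance(v) for s, v in sorted(by_season.items())}


def levene_w(groups):
    """Levene's W statistic using absolute deviations from each group's median."""
    # Step 1: replace every value by its distance from its own group's median.
    devs = []
    for g in groups:
        center = median(g)
        devs.append([abs(x - center) for x in g])

    n_total = sum(len(d) for d in devs)
    k = len(devs)
    group_means = [mean(d) for d in devs]
    grand_mean = mean([x for d in devs for x in d])

    # Step 2: one-way ANOVA F ratio computed on those deviations.
    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, group_means))
    within = sum((x - m) ** 2 for d, m in zip(devs, group_means) for x in d)
    return (n_total - k) / (k - 1) * between / within


# Toy, made-up data: one wide season and one narrow season.
toy_rows = [
    (1923, 1.0), (1923, 4.0), (1923, 10.0),
    (1968, 2.0), (1968, 3.0), (1968, 4.0),
]
toy_variances = variance_by_season(toy_rows)
toy_w = levene_w([[1.0, 4.0, 10.0], [2.0, 3.0, 4.0]])
```

A larger W means the seasons' spreads differ more than chance alone would suggest; with real data you would pass one list of qualified-player fWAR values per season.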
Let's use a thought experiment to see why not only the mean (which is pegged to replacement level and is the same across eras) but also the variance matters when judging players across eras. Imagine you have two lakes stocked with fish and you're going to have a fishing contest judged by the total poundage of fish each person catches over a day. Both lakes have been stocked with fish that have the same average weight, let's say four pounds. But, for whatever reason, the variance in fish weight is higher in the second of the two lakes. Let's say one standard deviation (the square root of the variance) is just 0.5 pounds in lake A, while it is 1.0 pounds in lake B. Below is a randomly generated distribution of the fish weights in the two lakes.
Not thinking about this, we randomly assign fishermen to each lake and let them start the competition. Now, imagine you have two highly skilled fishermen, one in each lake. Each fisherman easily catches a lot of fish and consistently catches fish above the lake's average weight. Because the fisherman in lake B has a wider weight distribution to draw from, he has an advantage in a competition where the goal is to accumulate the highest total fish weight. To illustrate this, I ran a pseudo-competition 100 times. In each competition, both of these great fishermen catch only fish above the average weight, and they always catch 20 fish. Over this sample of 100 pseudo-competitions, the fisherman in lake A averages 88.4 pounds of fish, while the fisherman in lake B averages 95.7 pounds. In fact, over these 100 competitions the fisherman in lake A never wins.
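The pseudo-competition is easy to approximate. The sketch below is one way to set it up, not the original simulation: it assumes fish weights in each lake are normally distributed (mean 4 pounds, standard deviation 0.5 or 1.0), has each fisherman keep drawing until he has 20 above-average fish, and repeats that 100 times. The function names and the seed are arbitrary choices for illustration.

```python
import random


def catch_total(rng, mean_weight, sd, n_fish=20):
    """Total weight of n_fish catches, keeping only above-average fish."""
    total = 0.0
    caught = 0
    while caught < n_fish:
        weight = rng.gauss(mean_weight, sd)
        if weight > mean_weight:  # the skilled fisherman throws the rest back
            total += weight
            caught += 1
    return total


def run_contests(n_contests=100, seed=0):
    """Return (average total in lake A, average total in lake B, lake A wins)."""
    rng = random.Random(seed)
    a_totals, b_totals = [], []
    for _ in range(n_contests):
        a_totals.append(catch_total(rng, 4.0, 0.5))  # lake A: narrow spread
        b_totals.append(catch_total(rng, 4.0, 1.0))  # lake B: wide spread
    a_wins = sum(a > b for a, b in zip(a_totals, b_totals))
    return sum(a_totals) / n_contests, sum(b_totals) / n_contests, a_wins


avg_a, avg_b, a_wins = run_contests()
```

Under these assumptions the lake B average should land several pounds above the lake A average, in the same neighborhood as the figures quoted above. The exact numbers depend on the seed, and lake A can steal an occasional contest on some runs.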
So, what does this mean for baseball? It means that if you're just adding up season-by-season WAR, and season-by-season WAR has unequal variance across time, then above-average players from higher-variance eras will seem better than their low-variance-era peers.
These types of situations aren't new to statisticians, who have developed models to equalize variance by essentially shrinking it into a standard range. This would cause the very high (or low) values in high-variance eras to get 'shrunk' towards the mean.
I'm not going to go through a full variance adjustment of that kind here, but one quick and dirty way to normalize for unequal variance is to divide each player's single-season WAR by the standard deviation of the WAR distribution for that year, which is loosely what's called a Z-score. (Strictly speaking, a Z-score also subtracts the mean before dividing.) What this gives you is a measure of how many standard deviations a player's performance was worth in that season. This generally works better than forcing distributions into a particular range or going by a strict percentile system (i.e., the top player is always capped at 10 WAR, or we just add up percentile ranks). That is because occasionally a truly amazing player, like Babe Ruth, goes so far off the scale in a high-variance era that we don't want to punish him by limiting him to the same value as the best player in any other season.
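As a rough sketch of that scaling step, the snippet below divides each season's WAR by the standard deviation of WAR among all players in the same season. The function name, the `(name, season, war)` tuple layout, and the players and values are hypothetical, and it assumes the sample standard deviation (the population version would work the same way with `pstdev`).

```python
from collections import defaultdict
from statistics import stdev


def scale_by_season_sd(rows):
    """Append WAR / (that season's standard deviation) to each (name, season, war) row."""
    by_season = defaultdict(list)
    for _name, season, war in rows:
        by_season[season].append(war)

    # One standard deviation per season, computed over every player in that season.
    season_sd = {season: stdev(wars) for season, wars in by_season.items()}

    return [(name, season, war, war / season_sd[season]) for name, season, war in rows]


# Made-up players and values: the second season has twice the spread of the first.
toy_seasons = [
    ("Player A", 2001, 2.0), ("Player B", 2001, 4.0), ("Player C", 2001, 6.0),
    ("Player A", 2002, 1.0), ("Player B", 2002, 5.0), ("Player C", 2002, 9.0),
]
scaled = scale_by_season_sd(toy_seasons)
```

In this toy data, Player C's 9.0 in the wide-spread season ends up worth less, in standard-deviation units, than his 6.0 in the narrow-spread season, which is exactly the correction the scaling is meant to make.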
Below is a table of the single best seasons by this new Z-score of fWAR, which I will of course call zfWAR!
| Name | Season | fWAR | zfWAR |
|---|---|---|---|
| Babe Ruth | 1923 | 15 | 5.9 |
| Babe Ruth | 1926 | 12 | 5.8 |
| Barry Bonds | 2004 | 11.9 | 5.6 |
| Babe Ruth | 1921 | 13.9 | 5.6 |
| Barry Bonds | 2002 | 12.7 | 5.6 |
| Honus Wagner | 1908 | 11.8 | 5.5 |
| Willie Mays | 1962 | 10.5 | 5.5 |
| Barry Bonds | 2001 | 12.5 | 5.4 |
| Ty Cobb | 1911 | 11 | 5.3 |
| Ty Cobb | 1915 | 9.8 | 5.3 |
| Cal Ripken | 1991 | 10.6 | 5.2 |
| Babe Ruth | 1920 | 13.3 | 5.2 |
| Mickey Mantle | 1956 | 11.5 | 5.2 |
| Ty Cobb | 1917 | 11.5 | 5.2 |
| Rogers Hornsby | 1925 | 10.8 | 5.2 |
| Ted Williams | 1946 | 11.8 | 5.1 |
| Honus Wagner | 1907 | 9.2 | 5.1 |
| Ted Williams | 1947 | 10.5 | 5.1 |
| Willie Mays | 1965 | 10.7 | 5.1 |
This is still very much a list of the well-known best seasons ever, which serves as a nice sanity check. We aren't seeing random seasons with relatively low fWAR totals get divided by a small standard deviation to yield a big zfWAR number. However, this list is a bit less dominated by Ruth and Bonds, and we see things like Mays showing up twice in the top 20, when his best season by raw fWAR ranks only 30th. We also get Cal Ripken on the list for his 1991 season, which I think is great given the absolute dearth of players between Mays/Yaz and Bonds, chronologically, in the top WAR lists.
Finally, I've added all these zfWAR totals up to give our career zfWAR rankings after 1900. Below are the top 50.
| Rank | Name | Career fWAR | Career zfWAR |
|---|---|---|---|
| 1 | Barry Bonds | 150.8 | 70.4 |
| 2 | Ty Cobb | 138 | 65.6 |
| 3 | Willie Mays | 143.7 | 65.5 |
| 4 | Babe Ruth | 150.6 | 62.3 |
| 5 | Tris Speaker | 130.6 | 61.0 |
| 6 | Hank Aaron | 129 | 59.2 |
| 7 | Honus Wagner | 121.1 | 58.1 |
| 8 | Rogers Hornsby | 123.3 | 53.7 |
| 9 | Eddie Collins | 112.7 | 53.3 |
| 10 | Mike Schmidt | 102.9 | 52.1 |
| 11 | Stan Musial | 116.8 | 51.7 |
| 12 | Alex Rodriguez | 110.8 | 51.0 |
| 13 | Ted Williams | 113.1 | 49.0 |
| 14 | Lou Gehrig | 116.2 | 48.5 |
| 15 | Rickey Henderson | 95.5 | 47.5 |
| 16 | Mel Ott | 108.1 | 46.3 |
| 17 | Frank Robinson | 100.4 | 46.0 |
| 18 | Joe Morgan | 94 | 44.8 |
| 19 | Carl Yastrzemski | 93.7 | 44.5 |
| 20 | Cal Ripken | 88.9 | 43.9 |
| 21 | Eddie Mathews | 95.9 | 43.1 |
| 22 | Mickey Mantle | 95.5 | 42.1 |
| 23 | Albert Pujols | 87 | 41.4 |
| 24 | Wade Boggs | 81.4 | 40.9 |
| 25 | Jimmie Foxx | 95 | 40.6 |
| 26 | George Brett | 81.3 | 40.6 |
| 27 | Mike Trout | 74.9 | 38.3 |
| 28 | Pete Rose | 79.5 | 38.1 |
| 29 | Nap Lajoie | 78.5 | 37.3 |
| 30 | Jeff Bagwell | 80 | 37.3 |
| 31 | Brooks Robinson | 78.6 | 36.9 |
| 32 | Adrian Beltre | 77.7 | 36.9 |
| 33 | Eddie Murray | 73 | 36.6 |
| 34 | Reggie Jackson | 75.3 | 36.4 |
| 35 | Joe DiMaggio | 78.3 | 34.5 |
| 36 | Derek Jeter | 73.9 | 34.3 |
| 37 | Ken Griffey Jr. | 72.3 | 33.8 |
| 38 | Miguel Cabrera | 68.7 | 33.5 |
| 39 | Sam Crawford | 68.2 | 33.3 |
| 40 | Ron Santo | 71.6 | 33.1 |
| 41 | Robin Yount | 65.8 | 33.1 |
| 42 | Rafael Palmeiro | 69.8 | 32.7 |
| 43 | Gary Carter | 64.6 | 32.5 |
| 44 | Charlie Gehringer | 76.8 | 32.3 |
| 45 | Frank Thomas | 67.2 | 32.1 |
| 46 | Roberto Clemente | 68.6 | 32.0 |
| 47 | Chipper Jones | 70.6 | 31.9 |
| 48 | Rod Carew | 65.7 | 31.8 |
| 49 | Ozzie Smith | 62.7 | 31.6 |
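For anyone who wants to rebuild a ranking like this from their own per-season values, the career step is just a sum per player followed by a sort. The sketch below assumes rows of `(name, season, zfwar)`; the function name, players, and values are placeholders, not figures from the table above.

```python
from collections import defaultdict


def career_totals(season_rows):
    """Sum per-season zfWAR by player and return (name, total) pairs, best first."""
    totals = defaultdict(float)
    for name, _season, zfwar in season_rows:
        totals[name] += zfwar
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder players and per-season values, for illustration only.
toy_zfwar = [
    ("Player A", 2001, 3.0),
    ("Player B", 2001, 4.0),
    ("Player A", 2002, 2.5),
]
ranking = career_totals(toy_zfwar)
```

Here Player B has the best single season, but Player A tops the career list because two good seasons add up to more than one great one.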
There you have it, your variance-normalized career WAR leaderboard.
