By Andrew Cohen
With the conclusion of the US Open, the last tennis major of the year, and the return of football on Sundays, tennis has returned to its status as an afterthought for American sports fans. Interestingly, tennis has also been an afterthought for those who look at sports from a quantitative point of view. While plenty of other sports have experienced statistical revolutions, there has been very little analytical research on pro tennis. With this post, I aim to do my part in changing that status quo.
The world rankings administered by the men’s ATP Tour are used by players and fans around the globe to gauge player talent and measure relative success. Behind the world rankings lies a complicated system that awards players points for winning matches, scaled to the significance of the match. Ideally, the ranking system assigns points in accordance with overall player talent. However, factors such as injuries and playing schedules may prevent this from being the case. Match statistics do exist that attempt to quantify talent, such as how often a player aces his opponent or wins in break point (“clutch”) situations. If talent truly is reflected in the rankings, one would hypothesize that better match statistics go along with a higher ranking. Is this the case? Using multiple linear regression, I seek to assess how well five match statistics predict ATP ranking points.
I took match statistic data from tennisinsight.com and looked at the current top 100 players as of the week of April 12th. The five variables studied to predict ranking points were: number of aces per individual service game (aces per game), the percentage of times the player won his service game (service hold %), the percentage of points the player won when not serving (return points won %), the percentage of points won when, on his serve, his opponent had game point (break points saved %), and the percentage of points won when, on his opponent’s serve, he had game point (break points won %). Because of the exponential nature of the ranking system (point values increase exponentially as players progress through tournaments), the ranking points response variable is heavily right-skewed and was therefore assessed after a logarithmic transformation. All five variables were tested at a significance level of P<0.05.
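For readers who want to try a similar analysis, the setup described above can be sketched roughly as follows. This is a minimal illustration and not the code used for this study: the data are synthetic placeholders generated from a made-up rule, the column names and the helper functions `solve` and `ols` are my own, and the cutoff of about 1.99 is the approximate two-sided 5% critical t-value for 94 degrees of freedom (100 players minus 6 fitted coefficients).

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares. X must already contain an intercept column.
    Returns (coefficients, t_statistics)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(k)) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)  # residual variance
    t_stats = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        inv_jj = solve(xtx, unit)[j]  # j-th diagonal entry of (X'X)^-1
        t_stats.append(beta[j] / math.sqrt(sigma2 * inv_jj))
    return beta, t_stats

# Synthetic stand-in for the top 100 players (NOT real tennisinsight.com data).
random.seed(0)
names = ["intercept", "aces_per_game", "service_hold_pct",
         "return_points_won_pct", "break_points_saved_pct", "break_points_won_pct"]
rows, log_points = [], []
for _ in range(100):
    aces = random.uniform(0.1, 1.2)
    hold = random.uniform(65, 92)
    ret = random.uniform(30, 44)
    saved = random.uniform(50, 72)
    won = random.uniform(30, 48)
    # Made-up generating rule for illustration only: points depend on hold and return.
    points = math.exp(1.0 + 0.05 * hold + 0.06 * ret + random.gauss(0, 0.3))
    rows.append([1.0, aces, hold, ret, saved, won])
    log_points.append(math.log(points))  # log transform tames the right skew

beta, t_stats = ols(rows, log_points)
T_CRIT = 1.99  # approximate two-sided 5% critical value for 94 degrees of freedom
for name, b, t in zip(names, beta, t_stats):
    flag = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{name:24s} coef={b:8.4f}  t={t:7.2f}  {flag}")
```

With real data, the rows would hold each player's five match statistics and `log_points` the natural log of his ranking points. A statistics package would also report exact p-values instead of comparing against a fixed cutoff.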
Two of the five variables turned out to be significant predictors of ranking points: service hold % and return points won %. Service hold % is the more statistically significant of the two (it had the larger t-statistic). As we can see from the graph below, which specifically examines service hold %, the tennis players who can hold serve sit atop the world rankings.
The variables found significant in this study are relatively unsurprising. While some players may not strictly adhere to these rules (like the big serving but low ranked Ivo Karlovic), they are the exceptions. For the most part, the top servers and returners are the big names that make deep runs into the slams.
Perhaps the most interesting result of this study is that the other three variables studied (aces per game, break points saved %, break points won %) were not significant predictors of world ranking. It is certainly surprising that player performance in break point situations does not significantly predict world ranking. If you were to look at who leads the ATP Tour in break point stats early this season, names such as Rafael Nadal, Roger Federer, and Andy Roddick would appear frequently. Break points are considered the “clutch” moments in tennis, and the players who win them often gain insurmountable advantages in matches. One would assume that winning or fighting off break points (captured by break points won % and break points saved %) would result in match wins, which would in turn increase ranking points. As an avid Andy Roddick fan, I figure I will still have a tough time shaking off his errant shots on break points, despite the knowledge of this study.
It is slightly less surprising that aces per game is not a significant predictor of ranking points. While aces are a surefire way of winning points, it is well known that the best servers in tennis are not always the best players. Having an effective serve helps, but often the best servers are powerful players who are usually larger and therefore lack the quickness and other attributes required for success in tennis. A look at the top 10 list in this statistical category supports the finding (ATP ranking in parentheses): Ivo Karlovic (28), John Isner (22), Ivan Ljubicic (14), Sam Querrey (25), Andy Roddick (7), Ernests Gulbis (44), Michael Llodra (66), Jo-Wilfried Tsonga (10), Feliciano Lopez (35), Rajeev Ram (95).
Perhaps the largest weakness and limitation of this study is its inability to prove causation. This is because the study is simply observational. The significant predictor variables in this study are simply associated with player ranking points. Causation may exist for one or more of these variables, but it cannot be proven here. A randomized experiment, or other studies that confirm the association found, would be needed to establish causation. Although only two of the five predictor variables are significant, all five variables are correlated with each other to varying degrees. Confounding variables may also exist, such as a particular skill or trait that influences multiple predictors. It should also be noted that the two significant predictor variables are proxies for a tennis pro’s skillset, and they should not be treated as exclusively predicting future performance. Another potential shortcoming of this study is that, other than aces per game, the predictor variables rely heavily on the quality of the player’s opponent. Therefore, a player who does not play as difficult a schedule as the top players can accumulate inflated statistics.
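To make the point about correlated predictors concrete, here is a small sketch of how one might check it. Everything in it is assumed for illustration: the hidden `skill` variable, the numbers used to generate the two statistics, and the helper `pearson` are mine, not part of the study's data. It shows how a single underlying trait can push two predictors up together, and how that overlap is commonly summarized with a correlation coefficient and, for a two-predictor model, a variance inflation factor (VIF) of 1 / (1 - r^2).

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic example (NOT real player data): one hidden trait drives both statistics.
random.seed(1)
skill = [random.gauss(0, 1) for _ in range(100)]
service_hold_pct = [80 + 4 * s + random.gauss(0, 2) for s in skill]
return_points_won_pct = [37 + 2 * s + random.gauss(0, 1.5) for s in skill]

r = pearson(service_hold_pct, return_points_won_pct)
vif = 1.0 / (1.0 - r * r)  # VIF when these are the only two predictors in the model
print(f"correlation between predictors: {r:.2f}")
print(f"variance inflation factor:      {vif:.2f}")
```

The larger the VIF, the less precisely a regression can separate the effect of one predictor from the other, which is one reason a truly relevant statistic can still fail a significance test.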
Future studies could examine other predictor variables, specifically ones that look at player attributes rather than match statistics. The players who constitute the upper tier of today’s tennis rankings are much taller and more heavily concentrated in Europe than in past years. It would be interesting to see if any of these player attributes were predictive of ranking points. Logistic regression, instead of multiple linear regression, would probably make for a better study, since it could examine the likelihood of winning or advancing in tournaments.
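As a rough idea of what the logistic regression approach could look like, the sketch below fits a one-variable model for the probability of winning a match. It is only an illustration under assumptions of my own: the predictor (the difference in service hold % between two opponents), the synthetic match outcomes, and the helper `fit_logistic` are all hypothetical, and a real study would use actual match results and a statistics package.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=2000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = yi - sigmoid(b0 + b1 * xi)
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic matches (NOT real results): x is a hypothetical difference in
# service hold % between the player and his opponent, y is 1 if the player won.
random.seed(2)
hold_diff = [random.uniform(-10, 10) for _ in range(500)]
won = [1 if random.random() < sigmoid(0.3 * d) else 0 for d in hold_diff]

b0, b1 = fit_logistic(hold_diff, won)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
print(f"estimated win probability with a 5-point hold edge: {sigmoid(b0 + 5 * b1):.2f}")
```

The appeal of this setup is that the output is a probability of winning, which matches the question fans actually care about more directly than a prediction of log ranking points.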