By John Ezekowitz
(Ed Note: Please use our updated model and predictions for 2012, found here)
This time last year, I attempted to create a logistic model that would predict those first round NCAA Tournament upsets that everyone loves so much. The model was not bad: it correctly identified Murray State over Vanderbilt as a good choice and suggested St. Mary’s run to the Sweet 16. This year, though, I thought I could make some significant improvements.
So I started from scratch this year, culling data from every first round game from the 2004 through 2010 Tournaments. I decided to focus on games with at least a 5-seedline differential: the matchups from 3 seeds against 14 seeds through 6 seeds against 11 seeds were the “upsets” I examined. Sooner rather than later (perhaps even this year), another 15 seed will beat a 2 seed and a 16 seed will finally beat a 1 seed, but for this analysis, I chose to focus on upsets I had data on. This left me with 224 teams and 112 games in my dataset.
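The dataset size follows directly from the bracket structure. Here is a quick check of the arithmetic, assuming the standard four regions per tournament (the variable names are mine, chosen for illustration):

```python
# Count the first round games covered by the dataset described above.
tournaments = range(2004, 2011)      # 2004 through 2010, inclusive
regions_per_tournament = 4           # assumption: standard four-region bracket
seed_pairings = [(3, 14), (4, 13), (5, 12), (6, 11)]

games = len(tournaments) * regions_per_tournament * len(seed_pairings)
teams = 2 * games                    # each game has one favorite and one underdog

print(games, teams)  # 112 224
```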
The major improvement I have made over last year is that the model now accounts for strength of opponent. In addition to including the Four Factors of each team in the regression, I also included the Four Factors of their opponent. This allowed me to account for interactions between team and opponent strengths and weaknesses, better modeling the matchup aspect of upsets.
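To make the structure concrete, here is a minimal sketch of how a logistic model can take both a team's stats and its opponent's stats as inputs. The coefficient values and feature names below are placeholders I made up for illustration; they are not the fitted values from my model, and only a few of the inputs are shown.

```python
import math

# HYPOTHETICAL coefficients, for illustration only. These are NOT the
# fitted values from the model described in this post.
PLACEHOLDER_COEFS = {
    "intercept": -1.0,
    "team_to_rate": -0.10,    # underdog's turnover rate
    "opp_to_rate": 0.10,      # favorite's turnover rate
    "team_oreb_rate": 0.05,   # underdog's offensive rebounding rate
    "opp_oreb_rate": -0.05,   # favorite's offensive rebounding rate
}

def upset_probability(features, coefs=PLACEHOLDER_COEFS):
    """Logistic model: P(upset) = 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    z = coefs["intercept"]
    for name, value in features.items():
        z += coefs[name] * value   # team and opponent stats enter side by side
    return 1.0 / (1.0 + math.exp(-z))

# Made-up matchup: the underdog protects the ball better than the favorite.
example = {
    "team_to_rate": 17.0,
    "opp_to_rate": 21.0,
    "team_oreb_rate": 35.0,
    "opp_oreb_rate": 30.0,
}
print(round(upset_probability(example), 3))
```

Because the opponent's numbers are inputs in their own right, the same underdog gets a different upset probability against different favorites, which is the matchup effect I was after.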
In the initial regression, I also included team and opponent Pythagorean Strength of Schedule and team and opponent consistency. Consistency, derived by Ken Pomeroy, measures how consistent a team’s performance is by looking at the (adjusted) standard deviation of their scoring margins in games. Interestingly, none of the shooting percentage stats (effective field goal percentage for and against) were significant predictors of upsets. Neither was free throw rate or seed. The best predictors were turnover rates, rebounding rates, and Strength of Schedule.
Finally, I used a bootstrapping procedure, which repeatedly draws a random sample of the data and uses the model to predict which teams in that sample would win. Since the actual results of those games are already known, STATA can assess how well the model performs. This gives a better sense of the model’s external validity. The full model data is to the left.
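For readers who want to see the idea behind the bootstrap, here is a minimal sketch in Python rather than STATA. It is not the exact procedure I ran: the function name and defaults are my own, and it scores a fixed set of predictions on each resample, whereas a fuller validation would also refit the model each time.

```python
import random

def bootstrap_accuracy(actual, predicted, n_resamples=1000, seed=0):
    """Resample games with replacement and score the predictions each time.

    Returns (mean accuracy, 2.5th percentile, 97.5th percentile).
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal length")
    rng = random.Random(seed)
    n = len(actual)
    accuracies = []
    for _ in range(n_resamples):
        # Draw n game indices with replacement from the original sample.
        idx = [rng.randrange(n) for _ in range(n)]
        correct = sum(actual[i] == predicted[i] for i in idx)
        accuracies.append(correct / n)
    accuracies.sort()
    mean = sum(accuracies) / n_resamples
    low = accuracies[int(0.025 * n_resamples)]
    high = accuracies[min(int(0.975 * n_resamples), n_resamples - 1)]
    return mean, low, high

# Toy data (1 = upset, 0 = favorite won); not taken from the real dataset.
actual = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
print(bootstrap_accuracy(actual, predicted))
```

The spread between the low and high percentiles gives a rough sense of how much the model's hit rate depends on which games happened to be in the sample.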
So now the big reveal: which teams are most likely to pull upsets? Here is the full table:
As you can see, these numbers differ substantially from what other projection systems (and the lines) have projected. Marquette is a team that is very good in turnover differential (forcing turnovers while not committing them) and has played a strong schedule. Xavier, on the other hand, is probably overseeded and has not played as strong a schedule. Additionally, Georgetown looks like a weak 6 seed no matter who they face: VCU or USC.
The differences arise because those systems look at the team as a whole over the course of the year and also account for factors like injuries, geography, etc. that I do not take into account here. Thus this model may underrate Gonzaga and Wofford, as St John’s (DJ Kennedy) and BYU (Brandon Davies) will be without key players. But while those systems may be better in the long run, this model is specifically geared towards predicting first round upsets. It compares this year’s teams directly to teams that played similar games in recent NCAA Tournaments.
The true chance for upsets probably lies somewhere in between the two models, but I certainly think this approach adds value. Please leave any thoughts for improvement or questions in the comment section, and good luck bracketing.