Using Bradley Terry to model NBA Jump Ball
Posted on August 16, 2020 in NBA-Basketball
Premise and Background
I was curious whether one could use a Bradley-Terry model to estimate the probability of an NBA center winning a jump ball. A Bradley-Terry model is a type of logistic regression for paired comparisons between individuals. A common explanation is that it may be difficult for a person to rank many beers at a brewery, but a person could sample each beer and rank a pair of them. The Bradley-Terry model can then be used to turn those pairwise results into a full ranking.
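To make the pairwise-to-ranking idea concrete, here is a minimal sketch in Python that fits Bradley-Terry strengths with the standard iterative (minorization-maximization) update. It is only an illustration on made-up beer comparisons: the function name `fit_bradley_terry` and the toy data are assumptions, and it computes a plain maximum-likelihood fit rather than the bias-reduced fit used later in R.

```python
from collections import defaultdict


def fit_bradley_terry(results, iterations=200):
    """Estimate Bradley-Terry strengths from a list of (winner, loser) pairs.

    Returns a dict mapping each item to a positive strength; strengths sum to 1.
    Assumes every item has at least one win and one loss, otherwise the
    maximum-likelihood estimate does not exist.
    """
    wins = defaultdict(int)    # total wins per item
    games = defaultdict(int)   # number of comparisons per unordered pair
    items = set()
    for winner, loser in results:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strengths = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strengths = {i: s / total for i, s in updated.items()}
    return strengths


# Hypothetical tasting results: each tuple is (preferred beer, other beer).
toy_results = (
    [("IPA", "Stout")] * 3 + [("Stout", "IPA")]
    + [("Stout", "Lager")] * 3 + [("Lager", "Stout")]
    + [("IPA", "Lager")] * 3 + [("Lager", "IPA")]
)
toy_strengths = fit_bradley_terry(toy_results)
ranking = sorted(toy_strengths, key=toy_strengths.get, reverse=True)
print(ranking)
```

Sorting by the fitted strength gives the full ranking, even though no taster ever ranked all three beers at once.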
For my case, I decided to take the 2018-2019 NBA season play by play data and look only at centers.
I decided to break the analysis into two pieces: Python to scrape the data and create a csv file of winners and losers of each jump ball, and R to create the Bradley-Terry model.
Process the Data
Luckily I found an archive of play-by-play data that had already been scraped from here
I then built a dictionary of each player's name to their team on the 2018-2019 roster. I noticed that this roster appeared to only be the roster at the start of the season, and did not account for trades (which we will see later).
from selenium import webdriver
import pickle
import os
import csv

team_abbrev = ['TOR', 'BOS', 'NYK', 'PHI', 'BRK',
               'DEN', 'OKC', 'UTA', 'POR', 'MIN',
               'MIL', 'IND', 'CHI', 'DET', 'CLE',
               'LAL', 'LAC', 'PHO', 'SAC', 'GSW',
               'MIA', 'ORL', 'CHO', 'WAS', 'ATL',
               'HOU', 'DAL', 'MEM', 'SAS', 'NOP']

working_dir = os.getcwd()
working_dir = working_dir + "/data"

# For a given season, scrape the rosters
# https://www.basketball-reference.com/teams/NOP/2019.html
player_height = dict()
player_team = dict()

# Create webdriver (executable_path and find_element_by_* are Selenium 3 style calls)
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='/Users/jamesli/PycharmProjects/chromedriver', chrome_options=option)

year = '2019'
for team in team_abbrev:
    url = 'https://www.basketball-reference.com/teams/' + team + '/' + year + '.html'
    browser.get(url)
    table_rows = browser.find_element_by_id('all_roster').find_element_by_class_name('sortable').find_elements_by_xpath("tbody/tr")
    for row in table_rows:
        if row.get_attribute('class') != 'thead':
            # Only process rows that contain data cells
            if len(row.find_elements_by_xpath('td[1]')) > 0:
                name = row.find_element_by_xpath('td[1]').get_attribute('csk')
                pos = row.find_element_by_xpath('td[2]').text
                height = row.find_element_by_xpath('td[3]').text
                weight = row.find_element_by_xpath('td[4]').text
                dob = row.find_element_by_xpath('td[5]').text
                # Build the abbreviated name, e.g. "J. Embiid"
                splitname = name.split(',')
                lastname = splitname[0]
                firstname = splitname[-1]
                abbrev_name = firstname[0] + ". " + lastname
                player_team[abbrev_name] = team
                player_height[abbrev_name] = height

# Save the player-to-team mapping for the later steps
with open(working_dir + '/team_roster' + year + '.pickle', 'wb') as handle:
    pickle.dump(player_team, handle, protocol=pickle.HIGHEST_PROTOCOL)
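The name-abbreviation step in the scraper above can be pulled out and checked on its own. This is a small sketch, assuming the `csk` attribute holds names as "Lastname,Firstname" (which is what the `split(',')` in the scraper implies); the helper name `abbreviate_name` is hypothetical.

```python
def abbreviate_name(csk_name):
    """Turn 'Lastname,Firstname' into 'F. Lastname', mirroring the scraper logic."""
    parts = csk_name.split(",")
    last_name = parts[0].strip()
    first_name = parts[-1].strip()
    return first_name[0] + ". " + last_name


print(abbreviate_name("Embiid,Joel"))
print(abbreviate_name("Cauley-Stein,Willie"))
```

One design caveat: two players who share a first initial and last name would map to the same key, so the later dictionary entry would overwrite the earlier one.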
I then built two functions: one to test whether the code could determine who won the jump ball, and another to actually write the winner to a csv file. I found a number of cases where the code could not determine who won. Looking more closely at these cases, this appears to be due to trades that occurred during the season.
# Start the output file with a header row
with open(working_dir + '/jump_ball_processed.csv', 'w') as fd:
    fd.write("Winner,Loser\n")

def write_win(player1, player2, poss_player):
    # Player 1 won the jump ball
    if (player_team[player1] == player_team[poss_player]):
        with open(working_dir + '/jump_ball_processed.csv', 'a') as fd:
            fd.write(player1 + "," + player2 + '\n')
    # Player 2 won the jump ball
    elif (player_team[player2] == player_team[poss_player]):
        with open(working_dir + '/jump_ball_processed.csv', 'a') as fd:
            fd.write(player2 + "," + player1 + '\n')
    else:
        # Neither player won
        print('Neither player won: ' + player1 + ' ' + player2 + ' Possessing Player = ' + poss_player)

def test_win(player1, player2, poss_player):
    # Record the outcome in the module-level winning_players / losing_players lists
    if (player_team[player1] == player_team[poss_player]):
        winning_players.append(player1)
        losing_players.append(player2)
    elif (player_team[player2] == player_team[poss_player]):
        winning_players.append(player2)
        losing_players.append(player1)
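The winner is inferred by matching the team of the player who gains possession against the teams of the two jumpers. The sketch below shows a more defensive variant of that same idea, not the code used for this post: the function name `jump_ball_winner` and the small roster are illustrative assumptions, and it returns `None` whenever the roster cannot settle the outcome (for example, a player missing from the roster, or both jumpers mapped to the same team after a trade).

```python
def jump_ball_winner(player1, player2, poss_player, player_team):
    """Return (winner, loser), or None if the roster cannot decide."""
    poss_team = player_team.get(poss_player)
    team1 = player_team.get(player1)
    team2 = player_team.get(player2)
    # Missing roster entries or identical teams make the outcome ambiguous
    if poss_team is None or team1 is None or team2 is None or team1 == team2:
        return None
    if team1 == poss_team:
        return (player1, player2)
    if team2 == poss_team:
        return (player2, player1)
    return None


# Illustrative roster fragment in the same shape as player_team
toy_roster = {"L. Aldridge": "SAS", "P. Mills": "SAS", "J. McGee": "LAL"}
print(jump_ball_winner("L. Aldridge", "J. McGee", "P. Mills", toy_roster))
print(jump_ball_winner("L. Aldridge", "J. McGee", "Unknown Player", toy_roster))
```

Returning `None` instead of raising makes it easy to count how many jump balls are dropped because of roster gaps.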
I then went through the play-by-play data and looked for each jump ball at the beginning of play or at the beginning of an OT period. I then filtered to keep only the players who participated in more than 30 jump balls and had occurrences of both winning and losing a jump ball. The final result was 31 players, essentially the starting centers in the NBA.
# Load the player-to-team mapping saved by the roster scraper
for file in os.listdir(working_dir):
    if file.endswith(".pickle") and "team_roster" in file:
        with open(working_dir + "/" + file, 'rb') as handle:
            player_team = pickle.load(handle)

winning_players = list()
losing_players = list()

# First pass: record the winner and loser of every opening and overtime jump ball
with open(working_dir + "/nba_18_19.csv") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"')
    # next(reader, None)  # skip the headers
    for row in reader:
        # Column 10: time remaining in the quarter (seconds)
        # Column 15: play description, e.g.
        # Jump ball: L.Aldridge vs. J.McGee (P.Mills gains possession)
        if row[10] == '720' and "Jump ball" in row[15]:
            # We have a jump ball at the beginning of play
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            test_win(player1, player2, poss_player)
        if row[10] == '300' and "Jump ball" in row[15]:
            # Overtime
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            test_win(player1, player2, poss_player)

# Second pass: write only the jump balls between players who each have at least
# one win, at least one loss, and more than 30 jump balls in total
with open(working_dir + "/nba_18_19.csv") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"')
    # next(reader, None)  # skip the headers
    for row in reader:
        if row[10] == '720' and "Jump ball" in row[15]:
            # We have a jump ball at the beginning of play
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            if player1 in winning_players and player2 in winning_players and player1 in losing_players and player2 in losing_players:
                if (winning_players.count(player1) + losing_players.count(player1) > 30) and \
                        (winning_players.count(player2) + losing_players.count(player2) > 30):
                    write_win(player1, player2, poss_player)
        if row[10] == '300' and "Jump ball" in row[15]:
            # Overtime
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            if player1 in winning_players and player2 in winning_players and player1 in losing_players and player2 in losing_players:
                if (winning_players.count(player1) + losing_players.count(player1) > 30) and \
                        (winning_players.count(player2) + losing_players.count(player2) > 30):
                    write_win(player1, player2, poss_player)
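The parsing and filtering logic above is repeated in four places, so it may help to see it factored into two small helpers. This is a sketch under assumptions: the helper names `parse_jump_ball` and `eligible_players` are hypothetical, the regular expression is based only on the example description shown in the code comment, and the eligibility rule restates the filter described above (at least one win, at least one loss, more than 30 jump balls).

```python
import re
from collections import Counter

# Pattern inferred from the example "Jump ball: A vs. B (C gains possession)"
JUMP_BALL_RE = re.compile(
    r"Jump ball: (?P<player1>.+?) vs\. (?P<player2>.+?) "
    r"\((?P<possessor>.+?) gains possession\)"
)


def parse_jump_ball(description):
    """Return (player1, player2, possessing_player), or None if no match."""
    match = JUMP_BALL_RE.search(description)
    if match is None:
        return None
    return match.group("player1"), match.group("player2"), match.group("possessor")


def eligible_players(outcomes, min_jump_balls=30):
    """Players with a win, a loss, and more than min_jump_balls jump balls.

    outcomes is an iterable of (winner, loser) pairs.
    """
    outcomes = list(outcomes)
    wins = Counter(winner for winner, _ in outcomes)
    losses = Counter(loser for _, loser in outcomes)
    return {
        player
        for player in wins
        if losses[player] > 0 and wins[player] + losses[player] > min_jump_balls
    }


print(parse_jump_ball("Jump ball: L.Aldridge vs. J.McGee (P.Mills gains possession)"))

# Toy outcomes with a low threshold, just to show the filter
toy_outcomes = [("A", "B"), ("B", "A"), ("A", "C")]
print(eligible_players(toy_outcomes, min_jump_balls=1))
```

Using `Counter` also avoids calling `list.count` for every row, which scans the whole list each time.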
Load our processed data from Python into R
I loaded the jump ball data produced by the Python step. The data has 727 observations, each one recording the winner and loser of a jump ball.
jump_ball_2018_2019 = read.csv('./jump_ball_data/jump_ball_processed.csv', header=TRUE)
head(jump_ball_2018_2019)
## Winner Loser
## 1 J. Embiid A. Horford
## 2 B. Lopez C. Zeller
## 3 A. Drummond J. Allen
## 4 A. Davis C. Capela
## 5 M. Turner M. Gasol
## 6 M. Gortat N. Jokić
Bradley Terry Model
Then I created a Bradley-Terry model, using Anthony Davis as the reference category.
library(BradleyTerry2)
result <- rep(1, length(jump_ball_2018_2019$Winner))
BTmodel <- BTm(result, Winner, Loser, br = TRUE, data = jump_ball_2018_2019, refcat = 'A. Davis')
summary(BTmodel)
##
## Call:
## BTm(outcome = result, player1 = Winner, player2 = Loser, refcat = "A. Davis",
## data = jump_ball_2018_2019, br = TRUE)
##
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## ..A. Drummond 0.49125 0.44488 1.104 0.269492
## ..A. Horford -0.79783 0.45899 -1.738 0.082170 .
## ..B. Lopez 0.40437 0.43270 0.935 0.350035
## ..C. Capela -0.50013 0.41768 -1.197 0.231152
## ..C. Zeller -2.11943 0.58908 -3.598 0.000321 ***
## ..D. Ayton -1.70565 0.48127 -3.544 0.000394 ***
## ..D. Dedmon -0.41477 0.48310 -0.859 0.390583
## ..D. Jordan -0.41027 0.43639 -0.940 0.347145
## ..H. Whiteside -0.42475 0.47221 -0.899 0.368394
## ..I. Zubac -1.47032 0.59831 -2.457 0.013992 *
## ..J. Allen 0.06548 0.43464 0.151 0.880249
## ..J. Embiid -0.45585 0.44603 -1.022 0.306779
## ..J. McGee 0.39295 0.46956 0.837 0.402677
## ..J. Nurkić 0.38518 0.45037 0.855 0.392403
## ..K. Durant -0.56366 0.49581 -1.137 0.255608
## ..K. Towns -0.23520 0.42324 -0.556 0.578400
## ..L. Aldridge -1.39566 0.48918 -2.853 0.004330 **
## ..M. Gasol -1.04732 0.44163 -2.372 0.017716 *
## ..M. Gortat 0.41836 0.49161 0.851 0.394767
## ..M. Turner -2.27198 0.53542 -4.243 2.2e-05 ***
## ..N. Jokić -1.05886 0.46413 -2.281 0.022526 *
## ..N. Vučević -1.13943 0.44629 -2.553 0.010676 *
## ..R. Gobert 0.21824 0.41909 0.521 0.602550
## ..R. Lopez 0.18075 0.57426 0.315 0.752954
## ..S. Adams -0.10006 0.42736 -0.234 0.814882
## ..S. Ibaka -0.72202 0.46718 -1.545 0.122230
## ..T. Bryant -1.51415 0.53497 -2.830 0.004650 **
## ..T. Thompson -1.28290 0.53191 -2.412 0.015871 *
## ..W. Carter -1.45579 0.53086 -2.742 0.006101 **
## ..W. Cauley-Stein -0.90601 0.42824 -2.116 0.034372 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 903.36 on 727 degrees of freedom
## Residual deviance: 834.20 on 697 degrees of freedom
## Penalized deviance: 771.7662
## AIC: 894.2
Bradley Terry Model Interpretation
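To determine the probability of Player 1 winning a jump ball vs Player 2 in the NBA, I use the formula below, where beta1 and beta2 are the estimated coefficients (abilities) of Player 1 and Player 2, and the reference player has a coefficient of 0:
P(Player 1 beats Player 2) = exp(beta1) / (exp(beta1) + exp(beta2))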
In this case, the reference category is Anthony Davis, so I calculated the probability of each of the other centers in the league beating Anthony Davis. The results below show that Myles Turner (9% chance to beat AD) was the worst jump ball player and Andre Drummond (62% chance to beat AD) was the best jump ball player in the NBA in the 2018-2019 season.
I also performed a likelihood ratio test of the model vs the null model (each center has an equal chance of winning the jump ball). The likelihood ratio test rejects the null hypothesis, so the Bradley-Terry model fits the data better than the null model.
exp(BTabilities(BTmodel)[,1])/(1+exp(BTabilities(BTmodel)[,1]))
## M. Turner C. Zeller D. Ayton T. Bryant I. Zubac
## 0.09346993 0.10722246 0.15372824 0.18032500 0.18689387
## W. Carter L. Aldridge T. Thompson N. Vučević N. Jokić
## 0.18911184 0.19850627 0.21705713 0.24242497 0.25752774
## M. Gasol W. Cauley-Stein A. Horford S. Ibaka K. Durant
## 0.25973975 0.28781698 0.31048989 0.32694920 0.36270135
## C. Capela J. Embiid H. Whiteside D. Dedmon D. Jordan
## 0.37751125 0.38797152 0.39538103 0.39776841 0.39884785
## K. Towns S. Adams A. Davis J. Allen R. Lopez
## 0.44146843 0.47500604 0.50000000 0.51636417 0.54506388
## R. Gobert J. Nurkić J. McGee B. Lopez M. Gortat
## 0.55434387 0.59512271 0.59699267 0.59973737 0.60309034
## A. Drummond
## 0.62040007
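The same formula works for any pair of players, not just matchups against the reference player. Here is a minimal Python sketch using coefficients copied from the model summary above (rounded as printed there); the function name `win_probability` is an assumption for illustration.

```python
import math


def win_probability(beta1, beta2=0.0):
    """P(player 1 beats player 2) under Bradley-Terry.

    beta2 defaults to 0.0, the coefficient of the reference player (A. Davis).
    """
    return math.exp(beta1) / (math.exp(beta1) + math.exp(beta2))


# Coefficients as printed in the model summary
drummond = 0.49125
turner = -2.27198

print(round(win_probability(drummond), 4))          # Drummond vs. Davis
print(round(win_probability(turner), 4))            # Turner vs. Davis
print(round(win_probability(drummond, turner), 4))  # Drummond vs. Turner
```

The first two values should agree with the R output above up to rounding of the coefficients; the third is a matchup that the reference-based table does not show directly.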
1 - pchisq(BTmodel$null.deviance - BTmodel$deviance, BTmodel$df.null - BTmodel$df.residual)
## [1] 6.270164e-05
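For readers following along in Python, the likelihood ratio test above can be reproduced from the deviances in the model summary. The sketch below uses only the standard library and relies on the closed-form chi-square tail probability that holds when the degrees of freedom are even (here 727 - 697 = 30); the function name `chi2_sf_even_df` is hypothetical, and the deviances are the rounded values printed in the summary, so the result should only match the R value approximately.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only.

    Uses P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values as printed in the model summary
null_deviance, residual_deviance = 903.36, 834.20
df_null, df_residual = 727, 697

p_value = chi2_sf_even_df(null_deviance - residual_deviance, df_null - df_residual)
print(p_value)
```

With an odd number of degrees of freedom this shortcut does not apply, and a statistics library would be the simpler choice.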
Conclusion
Note that I had also seen some analysis done before using an Elo model here
Both this model and the Elo model identified Andre Drummond, Rudy Gobert and JaVale McGee as some of the best jump ball players.
In the future, I would work to resolve the data for players who were traded, as well as incorporate more seasons of data to improve the model.