Using Bradley-Terry to Model NBA Jump Balls

Posted on August 16, 2020 in NBA-Basketball

Premise and Background

I was curious whether one could use a Bradley-Terry model to estimate the probability of an NBA center winning a jump ball. A Bradley-Terry model is a type of logistic regression fit to paired comparisons between individuals. A common explanation is that it may be difficult for a person to rank many beers at a brewery all at once, but a person can sample two beers at a time and say which of the pair they prefer. The Bradley-Terry model can then use those pairwise results to create a full ranking.
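To make the pairwise idea concrete, here is a minimal Python sketch that turns a handful of made-up beer comparisons into a full ranking. It uses simple iterative (MM-style) updates for the plain maximum-likelihood Bradley-Terry strengths, which is an assumption of this sketch and not the bias-reduced fit from the R package used later in the post. The beer names, the data, and the function names are all hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every item has at least one win and one loss; otherwise the
    maximum-likelihood estimate is not finite.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Strengths are only identified up to scale, so normalise to sum to 1.
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Probability that a is preferred over b under the fitted model."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical taste-test results: (preferred beer, other beer)
beer_results = [
    ("stout", "lager"), ("stout", "lager"), ("lager", "stout"),
    ("stout", "ipa"), ("stout", "ipa"), ("ipa", "stout"),
    ("ipa", "lager"), ("ipa", "lager"), ("lager", "ipa"),
]
beer_strength = fit_bradley_terry(beer_results)
ranking = sorted(beer_strength, key=beer_strength.get, reverse=True)
print(ranking)
```

Each beer was only ever compared against one other beer at a time, yet the fitted strengths give an ordering over all three.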

For my case, I decided to take the 2018-2019 NBA season play by play data and look only at centers.

I decided to break the analysis into two pieces:

Python: Scrape the data and create a csv file of winners and losers of each jump ball
R: Create the Bradley-Terry model

Process the Data

Luckily, I found an archive of play-by-play data that had already been scraped from here.

I then built a dictionary of each player's name to their team on the 2018-2019 roster. I noticed that this roster appeared to only be the roster at the start of the season, and did not account for trades (which we will see later).

from selenium import webdriver
import pickle
import os
import csv

team_abbrev = ['TOR', 'BOS', 'NYK', 'PHI', 'BRK',
               'DEN', 'OKC', 'UTA', 'POR', 'MIN',
               'MIL', 'IND', 'CHI', 'DET', 'CLE',
               'LAL', 'LAC', 'PHO', 'SAC', 'GSW',
               'MIA', 'ORL', 'CHO', 'WAS', 'ATL',
               'HOU', 'DAL', 'MEM', 'SAS', 'NOP']

working_dir = os.getcwd()
working_dir = working_dir + "/data"

# For a given season, scrape the rosters
# https://www.basketball-reference.com/teams/NOP/2019.html
player_height = dict()
player_team = dict()
# Create webdriver
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

browser = webdriver.Chrome(executable_path='/Users/jamesli/PycharmProjects/chromedriver', chrome_options=option)
year = '2019'
for team in team_abbrev:
    url = 'https://www.basketball-reference.com/teams/' + team + '/' + year + '.html'
    browser.get(url)
    table_rows = browser.find_element_by_id('all_roster').find_element_by_class_name('sortable').find_elements_by_xpath("tbody/tr")
    for row in table_rows:
        if row.get_attribute('class') != 'thead':
            # Get the box score url
            if len(row.find_elements_by_xpath('td[1]')) > 0:
                name = row.find_element_by_xpath('td[1]').get_attribute('csk')
                pos = row.find_element_by_xpath('td[2]').text
                height = row.find_element_by_xpath('td[3]').text
                weight = row.find_element_by_xpath('td[4]').text
                dob = row.find_element_by_xpath('td[5]').text

                splitname = name.split(',')
                lastname = splitname[0]
                firstname = splitname[-1]
                abbrev_name = firstname[0] + ". " + lastname

                player_team[abbrev_name] = team
                player_height[abbrev_name] = height

with open(working_dir + '/team_roster' + year + '.pickle', 'wb') as handle:
    pickle.dump(player_team, handle, protocol=pickle.HIGHEST_PROTOCOL)
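One thing worth checking with this approach is that keys of the form "J. Embiid" are not guaranteed to be unique across the league, so a later row can silently overwrite an earlier one in the dictionary. The sketch below isolates the name-abbreviation step and reports any such collisions. It assumes the `csk` attribute has the form "Lastname,Firstname", as the scraping code above implies. The helper names and the "Doe" rows are hypothetical.

```python
def abbreviate_name(csk_value):
    """Turn 'Lastname,Firstname' into 'F. Lastname'."""
    last, _, first = csk_value.partition(",")
    first = first.strip()
    last = last.strip()
    if not first:
        # No comma found: fall back to the raw value
        return last
    return f"{first[0]}. {last}"


def build_roster(rows):
    """Build {abbreviated name: team} and collect names seen on two teams.

    rows is an iterable of (csk_value, team) pairs.
    """
    player_team = {}
    collisions = set()
    for csk_value, team in rows:
        key = abbreviate_name(csk_value)
        if key in player_team and player_team[key] != team:
            collisions.add(key)
        player_team[key] = team
    return player_team, collisions


# Illustrative rows only; the two "Doe" entries are made up.
sample_rows = [
    ("Embiid,Joel", "PHI"),
    ("Doe,John", "AAA"),
    ("Doe,Jane", "BBB"),
]
roster, ambiguous = build_roster(sample_rows)
print(roster)
print(ambiguous)
```

Any name that shows up in the collision set would need a longer key (for example, full name) before its jump balls could be attributed reliably.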

I then built two functions: one to test whether the code could determine who won the jump ball, and another to actually write the winner to a csv file. I found a number of cases where the code could not determine who won. Looking more closely at these cases, they appear to be caused by trades that occurred during the season.

with open(working_dir + '/jump_ball_processed.csv', 'w') as fd:
    fd.write("Winner,Loser\n")

def write_win(player1, player2, poss_player):
    # Player 1 won the jump ball
    if (player_team[player1] == player_team[poss_player]):
        with open(working_dir + '/jump_ball_processed.csv', 'a') as fd:
            fd.write(player1 + "," + player2 + '\n')
    # Player 2 won the jump ball
    elif (player_team[player2] == player_team[poss_player]):
        with open(working_dir + '/jump_ball_processed.csv', 'a') as fd:
            fd.write(player2 + "," + player1 + '\n')
    else:
        # Neither player won
        print('Neither player won: ' + player1 + ' ' + player2 + ' Possessing Player = ' + poss_player)

def test_win(player1, player2, poss_player):
    if (player_team[player1] == player_team[poss_player]):
        winning_players.append(player1)
        losing_players.append(player2)
    elif (player_team[player2] == player_team[poss_player]):
        winning_players.append(player2)
        losing_players.append(player1)
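Both functions above look players up with `player_team[...]`, which raises a `KeyError` if a name is missing from the roster dictionary. A possible variant, sketched below, uses `dict.get` and returns `None` whenever the winner cannot be determined, so that unresolved cases (such as traded players) can be counted and skipped. The function name `decide_winner` and the toy roster are assumptions of this sketch, not part of the original code.

```python
def decide_winner(player1, player2, poss_player, player_team):
    """Return (winner, loser), or None if the roster cannot decide.

    The jumper whose team matches the team of the player who gained
    possession is treated as the winner, as in write_win and test_win.
    """
    poss_team = player_team.get(poss_player)
    if poss_team is None:
        return None
    if player_team.get(player1) == poss_team:
        return player1, player2
    if player_team.get(player2) == poss_team:
        return player2, player1
    return None


# Toy roster for illustration only
toy_roster = {"L. Aldridge": "SAS", "P. Mills": "SAS", "J. McGee": "LAL"}

resolved = decide_winner("L. Aldridge", "J. McGee", "P. Mills", toy_roster)
unresolved = decide_winner("L. Aldridge", "J. McGee", "X. Unknown", toy_roster)
print(resolved, unresolved)
```

Counting how often `None` comes back gives a direct measure of how many jump balls are lost to roster gaps.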

I then went through the play-by-play data and looked for each jump ball at the beginning of a game or at the beginning of an overtime period. I then filtered to keep only the centers who participated in more than 30 jump balls and who had both won and lost at least one jump ball. The final result was all 30 starting centers in the NBA.

for file in os.listdir(working_dir):
    if file.endswith(".pickle") and "team_roster" in file:
        with open(working_dir + "/" + file, 'rb') as handle:
            player_team = pickle.load(handle)

winning_players = list()
losing_players = list()
with open(working_dir + "/nba_18_19.csv") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"')
    # next(reader, None)  # skip the headers
    for row in reader:
        # row[10]: time remaining in the quarter (seconds)
        # row[15]: play description, for example:
        # Jump ball: L.Aldridge vs. J.McGee (P.Mills gains possession)
        if row[10] == '720' and "Jump ball" in row[15]:
            # We have a jump ball at the beginning of play
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            test_win(player1, player2, poss_player)

        if row[10] == '300' and "Jump ball" in row[15]:
            # Overtime
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            test_win(player1, player2, poss_player)


with open(working_dir + "/nba_18_19.csv") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"')
    # next(reader, None)  # skip the headers
    for row in reader:
        if row[10] == '720' and "Jump ball" in row[15]:
            # We have a jump ball at the beginning of play
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            if player1 in winning_players and player2 in winning_players and player1 in losing_players and player2 in losing_players:
                if(winning_players.count(player1) + losing_players.count(player1) > 30) and \
                        (winning_players.count(player2) + losing_players.count(player2) > 30):
                    write_win(player1, player2, poss_player)

        if row[10] == '300' and "Jump ball" in row[15]:
            # Overtime
            play = row[15].replace('Jump ball: ', '')
            arr = play.split(' vs. ')
            player1 = arr[0]
            arr = arr[1].split(' (')
            player2 = arr[0]
            poss_player = arr[1].split(' gains')[0]
            if player1 in winning_players and player2 in winning_players and player1 in losing_players and player2 in losing_players:
                if(winning_players.count(player1) + losing_players.count(player1) > 30) and \
                        (winning_players.count(player2) + losing_players.count(player2) > 30):
                    write_win(player1, player2, poss_player)
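The parsing and filtering steps are repeated four times in the loops above. As a rough sketch of how they could be factored out, the snippet below parses a play description with a regular expression and uses `collections.Counter` to find players who pass the "more than 30 jump balls, with at least one win and one loss" filter. The pattern is based only on the example description in the code comment, so other phrasings in the data may not match. The helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the example:
# "Jump ball: L.Aldridge vs. J.McGee (P.Mills gains possession)"
JUMP_BALL_RE = re.compile(
    r"^Jump ball: (?P<player1>.+?) vs\. (?P<player2>.+?) "
    r"\((?P<poss_player>.+?) gains possession\)"
)


def parse_jump_ball(play):
    """Return (player1, player2, poss_player), or None if the text does not match."""
    match = JUMP_BALL_RE.match(play)
    if match is None:
        return None
    return match.group("player1"), match.group("player2"), match.group("poss_player")


def eligible_players(winners, losers, min_jumps=30):
    """Players with at least one win, one loss, and more than min_jumps jump balls."""
    wins = Counter(winners)
    losses = Counter(losers)
    return {
        player
        for player in wins.keys() & losses.keys()
        if wins[player] + losses[player] > min_jumps
    }


parsed = parse_jump_ball("Jump ball: L.Aldridge vs. J.McGee (P.Mills gains possession)")
print(parsed)

# Tiny made-up tallies, with a low threshold so the filter is visible
demo_winners = ["A", "A", "A", "B", "C"]
demo_losers = ["B", "B", "C", "A", "A"]
print(eligible_players(demo_winners, demo_losers, min_jumps=3))
```

Returning `None` for descriptions that do not match avoids the `IndexError` that the chained `split` calls would raise on an unexpected format.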

Load our processed data from Python into R

I loaded the jump ball data produced by the Python step. The data has 727 observations, each one recording the winner and loser of a single jump ball.

jump_ball_2018_2019 = read.csv('./jump_ball_data/jump_ball_processed.csv', header=TRUE)
head(jump_ball_2018_2019)

##        Winner      Loser
## 1   J. Embiid A. Horford
## 2    B. Lopez  C. Zeller
## 3 A. Drummond   J. Allen
## 4    A. Davis  C. Capela
## 5   M. Turner   M. Gasol
## 6   M. Gortat   N. Jokić

Bradley Terry Model

Then I created a Bradley-Terry model, using Anthony Davis as the reference category.

library(BradleyTerry2)
result <- rep(1, length(jump_ball_2018_2019$Winner))
BTmodel <- BTm(result, Winner, Loser, br = TRUE, data = jump_ball_2018_2019, refcat = 'A. Davis')
summary(BTmodel)

## 
## Call:
## BTm(outcome = result, player1 = Winner, player2 = Loser, refcat = "A. Davis", 
##     data = jump_ball_2018_2019, br = TRUE)
## 
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## ..A. Drummond      0.49125    0.44488   1.104 0.269492    
## ..A. Horford      -0.79783    0.45899  -1.738 0.082170 .  
## ..B. Lopez         0.40437    0.43270   0.935 0.350035    
## ..C. Capela       -0.50013    0.41768  -1.197 0.231152    
## ..C. Zeller       -2.11943    0.58908  -3.598 0.000321 ***
## ..D. Ayton        -1.70565    0.48127  -3.544 0.000394 ***
## ..D. Dedmon       -0.41477    0.48310  -0.859 0.390583    
## ..D. Jordan       -0.41027    0.43639  -0.940 0.347145    
## ..H. Whiteside    -0.42475    0.47221  -0.899 0.368394    
## ..I. Zubac        -1.47032    0.59831  -2.457 0.013992 *  
## ..J. Allen         0.06548    0.43464   0.151 0.880249    
## ..J. Embiid       -0.45585    0.44603  -1.022 0.306779    
## ..J. McGee         0.39295    0.46956   0.837 0.402677    
## ..J. Nurkić        0.38518    0.45037   0.855 0.392403    
## ..K. Durant       -0.56366    0.49581  -1.137 0.255608    
## ..K. Towns        -0.23520    0.42324  -0.556 0.578400    
## ..L. Aldridge     -1.39566    0.48918  -2.853 0.004330 ** 
## ..M. Gasol        -1.04732    0.44163  -2.372 0.017716 *  
## ..M. Gortat        0.41836    0.49161   0.851 0.394767    
## ..M. Turner       -2.27198    0.53542  -4.243  2.2e-05 ***
## ..N. Jokić        -1.05886    0.46413  -2.281 0.022526 *  
## ..N. Vučević      -1.13943    0.44629  -2.553 0.010676 *  
## ..R. Gobert        0.21824    0.41909   0.521 0.602550    
## ..R. Lopez         0.18075    0.57426   0.315 0.752954    
## ..S. Adams        -0.10006    0.42736  -0.234 0.814882    
## ..S. Ibaka        -0.72202    0.46718  -1.545 0.122230    
## ..T. Bryant       -1.51415    0.53497  -2.830 0.004650 ** 
## ..T. Thompson     -1.28290    0.53191  -2.412 0.015871 *  
## ..W. Carter       -1.45579    0.53086  -2.742 0.006101 ** 
## ..W. Cauley-Stein -0.90601    0.42824  -2.116 0.034372 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 903.36  on 727  degrees of freedom
## Residual deviance: 834.20  on 697  degrees of freedom
## Penalized deviance: 771.7662 
## AIC:  894.2

Bradley Terry Model Interpretation

To determine the probability of Player 1 winning a jump ball against Player 2, I use the formula below, where beta1 and beta2 are the estimated ability coefficients of Player 1 and Player 2:

exp(beta1) / (exp(beta1) + exp(beta2))

In this case, the reference category is Anthony Davis, whose coefficient is fixed at 0, so I calculated the probability of each of the other centers in the league beating Anthony Davis. The results below show that Myles Turner (9% chance to beat AD) was the worst jump ball player and Andre Drummond (62% chance to beat AD) was the best jump ball player in the NBA in the 2018-2019 season.
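As a quick check of the formula, the Python sketch below plugs in three coefficients copied from the model summary above (Anthony Davis is the reference, so his value is 0). The function name is my own, and the calculation is just the formula rewritten in an algebraically equivalent form.

```python
import math

# Estimates copied from the summary table; A. Davis is the reference category.
abilities = {
    "A. Davis": 0.0,
    "A. Drummond": 0.49125,
    "M. Turner": -2.27198,
}


def win_probability(beta1, beta2):
    """exp(beta1) / (exp(beta1) + exp(beta2)), written as a logistic function."""
    return 1.0 / (1.0 + math.exp(beta2 - beta1))


drummond_vs_davis = win_probability(abilities["A. Drummond"], abilities["A. Davis"])
turner_vs_davis = win_probability(abilities["M. Turner"], abilities["A. Davis"])
drummond_vs_turner = win_probability(abilities["A. Drummond"], abilities["M. Turner"])

print(round(drummond_vs_davis, 2))
print(round(turner_vs_davis, 2))
print(round(drummond_vs_turner, 2))
```

The same function works for any pair of players, not only matchups against the reference player, because only the difference between the two coefficients matters.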

I also performed a likelihood ratio test of the model against the null model (each center has an equal chance of winning the jump ball). The likelihood ratio test rejects the null hypothesis, so the Bradley-Terry model is an improvement over the null model.

sort(exp(BTabilities(BTmodel)[,1])/(1+exp(BTabilities(BTmodel)[,1])))

##       M. Turner       C. Zeller        D. Ayton       T. Bryant        I. Zubac 
##      0.09346993      0.10722246      0.15372824      0.18032500      0.18689387 
##       W. Carter     L. Aldridge     T. Thompson      N. Vučević        N. Jokić 
##      0.18911184      0.19850627      0.21705713      0.24242497      0.25752774 
##        M. Gasol W. Cauley-Stein      A. Horford        S. Ibaka       K. Durant 
##      0.25973975      0.28781698      0.31048989      0.32694920      0.36270135 
##       C. Capela       J. Embiid    H. Whiteside       D. Dedmon       D. Jordan 
##      0.37751125      0.38797152      0.39538103      0.39776841      0.39884785 
##        K. Towns        S. Adams        A. Davis        J. Allen        R. Lopez 
##      0.44146843      0.47500604      0.50000000      0.51636417      0.54506388 
##       R. Gobert       J. Nurkić        J. McGee        B. Lopez       M. Gortat 
##      0.55434387      0.59512271      0.59699267      0.59973737      0.60309034 
##     A. Drummond 
##      0.62040007

1 - pchisq(BTmodel$null.deviance - BTmodel$deviance, BTmodel$df.null - BTmodel$df.residual)

## [1] 6.270164e-05
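For readers who want to reproduce the p-value outside of R, here is a small Python sketch that uses the rounded deviances and degrees of freedom printed in the model summary. It relies on the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom (30 here), so it needs only the standard library. The function name is hypothetical, and because the inputs are rounded, the result agrees with the R output only approximately.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the identity exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half_x = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half_x / k
        total += term
    return math.exp(-half_x) * total


# Values as printed in the summary above (rounded)
null_deviance = 903.36
residual_deviance = 834.20
df_difference = 727 - 697

p_value = chi2_sf_even_df(null_deviance - residual_deviance, df_difference)
print(p_value)
```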

Conclusion

Note that I had also seen a previous analysis that used an Elo model here.

Both this model and the Elo model identified Andre Drummond, Rudy Gobert and JaVale McGee as some of the best jump ball players.

In the future, I would work to resolve the data for players who were traded, as well as incorporate more seasons of data to improve the model.