FiveThirtyEight is one of my favourite places to read thoughtful analysis, and see data visualizations. I was inspired by one of their posts on 'how your favourite baseball team blows its money'. As an avid basketball fan, I recreated this graphic with NBA data.
This was incredible practice with Matplotlib. Thank you to Amy Gordon for her Github for the style guide.
This chart illustrates the relationship between team spending and win percentage for each basketball season from 2001 - 2015. Essentially, the steeper the curve, the better the team was maximizing value.
More money, more wins. There is a rough trend suggesting that higher team salaries lead to a greater winning percentage. This relationship is particularly strong for the Boston Celtics and Golden State Warriors.
On the other hand, this relationship is flat for others. For example, irrespective of team salary, the San Antonio Spurs is a winning organization. They are better at managing its roster and payroll versus others. They pride themselves with developing talent organically with scouting, coaching, and player development. Rather than overpaying Superstar players in the free agency market, they sign undervalued players (players who contribute more to win share, relatively their salary would suggest). This is similar to the thinking in MoneyBall, and has been widely studied in sports. Coupled with the fact that there is usually a salary cap limit in professional sports, this becomes a very interesting dynamic.
Worst out of all the teams, New York Knicks tend to overpay players, which don't yield any marginal wins on the court and underperform relative to expectations.
Game Data: https://www.kaggle.com/fivethirtyeight/fivethirtyeight-nba-elo-dataset
Salary data (exported to CSV): https://hoopshype.com/salaries/
For those interested, keep reading for data wrangling and snippets of code.
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
# blog html visualization
from IPython.display import HTML
# plt.rcParams['figure.dpi'] = 50
# %config InlineBackend.figure_format = 'retina'
%config InlineBackend.figure_format = 'svg'
Salary data
raw=pd.read_csv('./data/nba_team_salary.csv')
# reorganize columns in chronological order
salary = pd.concat([raw["team"], raw[raw.columns[:0:-1]]], axis=1)
# change NaN to zero for no salary expansion teams: Charlotte 2003, 2004; New Orleans 2001, 2002
salary.fillna("$0", inplace=True)
# rename city to mascot name to merge later with games dataset
teams = {"team":["Hawks","Celtics","Nets","Hornets","Bulls","Cavaliers","Mavericks","Nuggets",
"Pistons","Warriors","Rockets","Pacers","Clippers","Lakers","Grizzlies","Heat",
"Bucks","Timberwolves","Pelicans","Knicks","Thunder","Magic","Sixers","Suns",
"Trailblazers","Kings","Spurs","Raptors","Jazz","Wizards"]}
salary = pd.concat([pd.DataFrame(teams),salary.iloc[:,1:16]],axis=1)
# transform wide to long data for plotting
# 450 columns: 30 teams x 15 years
long_salary = pd.melt(salary, id_vars='team', var_name='year', value_name='salary')
long_salary.head()
# year into int
long_salary["year"] = long_salary["year"].apply(lambda x: int(x))
# salary into int
long_salary["salary"] = long_salary["salary"].apply(lambda x: int(x.replace("$","").replace(",","")))
To make the data more comparable year over year, let's standardize the salaries. This function calculates how many standard deviations a team's salary is from the league average for a given year and returns that value.
def z_salary(df,year,salary):
year_mean = df[(df['year']==year)]['salary'].mean()
year_std = df[(df['year']==year)]['salary'].std()
return (salary - year_mean)/year_std
long_salary['z_salary'] = long_salary.apply(lambda row: z_salary(long_salary, row["year"], row["salary"]), axis=1)
HTML(long_salary.head().to_html(classes="table table-responsive table-striped table-bordered"))
team | year | salary | z_salary | |
---|---|---|---|---|
0 | Hawks | 2001 | 57438766 | -0.689421 |
1 | Celtics | 2001 | 75244004 | 0.086161 |
2 | Nets | 2001 | 100821283 | 1.200288 |
3 | Hornets | 2001 | 67881439 | -0.234546 |
4 | Bulls | 2001 | 43399261 | -1.300970 |
# export to csv
long_salary.to_csv("./data/nba_team_salary_final.csv")
Games data
From this NBA Elo dataset from Kaggle on FiveThirtyEight, I'll need to tally the following for each season. * G: Games play * W: Games won * L: Games lost
In the future, I'll be web scraping, which will make this easier.
games = pd.read_csv('./data/nba_elo.csv')
# I'll only look at the regular season to keep n consistent across teams
games = games[games["is_playoffs"] == 0]
# This dataset has a lot of extraneous information, so I'll focus on the columns of interest.
games = games[["year_id","date_game","team_id","fran_id","opp_id","opp_fran","game_result"]]
# This dataset is from 1947-2019, so I'll filter for years 2001 to 2015 to match our Salary data
games = games[(games["year_id"] >= 2001) & (games["year_id"] <= 2015)]
HTML(games.head().to_html(classes="table table-responsive table-striped table-bordered"))
year_id | date_game | team_id | fran_id | opp_id | opp_fran | game_result | |
---|---|---|---|---|---|---|---|
87750 | 2001 | 10/31/2000 | ATL | Hawks | CHH | Pelicans | L |
87751 | 2001 | 10/31/2000 | CHH | Pelicans | ATL | Hawks | W |
87752 | 2001 | 10/31/2000 | CHI | Bulls | SAC | Kings | L |
87753 | 2001 | 10/31/2000 | SAC | Kings | CHI | Bulls | W |
87754 | 2001 | 10/31/2000 | DAL | Mavericks | MIL | Bucks | W |
# reset record to beginning with list comphrension
record = {team:{"G":0,"W":0,"L":0} for team in teams}
# Each game shows up for the home and away team, so I'll only need to run control flow for when they lose
# Repeat for each year
def create_record(year):
for i, row in games[games["year_id"]==year].iterrows():
if row.game_result == "L":
record[row.fran_id]["G"] += 1
record[row.opp_fran]["G"] += 1
record[row.fran_id]["L"] += 1
record[row.opp_fran]["W"] += 1
return record
# set up list of each team and each year
teams = sorted(list(games.fran_id.unique()))
all_years = list(games.year_id.unique())
# now repeat for all the years in a dictionary object
seasons = {}
for year in all_years:
record = {team:{"G":0,"W":0,"L":0} for team in teams}
seasons[year] = create_record(year)
create_record(2001)
{'Bucks': {'G': 164, 'W': 93, 'L': 71},
'Bulls': {'G': 164, 'W': 65, 'L': 99},
'Cavaliers': {'G': 164, 'W': 83, 'L': 81},
'Celtics': {'G': 164, 'W': 76, 'L': 88},
'Clippers': {'G': 164, 'W': 87, 'L': 77},
'Grizzlies': {'G': 164, 'W': 78, 'L': 86},
'Hawks': {'G': 164, 'W': 85, 'L': 79},
'Heat': {'G': 164, 'W': 87, 'L': 77},
'Hornets': {'G': 82, 'W': 33, 'L': 49},
'Jazz': {'G': 164, 'W': 91, 'L': 73},
'Kings': {'G': 164, 'W': 84, 'L': 80},
'Knicks': {'G': 164, 'W': 65, 'L': 99},
'Lakers': {'G': 164, 'W': 77, 'L': 87},
'Magic': {'G': 164, 'W': 68, 'L': 96},
'Mavericks': {'G': 164, 'W': 103, 'L': 61},
'Nets': {'G': 164, 'W': 64, 'L': 100},
'Nuggets': {'G': 164, 'W': 70, 'L': 94},
'Pacers': {'G': 164, 'W': 79, 'L': 85},
'Pelicans': {'G': 164, 'W': 91, 'L': 73},
'Pistons': {'G': 164, 'W': 64, 'L': 100},
'Raptors': {'G': 164, 'W': 96, 'L': 68},
'Rockets': {'G': 164, 'W': 101, 'L': 63},
'Sixers': {'G': 164, 'W': 74, 'L': 90},
'Spurs': {'G': 164, 'W': 113, 'L': 51},
'Suns': {'G': 164, 'W': 90, 'L': 74},
'Thunder': {'G': 164, 'W': 89, 'L': 75},
'Timberwolves': {'G': 164, 'W': 63, 'L': 101},
'Trailblazers': {'G': 164, 'W': 101, 'L': 63},
'Warriors': {'G': 164, 'W': 84, 'L': 80},
'Wizards': {'G': 164, 'W': 65, 'L': 99}}
# rename index column to team
df_seasons = pd.DataFrame(seasons).reset_index().rename(columns={'index':'team'})
# wide to long data for plotting
long_seasons = pd.melt(df_seasons, id_vars='team', var_name='year', value_name='seasons')
# Break out the G,W,L into columns
items = list(long_seasons["seasons"])
long_seasons["G"] = pd.DataFrame([item["G"] for item in items])
long_seasons["L"] = pd.DataFrame([item["L"] for item in items])
long_seasons["W"] = pd.DataFrame([item["W"] for item in items])
# Calculate winning % for each team per year, and add as a new column
long_seasons['win_%'] = round(long_seasons['W']/long_seasons['G'],2)
# drop seasons, redundant column with G, L, W
long_seasons = long_seasons.drop(columns=['seasons'])
# make year into int so it can be merged later
long_seasons["year"] = long_seasons["year"].apply(lambda x: int(x))
HTML(long_seasons.head().to_html(classes="table table-responsive table-striped table-bordered"))
team | year | G | L | W | win_% | |
---|---|---|---|---|---|---|
0 | Bucks | 2001 | 82 | 30 | 52 | 0.63 |
1 | Bulls | 2001 | 82 | 67 | 15 | 0.18 |
2 | Cavaliers | 2001 | 82 | 52 | 30 | 0.37 |
3 | Celtics | 2001 | 82 | 46 | 36 | 0.44 |
4 | Clippers | 2001 | 82 | 51 | 31 | 0.38 |
Now, I'll merge the two datasets (long_salary, long_seasons) together into one dataframe for scatterplots
# merge datasets together
final = long_salary.merge(long_seasons, on=["team","year"])
final
# add divisions data
divisions_dict = {"Atlantic": ["Celtics","Nets","Sixers","Raptors","Knicks"],
"Central": ["Bulls","Cavaliers","Pistons","Bucks","Pacers"],
"Southeast": ["Hawks","Hornets","Heat","Magic","Wizards"],
"Northwest": ["Timberwolves","Nuggets","Trailblazers","Thunder","Jazz"],
"Pacific" : ["Warriors","Clippers","Lakers","Kings","Suns"],
"Southwest" : ["Mavericks", "Rockets","Grizzlies","Pelicans","Spurs"]}
divisions = ["Atlantic","Central","Southeast","Northwest","Pacific","Southwest"]
def team_to_division(team):
for division in divisions:
if team in divisions_dict[division]:
return division
final["division"] = final["team"].apply(lambda x: team_to_division(x))
# add conference data
conferences = ["East","West"]
conferences_dict = {"East": ["Celtics","Nets","Sixers","Raptors","Knicks","Bulls","Cavaliers","Pistons","Bucks","Pacers","Hawks","Hornets","Heat","Magic","Wizards"],
"West": ["Timberwolves","Nuggets","Trailblazers","Thunder","Jazz","Warriors","Clippers","Lakers","Kings","Suns","Mavericks", "Rockets","Grizzlies","Pelicans","Spurs"]}
def division_to_conference(division):
for conference in conferences:
if division in conferences_dict[conference]:
return conference
final["conference"] = final["team"].apply(lambda x: division_to_conference(x))
# Remove early Charlotte Hornet years, missing data;
# they relocated to New Orleans, and were renamed as the Pelicans
final[final["team"] == 'Hornets'].loc[[3,33,63,93]]
final = final.drop([3,33,63,93])
final.to_csv("./data/final.csv")
raptors = final[final['team']== 'Raptors']
# change way we invoke plot so can manipulate the axis
fig, ax = plt.subplots(figsize=(3, 3))
#ax.scatter(x, y)
ax.scatter(x=raptors['z_salary'], y=raptors['win_%'],alpha=0.5,c="#CE1141")
plt.title('Toronto Raptors', position = (0,1), ha = 'left', fontsize=12)
plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey')
plt.ylabel('Win %', position = (0, 1), ha = 'right', color = 'grey')
# add linear regression line
X = pd.DataFrame(final[final["team"]=="Raptors"]["z_salary"])
Y = pd.DataFrame(raptors["win_%"])
linreg = linear_model.LinearRegression()
model = linreg.fit(X, Y)
predictions = model.predict(X)
plt.plot(X, predictions, color='#CE1141',linewidth=2)
# remove the top and right borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
# add in gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
# create cross hairs
plt.hlines(0.5,-3, 3,alpha=0.5)
plt.vlines(0,0,1,alpha=0.5)
# reduce ticks on x axis
plt.xticks([-2,-1,0,1,2])
plt.show()
Compare this with the plot for the league
# change way we invoke plot so can manipulate the axis
fig, ax = plt.subplots(figsize=(3, 3))
#ax.scatter(x, y)
scatter_league = (ax.scatter(x=final['z_salary'], y=final['win_%'],alpha=0.5,c="#000000"),
plt.title('NBA League', position = (0,1), ha = 'left', fontsize=12),
plt.xlabel("Standardized Salaries", position = (0,0), ha = 'left', color = 'grey'),
plt.ylabel('Win %', position = (0, 1), ha = 'right', color = 'grey'))
# add linear regression line
X = pd.DataFrame(final["z_salary"])
Y_league = pd.DataFrame(final["win_%"])
linreg = linear_model.LinearRegression()
model = linreg.fit(X, Y_league)
predictions_league = model.predict(X)
# remove the top and right borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
# add in gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
# create cross hairs
plt.hlines(0.5,-3, 3,alpha=0.5)
plt.vlines(0,0,1,alpha=0.5)
# reduce ticks on x axis
plt.xticks([-2,-1,0,1,2])
league_plot = plt.plot(X, predictions_league, color='#000000',linewidth=2,alpha=0.25)
plt.show()
# Create a dictionary with the team names, and team plot colour. I'll do this with Atlantic,
# then loop through each division on each row
atlantic = {"Celtics": "#007A33",
"Nets": "#000000",
"Sixers": "#006BB6",
"Raptors": "#CE1141",
"Knicks": "#F58426"}
def plot_team_row(team_df, ax,team_color='#00000'):
# linear regression
X = pd.DataFrame(team_df["z_salary"])
Y = pd.DataFrame(team_df["win_%"])
linreg = linear_model.LinearRegression()
model = linreg.fit(X, Y)
predictions = model.predict(X)
ax.plot(X, predictions, color=team_color,alpha=0.8,linewidth=2)
# plot league average
ax.plot(pd.DataFrame(final['z_salary']), predictions_league, color='#C4CED4',alpha=0.5,linewidth=2);
# scatterplot
ax.scatter(x=team_df['z_salary'], y=team_df['win_%'],alpha=0.5,c=team_color,s=10)
# title for each team subplot
ax.set_title(f"{team_df['team'].values[0]}",fontsize=10)
# create cross hairs
ax.hlines(0.5,-3, 3,alpha=0.5,linewidth=2)
ax.vlines(0,.10,.90,alpha=0.5,linewidth=2)
# ticks
plt.xticks([-2,0,2])
# remove borders
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
# add in gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
ax
object is an array with each array defining one subplotplot_team_row()
functionfig, ax = plt.subplots(nrows=1, ncols=5,figsize=(6,2),sharex=True, sharey=True)
# Division title
plt.figtext(0.48,1, "Atlantic", fontsize=10, color="grey")
# Spacing for subplots
fig.subplots_adjust(hspace = 0.5, wspace=.001, top=0.825)
# pos determines which axes object to pass into the function
#there are 5 pos
col=0
for team, colour in atlantic.items():
plot_team_row(final[final['team']==team], ax[col],colour)
col=col+1
# force label to appear on first plot rather than the last plot
ax[0].set_xlabel('Standardized Salaries', fontsize=10)
ax[0].set_ylabel('Win %', fontsize=10)
ax[0].tick_params(labelsize=8)
ax[1].tick_params(labelsize=8)
ax[2].tick_params(labelsize=8)
ax[3].tick_params(labelsize=8)
ax[4].tick_params(labelsize=8)
plt.show()
# team colours source: https://teamcolorcodes.com/
atlantic = {"Celtics": "#007A33",
"Nets": "#000000",
"Sixers": "#006BB6",
"Raptors": "#CE1141",
"Knicks": "#F58426"}
central = {"Bulls": "#CE1141",
"Cavaliers": "#6F263D",
"Pistons": "#C8102E",
"Bucks": "#00471B",
"Pacers": "#FDBB30"}
southeast = {"Hawks": "#E03A3E",
"Hornets":"#00788C",
"Heat": "#98002E",
"Magic": "#0077C0",
"Wizards": "#002B5C"}
northwest = {"Timberwolves": "#236192",
"Nuggets": "#FEC524",
"Trailblazers": "#E03A3E",
"Thunder": "#007AC1",
"Jazz": "#00471B"}
pacific = {"Warriors": "#FDB927",
"Lakers": "#552583",
"Clippers": "#C8102E",
"Kings": "#5A2D81",
"Suns": "#E56020"}
southwest= {"Mavericks": "#00538C",
"Rockets": "#CE1141",
"Grizzlies": "#5D76A9",
"Pelicans": "#85714D",
"Spurs": "#C4CED4"}
nba_league = [atlantic, central, southeast, northwest, pacific, southwest]
# 6 rows x 5 columns for all 30 teams
fig, ax = plt.subplots(nrows=6, ncols=5,figsize=(6,6),sharex=True, sharey=True)
# plot title
fig.suptitle('More Money, More Wins', x=0.55, y=1,color = 'black', fontsize=12)
# Subplots spacing
fig.subplots_adjust(hspace = .5, wspace=.001, top=0.925)
# ax[count][pos]: the axes object contains a numpy array
# count indicates the row
# pos indicates the column
# loop through each division, each team
row = 0
for division in nba_league:
col=0
for team,color in division.items():
plot_team_row(final[final['team']==team], ax[row][col],color)
col=col+1
row += 1
# labels
ax[5][2].set_xlabel('Standardized Salary', fontsize=8)
ax[0][0].set_ylabel('Win %', fontsize=8)
# ticks
for i in list(range(0,5)):
ax[5][i].tick_params(labelsize=8)
for i in list(range(0,5)):
ax[i][0].tick_params(labelsize=8)
plt.figtext(1,0.875, "Atlantic", fontsize=10, color="grey")
plt.figtext(1,0.725, "Central", fontsize=10, color="grey")
plt.figtext(1,0.575, "Southeast", fontsize=10, color="grey")
plt.figtext(1,0.425, "Northwest", fontsize=10, color="grey")
plt.figtext(1,0.275, "Pacific", fontsize=10, color="grey")
plt.figtext(1,0.125, "Southwest", fontsize=10, color="grey")
plt.show()
# plt.savefig('nba_team_salary.pdf')