How likely is a team to make the Final Four of the NCAA Tournament?
Each year, close to $4 billion is wagered on the NCAA Division 1 men's basketball tournament. Most of that money is wagered where the objective is to correctly predict winners of each game, with emphasis on the last four teams remaining (the Final Four).
In this project, my motivation is the following:
Based on a college's regular-season performance and seeding information, can I predict whether or not it will reach the Final Four?
Which variables are correlated with making it to the final four? As a corollary, my model will also output the associated probabilities of making it to the final four. Am I able to outperform a naive model? As even the sports pundits will tell you, since 2008 at least two No. 1 seeds have made the final four 53% of the time. So just by choosing two No. 1 seeds, you're halfway there.
As a trusted advisor to Coach Krzyzewski (Duke) or Coach Izzo (Michigan State), how would I recommend spending time developing the team?
Or at the very least, how do I improve my 2020 bracket to make some money?
# data wrangling
import pandas as pd
import numpy as np
# plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
from IPython.display import HTML
# preprocessing & feature engineering
from sklearn.preprocessing import StandardScaler, LabelBinarizer, PolynomialFeatures
from sklearn_pandas import DataFrameMapper, CategoricalImputer, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler
# modelling & evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score, confusion_matrix
# scientific notation off
np.set_printoptions(suppress=True)
pd.options.display.float_format = '{:.2f}'.format
# suppress warnings
from sklearn.exceptions import DataConversionWarning
import warnings
warnings.filterwarnings(action='ignore')
The data spans 2003 to 2017 and is compiled from sports-reference.com.
The data is spread across four files:
regular_season.csv
- gamelogs for every regular season game
teams.csv
- team_id, names, and conferences
march_madness.csv
- gamelogs for each NCAA tournament game
march_madness_seeds.csv
- entry seeds for each team (W, X, Y, Z indicate the region)
regular = pd.read_csv("./data/ncaa_data/regular_season.csv")
HTML(regular.tail(3).to_html(classes="table table-responsive table-striped table-bordered"))
season | day_in_season | winning_team_id | winning_team_score | losing_team_id | losing_team_score | winning_team_field_goals | winning_team_field_goals_attempted | winning_team_three_points | winning_team_three_points_attempted | winning_team_free_throws | winning_team_free_throws_attempted | winning_team_offensive_rebounds | winning_team_defensive_rebounds | winning_team_assists | winning_team_turnovers | winning_team_steals | winning_team_blocks | winning_team_personal_fouls | losing_team_field_goals | losing_team_field_goals_attempted | losing_team_three_points | losing_team_three_points_attempted | losing_team_free_throws | losing_team_free_throws_attempted | losing_team_offensive_rebounds | losing_team_defensive_rebounds | losing_team_assists | losing_team_turnovers | losing_team_steals | losing_team_blocks | losing_team_personal_fouls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
76633 | 2017 | 132 | 1348 | 70 | 1433 | 63 | 24 | 54 | 8 | 20 | 14 | 19 | 9 | 27 | 12 | 6 | 3 | 7 | 18 | 21 | 67 | 4 | 14 | 17 | 22 | 23 | 24 | 8 | 5 | 4 | 1 | 16 |
76634 | 2017 | 132 | 1374 | 71 | 1153 | 56 | 26 | 52 | 10 | 19 | 9 | 13 | 7 | 27 | 14 | 8 | 2 | 6 | 15 | 19 | 61 | 4 | 24 | 14 | 18 | 17 | 22 | 7 | 7 | 7 | 1 | 13 |
76635 | 2017 | 132 | 1407 | 59 | 1402 | 53 | 21 | 60 | 1 | 17 | 16 | 19 | 14 | 19 | 5 | 5 | 10 | 3 | 10 | 20 | 48 | 6 | 17 | 7 | 8 | 9 | 27 | 10 | 17 | 1 | 7 | 18 |
Let's interpret the last row: 1407 = Troy, 1402 = Texas State. Troy beat Texas State 59 to 53 on day 132 of the season, which falls in March 2017 (technically the 2016-2017 season). Notice that the regular season finishes on day 132 every year.
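As a quick sanity check of that claim, we can take the max of day_in_season per season; a minimal sketch on a toy gamelog with the same columns (values made up):

```python
import pandas as pd

# toy gamelog with the same columns as regular_season.csv (made-up rows)
regular_toy = pd.DataFrame({
    "season": [2016, 2016, 2017, 2017],
    "day_in_season": [130, 132, 131, 132],
    "winning_team_id": [1348, 1407, 1374, 1407],
    "losing_team_id": [1433, 1402, 1153, 1402],
})
# if every regular season ends on day 132, the per-season max is always 132
last_day = regular_toy.groupby("season")["day_in_season"].max()
print(sorted(last_day.unique().tolist()))  # [132]
```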
From the regular season, there are:
teams = pd.read_csv("./data/ncaa_data/teams.csv")
HTML(teams.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season | team_id | team_name | conference_code | conference_name | |
---|---|---|---|---|---|
0 | 2014 | 1101 | Abilene Chr | southland | Southland Conference |
1 | 2015 | 1101 | Abilene Chr | southland | Southland Conference |
2 | 2016 | 1101 | Abilene Chr | southland | Southland Conference |
teams = teams.drop(columns=["conference_name"])
From the teams data:
mm = pd.read_csv("./data/ncaa_data/march_madness.csv")
HTML(mm.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season | day_in_season | winning_team_id | winning_team_score | losing_team_id | losing_team_score | |
---|---|---|---|---|---|---|
0 | 1985 | 136 | 1116 | 63 | 1234 | 54 |
1 | 1985 | 136 | 1120 | 59 | 1345 | 58 |
2 | 1985 | 136 | 1207 | 68 | 1250 | 43 |
From the march madness data:
mm = mm[mm["season"]>=2003]
seeds = pd.read_csv("./data/ncaa_data/march_madness_seeds.csv")
HTML(seeds.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season | seed | team_id | |
---|---|---|---|
0 | 2003 | W01 | 1328 |
1 | 2003 | W02 | 1448 |
2 | 2003 | W03 | 1393 |
seeds["seed"] = seeds["seed"].apply(lambda x: int(x[1:3]))
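For clarity, here is what that lambda does on a few hypothetical seed strings (the trailing "a" for play-in teams is an assumption about the format):

```python
# seed strings: region letter + two-digit seed, sometimes a play-in suffix
# (format assumed); int(x[1:3]) keeps just the numeric seed
def parse_seed(s):
    return int(s[1:3])

print(parse_seed("W01"))   # 1
print(parse_seed("Z12"))   # 12
print(parse_seed("Y16a"))  # 16
```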
seed_and_names = pd.merge(seeds, teams, how="left", on=["season","team_id"]).drop_duplicates()
HTML(seed_and_names.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season | seed | team_id | team_name | conference_code | |
---|---|---|---|---|---|
0 | 2003 | 1 | 1328 | Oklahoma | big_twelve |
22 | 2003 | 2 | 1448 | Wake Forest | acc |
56 | 2003 | 3 | 1393 | Syracuse | big_east |
HTML(mm[mm["season"]==2017].tail().to_html(classes="table table-responsive table-striped table-bordered"))
season | day_in_season | winning_team_id | winning_team_score | losing_team_id | losing_team_score | final_four | |
---|---|---|---|---|---|---|---|
2112 | 2017 | 146 | 1314 | 75 | 1246 | 73 | 0 |
2113 | 2017 | 146 | 1376 | 77 | 1196 | 70 | 0 |
2114 | 2017 | 152 | 1211 | 77 | 1376 | 73 | 1 |
2115 | 2017 | 152 | 1314 | 77 | 1332 | 76 | 1 |
2116 | 2017 | 154 | 1314 | 71 | 1211 | 65 | 0 |
We can see the championship game was played on day 154 and the final four games on day 152! Perfect, now we can flag all of these teams with final_four = 1.
We'll query the winning_team_id and losing_team_id for each of those games to get the final four teams
# 15 seasons, we should get 2*15 = 30 final four games
mm["final_four"]=mm["day_in_season"].apply(lambda x: 1 if x == 152 else 0)
Let's define X and y so we can more easily perform EDA, feature engineering and modeling. First, we'll need a target vector y, with all the teams in each season and whether or not they made it to the Final Four.
Let's work off the seed_and_names data frame, as those are all of our 996 participating colleges across the seasons.
seed_and_names["final_four"]=np.zeros(len(seed_and_names)) # zeros
final_four_list = (list(zip(mm.query("final_four==1").season,mm.query("final_four==1").winning_team_id))+
list(zip(mm.query("final_four==1").season,mm.query("final_four==1").losing_team_id)))
# fill in teams with 1 if final_four team
for season, team_id in final_four_list:
    seed_and_names["final_four"] += np.where((seed_and_names.season==season) & (seed_and_names.team_id==team_id), 1, 0)
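On a toy tournament frame shaped like mm (scores omitted, team ids taken from the 2017 rows above), the day-152 flag plus the winner/loser zip recovers exactly the four final four teams:

```python
import pandas as pd

# toy tournament gamelog mirroring the 2017 bracket rows shown above
mm_toy = pd.DataFrame({
    "season": [2017, 2017, 2017, 2017],
    "day_in_season": [146, 152, 152, 154],
    "winning_team_id": [1314, 1211, 1314, 1314],
    "losing_team_id": [1246, 1376, 1332, 1211],
})
mm_toy["final_four"] = mm_toy["day_in_season"].apply(lambda x: 1 if x == 152 else 0)
ff = mm_toy.query("final_four == 1")
final_four_list = (list(zip(ff.season, ff.winning_team_id)) +
                   list(zip(ff.season, ff.losing_team_id)))
# two winners + two losers of the day-152 games = the four final four teams
print(sorted(int(t) for _, t in final_four_list))  # [1211, 1314, 1332, 1376]
```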
HTML(seed_and_names.query("final_four==1 & season==2017").to_html(classes="table table-responsive table-striped table-bordered"))
season | seed | team_id | team_name | conference_code | final_four | |
---|---|---|---|---|---|---|
23045 | 2017 | 7 | 1376 | South Carolina | sec | 1.0 |
23292 | 2017 | 1 | 1211 | Gonzaga | wcc | 1.0 |
23604 | 2017 | 3 | 1332 | Oregon | pac_twelve | 1.0 |
23905 | 2017 | 1 | 1314 | North Carolina | acc | 1.0 |
Great, now we have a final_four column, which will be our target y. Next we have to aggregate the stats for each team over all 76636 games played.
e.g. for South Carolina, we now need aggregated stats: points for, points against, etc. We aggregate opponents' stats as well, so we have a sense of both offensive and defensive ability, because defense wins championships!
def stat_for(seed_and_names,stat,name):
    # season totals of a team's own stat, summed across its wins and its losses
    df_w = regular.pivot_table(index="winning_team_id", columns="season", values=f'winning_team_{stat}', aggfunc=np.sum)
    df_l = regular.pivot_table(index="losing_team_id", columns="season", values=f'losing_team_{stat}', aggfunc=np.sum)
    # rename_axis keeps the index name through the addition (the two pivots
    # have different index names, so the sum would otherwise drop the name)
    df_for = (df_w + df_l).rename_axis("winning_team_id")
    df_for.reset_index(inplace=True)
    df_for_melt = df_for.melt(id_vars=["winning_team_id"])
    df_for_melt.columns = ["team_id", "season", name]
    return pd.merge(seed_and_names, df_for_melt, how="left", on=["season","team_id"])
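Here is a minimal sketch of the same pivot-and-melt aggregation on toy data; it uses .add(fill_value=0), a variation on the function above, so a team that never lost (or never won) doesn't come out as NaN:

```python
import pandas as pd

# toy gamelog (made-up scores); team 1 never loses, team 3 never wins
toy_games = pd.DataFrame({
    "season": [2017, 2017, 2017],
    "winning_team_id": [1, 2, 1],
    "losing_team_id": [2, 3, 3],
    "winning_team_score": [70, 65, 80],
    "losing_team_score": [60, 61, 50],
})
df_w = toy_games.pivot_table(index="winning_team_id", columns="season",
                             values="winning_team_score", aggfunc="sum")
df_l = toy_games.pivot_table(index="losing_team_id", columns="season",
                             values="losing_team_score", aggfunc="sum")
# fill_value=0 keeps undefeated (or winless) teams from becoming NaN
totals = df_w.add(df_l, fill_value=0).rename_axis("team_id").reset_index()
melted = totals.melt(id_vars=["team_id"], value_name="points_for")
print(melted.sort_values("team_id").points_for.tolist())  # [150.0, 125.0, 111.0]
```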
seed_and_names = stat_for(seed_and_names,'score','points_for')
seed_and_names = stat_for(seed_and_names,'field_goals','fg_for')
seed_and_names = stat_for(seed_and_names,'field_goals_attempted','fga_for')
seed_and_names = stat_for(seed_and_names,'three_points','3pm_for')
seed_and_names = stat_for(seed_and_names,'three_points_attempted','3pa_for')
seed_and_names = stat_for(seed_and_names,'free_throws','ft_for')
seed_and_names = stat_for(seed_and_names,'free_throws_attempted','fta_for')
seed_and_names = stat_for(seed_and_names,'offensive_rebounds','off_rebounds_for')
seed_and_names = stat_for(seed_and_names,'defensive_rebounds','def_rebounds_for')
seed_and_names = stat_for(seed_and_names,'assists','assists_for')
seed_and_names = stat_for(seed_and_names,'steals','steals_for')
seed_and_names = stat_for(seed_and_names,'blocks','blocks_for')
seed_and_names = stat_for(seed_and_names,'turnovers','turnovers_for')
seed_and_names = stat_for(seed_and_names,'personal_fouls','fouls_for')
def stat_against(seed_and_names,stat,name):
    # season totals of the opponents' stat: what each team gave up in wins and losses
    df_w = regular.pivot_table(index="winning_team_id", columns="season", values=f'losing_team_{stat}', aggfunc=np.sum)
    df_l = regular.pivot_table(index="losing_team_id", columns="season", values=f'winning_team_{stat}', aggfunc=np.sum)
    # rename_axis keeps the index name through the addition, as in stat_for
    df_against = (df_w + df_l).rename_axis("winning_team_id")
    df_against.reset_index(inplace=True)
    df_against_melt = df_against.melt(id_vars=["winning_team_id"])
    df_against_melt.columns = ["team_id", "season", name]
    return pd.merge(seed_and_names, df_against_melt, how="left", on=["season","team_id"])
seed_and_names = stat_against(seed_and_names,'score','points_against')
seed_and_names = stat_against(seed_and_names,'field_goals','fg_against')
seed_and_names = stat_against(seed_and_names,'field_goals_attempted','fga_against')
seed_and_names = stat_against(seed_and_names,'three_points','3pm_against')
seed_and_names = stat_against(seed_and_names,'three_points_attempted','3pa_against')
seed_and_names = stat_against(seed_and_names,'free_throws','ft_against')
seed_and_names = stat_against(seed_and_names,'free_throws_attempted','fta_against')
seed_and_names = stat_against(seed_and_names,'offensive_rebounds','off_rebounds_against')
seed_and_names = stat_against(seed_and_names,'defensive_rebounds','def_rebounds_against')
seed_and_names = stat_against(seed_and_names,'assists','assists_against')
seed_and_names = stat_against(seed_and_names,'steals','steals_against')
seed_and_names = stat_against(seed_and_names,'blocks','blocks_against')
seed_and_names = stat_against(seed_and_names,'turnovers','turnovers_against')
seed_and_names = stat_against(seed_and_names,'personal_fouls','fouls_against')
regular_season_total = seed_and_names.reset_index().drop(columns=["index"])
y_test = regular_season_total[regular_season_total["season"]>=2015]["final_four"]
y_train = regular_season_total[regular_season_total["season"]<2015]["final_four"]
Note, I ran into an error with Wichita St. since they were undefeated (they never appear as a losing team), so we need to avoid indexing issues by imputing with 0.
regular_season_total.fillna(0,inplace=True)
X = regular_season_total[['season', 'seed', 'team_id', 'team_name', 'conference_code', 'points_for', 'points_against', 'fg_for', 'fg_against',
'3pm_for', '3pm_against', 'fga_for', 'fga_against', '3pa_for',
'3pa_against', 'ft_for', 'ft_against', 'fta_for', 'fta_against',
'off_rebounds_for', 'off_rebounds_against', 'def_rebounds_for',
'def_rebounds_against', 'assists_for', 'assists_against', 'steals_for',
'steals_against', 'blocks_for', 'blocks_against', 'turnovers_for',
'turnovers_against', 'fouls_for', 'fouls_against']]
X_train = X.query("season<2015")
X_test = X.query("season>=2015")
mapper = DataFrameMapper([
(['season'],None),
(['seed'],None),
(['team_id'],None),
(['conference_code'],LabelBinarizer()),
(['points_for'],None),
(['points_against'],None),
(['fg_for'],None),
(['fg_against'],None),
(['3pm_for'],None),
(['3pm_against'],None),
(['fga_for'],None),
(['fga_against'],None),
(['3pa_for'],None),
(['3pa_against'],None),
(['ft_for'],None),
(['ft_against'],None),
(['fta_for'],None),
(['fta_against'],None),
(['off_rebounds_for'],None),
(['off_rebounds_against'],None),
(['def_rebounds_for'],None),
(['def_rebounds_against'],None),
(['assists_for'],None),
(['assists_against'],None),
(['steals_for'],None),
(['steals_against'],None),
(['blocks_for'],None),
(['blocks_against'],None),
(['turnovers_for'],None),
(['turnovers_against'],None),
(['fouls_for'],None),
(['fouls_against'],None)
],df_out=True)
Z_train = mapper.fit_transform(X_train)
Z_test = mapper.transform(X_test)
# Remember: never fit the mapper on the test set. If the test data contains a
# category the training data never had, fitting on it would binarize the new
# category as if it were known, when in fact it should be treated as something different.
# Fix shape problems
# list(zip(Z_train.columns,Z_test.columns)) # there are fewer conferences in the 2015 -> 2017 test years
# mid_cont and pac_ten are in Z_train but not in Z_test
HTML(teams[teams["conference_code"]=="mid_cont"].drop_duplicates(subset="season").to_html(classes="table table-responsive table-striped table-bordered"))
# mid_cont only appears 2003->2007 in this data
# Chicago St moved from mid_cont to the gwc in 2010
season | team_id | team_name | conference_code | |
---|---|---|---|---|
14628 | 2004 | 1147 | Centenary | mid_cont |
14629 | 2005 | 1147 | Centenary | mid_cont |
14630 | 2006 | 1147 | Centenary | mid_cont |
14631 | 2007 | 1147 | Centenary | mid_cont |
16279 | 2003 | 1152 | Chicago St | mid_cont |
HTML(teams[teams["conference_code"]=="pac_ten"].drop_duplicates(subset="season").to_html(classes="table table-responsive table-striped table-bordered"))
# pac_ten only occurred 2003->2011;
# in 2012, Arizona played in the pac_twelve
season | team_id | team_name | conference_code | |
---|---|---|---|---|
3601 | 2003 | 1112 | Arizona | pac_ten |
3602 | 2004 | 1112 | Arizona | pac_ten |
3603 | 2005 | 1112 | Arizona | pac_ten |
3604 | 2006 | 1112 | Arizona | pac_ten |
3605 | 2007 | 1112 | Arizona | pac_ten |
3606 | 2008 | 1112 | Arizona | pac_ten |
3607 | 2009 | 1112 | Arizona | pac_ten |
3608 | 2010 | 1112 | Arizona | pac_ten |
3609 | 2011 | 1112 | Arizona | pac_ten |
# Add the conferences missing from the test years (pac_ten, mid_cont) as
# all-zero columns so Z_test has the same columns as Z_train
Z_test["conference_code_pac_ten"] = 0.0
Z_test["conference_code_mid_cont"] = 0.0
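A more general way to align the dummy columns is to reindex the test frame against the training columns; a sketch with a couple of hypothetical conference columns:

```python
import pandas as pd

# hypothetical one-hot frames: training saw a conference the test years did not
train_dummies = pd.DataFrame({"conference_code_acc": [1, 0],
                              "conference_code_pac_ten": [0, 1]})
test_dummies = pd.DataFrame({"conference_code_acc": [1, 1]})
# reindex to the training columns, filling unseen dummies with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_dummies.columns))  # ['conference_code_acc', 'conference_code_pac_ten']
```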
Let's quickly spin up a naive baseline/benchmark model and get it out the door. Then, in subsequent steps, we can feature engineer and model to improve our score!
From FiveThirtyEight, No. 1 seeds only had a 35%-52% chance of reaching the final four.
Let's see how our naive model's predictions fared.
# Baseline logistic
# Step 1: Instantiate our model.
logreg_baseline = LogisticRegression(random_state=8)
# Step 2: Fit our model.
logreg_baseline.fit(Z_train,y_train)
# Step 3 (part 1): Generate prediction values
print(f'Number of teams making it to final four: {sum(logreg_baseline.predict(Z_train))}')
# Step 3 (part 2): Generate predictions/probabilities
logreg_baseline.predict_proba(Z_train)[:,1]
# Step 4: Score the model:
print(f' Logreg train accuracy: {logreg_baseline.score(Z_train,y_train)}')
print(f' Logreg test accuracy: {logreg_baseline.score(Z_test,y_test)}')
Number of teams making it to final four: 18.0
Logreg train accuracy: 0.9444444444444444
Logreg test accuracy: 0.9558823529411765
roc_auc_score(y_test, logreg_baseline.predict_proba(Z_test)[:,1])
0.7300347222222222
Since 73% > 50%, the AUC tells us that this classifier is better than a 'no information' classifier. However, there may be room for improvement.
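For intuition on that threshold, a 'no information' classifier that assigns every team the same probability scores exactly 0.5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1])
# constant scores carry no ranking information -> AUC is 0.5 by construction
print(roc_auc_score(y_true, np.full(4, 0.5)))  # 0.5
```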
To get a better evaluation of our baseline model, let's better represent the minority class (final four) with more signal via oversampling, i.e. randomly sampling the minority class with replacement. I prefer this over SMOTE (synthetic oversampling) because I would like to keep track of the number of games played, and to keep the years and seeds as integers.
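Conceptually, random oversampling just duplicates minority-class rows at random until the classes balance; a pandas-only sketch on a toy target:

```python
import pandas as pd

# toy imbalanced target: 8 non-final-four rows, 2 final-four rows
toy = pd.DataFrame({"final_four": [0]*8 + [1]*2, "seed": list(range(10))})
minority = toy[toy["final_four"] == 1]
n_extra = (toy["final_four"] == 0).sum() - len(minority)
# sample the minority class with replacement until the counts match (8 vs 8)
balanced = pd.concat([toy, minority.sample(n=n_extra, replace=True, random_state=8)])
print(balanced["final_four"].value_counts().sort_index().tolist())  # [8, 8]
```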
# final_four is added to this mapper so we can split afterwards and preserve order, rather than splitting randomly
mapper_all = DataFrameMapper([
(['season'], None),
(['seed'], None),
(['team_id'], None),
(['conference_code'], LabelBinarizer()),
(['final_four'],None),
(['points_for'],None),
(['points_against'],None),
(['fg_for'],None),
(['fg_against'],None),
(['3pm_for'],None),
(['3pm_against'],None),
(['fga_for'],None),
(['fga_against'],None),
(['3pa_for'],None),
(['3pa_against'],None),
(['ft_for'],None),
(['ft_against'],None),
(['fta_for'],None),
(['fta_against'],None),
(['off_rebounds_for'],None),
(['off_rebounds_against'],None),
(['def_rebounds_for'],None),
(['def_rebounds_against'],None),
(['assists_for'],None),
(['assists_against'],None),
(['steals_for'],None),
(['steals_against'],None),
(['blocks_for'],None),
(['blocks_against'],None),
(['turnovers_for'],None),
(['turnovers_against'],None),
(['fouls_for'],None),
(['fouls_against'],None)
],df_out=True)
Z = mapper_all.fit_transform(regular_season_total)
y = Z["final_four"]
X = Z[['season', 'seed', 'team_id', 'conference_code_a_sun',
'conference_code_a_ten', 'conference_code_aac', 'conference_code_acc',
'conference_code_aec', 'conference_code_big_east',
'conference_code_big_sky', 'conference_code_big_south',
'conference_code_big_ten', 'conference_code_big_twelve',
'conference_code_big_west', 'conference_code_caa',
'conference_code_cusa', 'conference_code_horizon',
'conference_code_ivy', 'conference_code_maac', 'conference_code_mac',
'conference_code_meac', 'conference_code_mid_cont',
'conference_code_mvc', 'conference_code_mwc', 'conference_code_nec',
'conference_code_ovc', 'conference_code_pac_ten',
'conference_code_pac_twelve', 'conference_code_patriot',
'conference_code_sec', 'conference_code_southern',
'conference_code_southland', 'conference_code_summit',
'conference_code_sun_belt', 'conference_code_swac',
'conference_code_wac', 'conference_code_wcc',
'points_for', 'points_against', 'fg_for', 'fg_against', '3pm_for',
'3pm_against', 'fga_for', 'fga_against', '3pa_for', '3pa_against',
'ft_for', 'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
'assists_for', 'assists_against', 'steals_for', 'steals_against',
'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
'fouls_for', 'fouls_against']]
random_sampler = RandomOverSampler(random_state=8)
X_resampled, y_resampled = random_sampler.fit_resample(X, y)
print(X_resampled.shape)
print(y_resampled.shape) # there are 936 final four rows and 936 non-final four
print(sum(y_resampled))
(1872, 65)
(1872,)
936.0
Let's recalculate accuracy and AUC
df = pd.concat([pd.DataFrame(y_resampled),pd.DataFrame(X_resampled)],axis=1)
df.fillna(0,inplace=True)
df.columns = ['final_four','season', 'seed', 'team_id', 'conference_code_a_sun',
'conference_code_a_ten', 'conference_code_aac', 'conference_code_acc',
'conference_code_aec', 'conference_code_big_east',
'conference_code_big_sky', 'conference_code_big_south',
'conference_code_big_ten', 'conference_code_big_twelve',
'conference_code_big_west', 'conference_code_caa',
'conference_code_cusa', 'conference_code_horizon',
'conference_code_ivy', 'conference_code_maac', 'conference_code_mac',
'conference_code_meac', 'conference_code_mid_cont',
'conference_code_mvc', 'conference_code_mwc', 'conference_code_nec',
'conference_code_ovc', 'conference_code_pac_ten',
'conference_code_pac_twelve', 'conference_code_patriot',
'conference_code_sec', 'conference_code_southern',
'conference_code_southland', 'conference_code_summit',
'conference_code_sun_belt', 'conference_code_swac',
'conference_code_wac', 'conference_code_wcc', 'points_for',
'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
'assists_for', 'assists_against', 'steals_for', 'steals_against',
'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
'fouls_for', 'fouls_against']
# Let's split on the years again
y_test = df[df["season"]>=2015]["final_four"]
y_train = df[df["season"]<2015]["final_four"]
# 379/1872 is roughly a 20%/80% split, which is fine; splitting by season preserves meaning and interpretability.
# We are using 2003 -> 2014 data to train and 2015, 2016, 2017 to test.
X = df.loc[:,"season":"fouls_against"]
X_test = X[X["season"]>=2015]
X_train = X[X["season"]<2015]
print(f' X_train shape {X_train.shape}')
print(f' y_train shape {y_train.shape}')
print("\n")
print(f' X_test shape {X_test.shape}')
print(f' y_test shape {y_test.shape}')
X_train shape (1493, 65)
y_train shape (1493,)
X_test shape (379, 65)
y_test shape (379,)
# Baseline logistic with balanced classes
# Step 1: Instantiate our model.
logreg_baseline = LogisticRegression(random_state=8)
# Step 2: Fit our model.
logreg_baseline.fit(X_train,y_train)
# Step 3 (part 1): Generate prediction values
print(f'Number of teams making it to final four: {sum(logreg_baseline.predict(X_train))}')
# Step 3 (part 2): Generate predictions/probabilities
logreg_baseline.predict_proba(X_train)[:,1]
# Step 4: Score the model:
print(f' Logreg train accuracy: {cross_val_score(logreg_baseline, X_train, y_train, cv=5).mean()}')
print(f' Logreg test accuracy: {cross_val_score(logreg_baseline, X_test, y_test, cv=5).mean()}')
Number of teams making it to final four: 858.0
Logreg train accuracy: 0.8961679222548786
Logreg test accuracy: 0.9473593073593072
roc_auc_score(y_test, logreg_baseline.predict_proba(X_test)[:,1])
0.7704155525846702
print(f' seed coeff {np.exp(-0.562391)}')
print(f' acc coeff {np.exp(-1.32)}')
seed coeff 0.5698449344437365
acc coeff 0.26713530196585034
The seed is very important. Remember, we need to exponentiate the coefficients to interpret them as odds multipliers: if your seed increases by 1, your odds of reaching the final four are multiplied by 0.57. The individual team stats are less important, as I suspect the seed already captures a lot of that information. Season and team_id obviously do not matter, as shown by their 0 coefficients.
The conference you play in also matters (acc, big_twelve or big_ten). If you play in the acc, your odds of reaching the final four are multiplied by 0.27; this probably accounts for the many acc teams that are not top tier.
HTML(pd.DataFrame(list(zip(X.columns,logreg_baseline.coef_.T[:,0]))).head(3).to_html(classes="table table-responsive table-striped table-bordered"))
0 | 1 | |
---|---|---|
0 | season | -0.00 |
1 | seed | -0.56 |
2 | team_id | -0.00 |
The AUC scores are more representative of the model's true performance. This basic logistic model was accurate about 90% of the time on training data and 95% on test data. It's a good benchmark, with no strong indication of over-fitting, but let's try to improve through feature engineering (advanced basketball metrics!) and regularization.
Notice how the AUC score has improved. With this model (trained on balanced classes), we are materially better than a 'no information' classifier. The score is closer to 1, so the model does not display signs of imbalanced classes, nor an overwhelming number of false positives versus false negatives. We'll revisit this later in model evaluation.
Let us perform EDA to better understand our data.
final_four is positively correlated with fg_for (higher fg_for -> more likely to be in the final four).
Some of these independent variables are highly correlated with one another, so I may not want to use all of them, to reduce over-fitting.
quant = df[['final_four', 'season', 'seed', 'team_id', 'points_for',
'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
'assists_for', 'assists_against', 'steals_for', 'steals_against',
'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
'fouls_for', 'fouls_against']]
stats = df[['points_for','points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
'assists_for', 'assists_against', 'steals_for', 'steals_against',
'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
'fouls_for', 'fouls_against']]
fig, ax = plt.subplots(figsize=(8, 8))
corr = stats.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, xticklabels=True, yticklabels=True)
HTML(quant.describe().to_html(classes='table table-responsive table-striped'))
final_four | season | seed | team_id | points_for | points_against | fg_for | fg_against | 3pm_for | 3pm_against | fga_for | fga_against | 3pa_for | 3pa_against | ft_for | ft_against | fta_for | fta_against | off_rebounds_for | off_rebounds_against | def_rebounds_for | def_rebounds_against | assists_for | assists_against | steals_for | steals_against | blocks_for | blocks_against | turnovers_for | turnovers_against | fouls_for | fouls_against | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 | 1872.00 |
mean | 0.50 | 2009.95 | 6.05 | 1290.26 | 2370.96 | 2049.25 | 836.08 | 729.79 | 213.76 | 192.89 | 1802.26 | 1780.15 | 588.58 | 585.63 | 485.04 | 396.78 | 686.40 | 577.32 | 379.41 | 351.31 | 788.69 | 688.99 | 469.75 | 376.67 | 224.75 | 193.06 | 133.78 | 101.09 | 403.02 | 446.56 | 553.43 | 605.75 |
std | 0.50 | 4.34 | 4.79 | 99.57 | 310.72 | 254.79 | 115.35 | 95.10 | 47.18 | 35.78 | 233.19 | 229.30 | 116.29 | 103.06 | 81.73 | 74.40 | 111.37 | 106.74 | 73.81 | 60.89 | 109.08 | 94.69 | 78.42 | 58.24 | 49.50 | 34.69 | 50.44 | 23.60 | 63.21 | 76.71 | 79.72 | 78.79 |
min | 0.00 | 2003.00 | 1.00 | 1102.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 0.00 | 2006.00 | 2.00 | 1210.00 | 2226.00 | 1957.00 | 779.00 | 695.00 | 184.00 | 171.00 | 1708.50 | 1680.00 | 525.00 | 534.00 | 446.00 | 357.75 | 637.00 | 518.75 | 337.00 | 319.00 | 733.00 | 646.00 | 428.00 | 347.75 | 191.00 | 173.00 | 101.00 | 87.00 | 376.00 | 404.00 | 522.00 | 574.00 |
50% | 0.50 | 2010.00 | 4.00 | 1277.00 | 2403.00 | 2061.00 | 838.00 | 733.00 | 212.00 | 192.00 | 1824.00 | 1802.00 | 595.00 | 588.00 | 488.00 | 394.50 | 698.00 | 582.00 | 376.00 | 355.00 | 798.00 | 698.00 | 467.00 | 375.00 | 224.00 | 194.00 | 123.00 | 102.00 | 398.00 | 444.00 | 561.00 | 615.00 |
75% | 1.00 | 2014.00 | 10.00 | 1386.00 | 2552.00 | 2164.00 | 901.00 | 776.00 | 243.00 | 213.00 | 1916.00 | 1903.00 | 662.00 | 646.00 | 525.00 | 442.00 | 746.00 | 641.00 | 425.00 | 386.25 | 851.00 | 742.00 | 512.00 | 406.00 | 252.00 | 211.00 | 159.25 | 114.00 | 440.00 | 493.00 | 599.00 | 647.00 |
max | 1.00 | 2017.00 | 16.00 | 1463.00 | 3016.00 | 2657.00 | 1113.00 | 999.00 | 342.00 | 335.00 | 2245.00 | 2248.00 | 923.00 | 899.00 | 696.00 | 648.00 | 1020.00 | 921.00 | 555.00 | 510.00 | 1021.00 | 956.00 | 709.00 | 564.00 | 402.00 | 291.00 | 299.00 | 179.00 | 584.00 | 695.00 | 795.00 | 795.00 |
While the team's season statistics are useful, it may be more predictive to understand a team's performance on a per game basis, to understand how it's winning or losing its games.
The main categorical variable is conference. It also appears predictive in the baseline, representing the strength of the conference: some conferences are generally very strong and play stronger competition than others.
Note that big_east, acc, big_ten, sec and big_twelve all have 5+ teams that have made it to the final four. These are the conferences with the best teams.
HTML(pd.DataFrame(regular_season_total.groupby("conference_code")["final_four"].sum()).sort_values(by="final_four",ascending=False)[0:5].to_html(classes='table table-responsive table-striped'))
final_four | |
---|---|
conference_code | |
big_east | 11.00 |
acc | 10.00 |
big_ten | 10.00 |
sec | 9.00 |
big_twelve | 6.00 |
Some other observations include:
- No seed greater than 11 has made it to the final four
- Teams which make the final four score more in the regular season, make more FGs, give up fewer FTs, rebound more, block more, and foul less... all the attributes of a great basketball team
Let's keep the stat metrics on the same scale, on a per-game basis rather than standardizing, to keep them interpretable. First we need to count the number of games played by each team.
regular["game_won"] = np.ones(regular.shape[0])
df_gw = regular.groupby(["season","winning_team_id"]).count().reset_index()[["season", "winning_team_id","game_won"]]
df_gw.columns = ["season", "team_id","game_won"]
regular["game_lost"] = np.ones(regular.shape[0])
df_gl = regular.groupby(["season","losing_team_id"]).count().reset_index()[["season", "losing_team_id","game_lost"]]
df_gl.columns = ["season", "team_id","game_lost"]
df = pd.merge(df,df_gl,how="left",on=['season','team_id'])
df = pd.merge(df,df_gw,how="left",on=['season','team_id'])
df["game_total"] = df["game_won"] + df["game_lost"]
df.fillna(0,inplace=True)
# Effective field goal % (eFG%)
# This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
df["efg%"]=(df["fg_for"] + 0.5*df["3pm_for"]) / df["fga_for"]
df.loc[:,'points_for':'fouls_against'] = df[['points_for', 'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
'assists_for', 'assists_against', 'steals_for', 'steals_against',
'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
'fouls_for', 'fouls_against']].apply(lambda x: x / df.game_total)
Let's use some advanced metrics from the NBA. Note, the opponent percentages are kept in order to gauge defense as well.
# Now, instead of games_won and games_lost, let's combine this into win_%
df['win_%']=df["game_won"]/df["game_total"]
# +/- point differential aka margin of victory
df["margin_of_victory"]=df["points_for"]-df["points_against"]
# Assist-to-turnover ratio: this measures a team's ability to take care of possessions and pass the ball.
df["ast_to_ratio"] = df["assists_for"]/df["turnovers_for"]
# Convert field goals, threes and FTs into percentages. This reduces collinearity and simplifies interpretation.
df['fg%_for'] = df['fg_for']/df['fga_for']
df['fg%_against'] = df['fg_against']/df['fga_against']
df['3p%_for'] = df['3pm_for']/df['3pa_for']
df['3p%_against'] = df['3pm_against']/df['3pa_against']
df['ft%_for'] = df['ft_for']/df['fta_for']
df['ft%_against'] = df['ft_against']/df['fta_against']
# Drop games won and lost; they're redundant with win_%
df = df.drop(["game_won","game_lost"],axis=1)
# Some rows have missing data, likely teams absent from the sample; drop them for now since they produce NaNs.
df.dropna(inplace=True)
df.to_csv("./data/df.csv")
df.shape # 1856 observations
(1856, 77)
I will keep the original features and use regularization methods in the next section.
I prefer the Lasso (L1) as a shrinkage method to narrow down the most important features, since our baseline model was already very accurate.
y_test = df[df["season"]>=2015]["final_four"]
y_train = df[df["season"]<2015]["final_four"]
X_train = df.query("season<2015").loc[:,"season":"ft%_against"]
X_test = df.query("season>=2015").loc[:,"season":"ft%_against"]
print(X_train.shape) #80%
print(X_test.shape) #20%
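Note that this is a temporal split rather than a random one: seasons before 2015 train the model and 2015 onward tests it, so no future season leaks into training. The same pattern on a toy frame (hypothetical rows):

```python
import pandas as pd

toy = pd.DataFrame({"season": [2013, 2014, 2015, 2016],
                    "seed": [1, 4, 2, 8],
                    "final_four": [1, 0, 1, 0]})
train = toy.query("season < 2015")
test = toy.query("season >= 2015")
print(len(train), len(test))  # 2 2
```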
# Instantiate model (the liblinear solver supports the L1 penalty)
logreg_lasso_1 = LogisticRegression(penalty='l1', C=1, solver='liblinear')
# Fit model.
logreg_lasso_1.fit(X_train, y_train)
print(f'Logistic Regression Intercept: {logreg_lasso_1.intercept_}')
print(f'Logistic Regression Coefficient: {logreg_lasso_1.coef_}')
print("\n")
# Generate prediction values
print(f'Number of teams making it to final four:{sum(logreg_lasso_1.predict(X_train))}')
print("\n")
# Generate predictions/probabilities
print(logreg_lasso_1.predict_proba(X_train)[:,1])
print("\n")
(1492, 76)
(364, 76)
Logistic Regression Intercept: [0.]
Logistic Regression Coefficient: [[-0.0012153 -0.6837633 0.00072333 0. -2.44191693 2.48408599
-0.964152 0. -0.31075531 0. 0. 0.80899047
-1.47521544 0. 6.28137961 0.93520549 5.1251736 0.
0. 0. 0. 0. 1.91588091 -2.56345391
0. 0. 0. -2.5507449 0. 0.24535897
0. 0. 0. 0. 0. 0.
-3.40287112 0.09635222 0.03212444 0.15745544 0. -1.68313285
1.38608677 -0.33650906 0.20264111 0.48025645 -0.49238313 0.19572923
-0.4448577 -0.24921486 0.67154919 0.40774297 -0.13740773 -0.16521916
0.01779028 -0.45970706 0.19248742 0.21530272 -0.02207059 -0.01640808
0.23174252 0.26819881 -0.35574917 0. -0.47109025 0.24791855
0. -2.02168244 0.44660712 0. 0. 0.
0. 0. 0. 0. ]]
Number of teams making it to final four:832.0
[0.45132789 0.8439776 0.88002624 ... 0.97316549 0.7337679 0.79253303]
lasso_coef = pd.DataFrame(list(zip(X_train.columns,logreg_lasso_1.coef_.T[:,0])))
lasso_coef.columns = ["variable","coef"]
lasso_coef["coef_abs"]=abs(lasso_coef["coef"])
HTML(lasso_coef.sort_values(by="coef_abs",ascending=False)[0:25].to_html(classes='table table-responsive table-striped'))
 | variable | coef | coef_abs |
---|---|---|---|
14 | conference_code_caa | 6.28 | 6.28 |
16 | conference_code_horizon | 5.13 | 5.13 |
36 | conference_code_wcc | -3.40 | 3.40 |
23 | conference_code_mwc | -2.56 | 2.56 |
27 | conference_code_pac_twelve | -2.55 | 2.55 |
5 | conference_code_aac | 2.48 | 2.48 |
4 | conference_code_a_ten | -2.44 | 2.44 |
67 | win_% | -2.02 | 2.02 |
22 | conference_code_mvc | 1.92 | 1.92 |
41 | 3pm_for | -1.68 | 1.68 |
12 | conference_code_big_twelve | -1.48 | 1.48 |
42 | 3pm_against | 1.39 | 1.39 |
6 | conference_code_acc | -0.96 | 0.96 |
15 | conference_code_cusa | 0.94 | 0.94 |
11 | conference_code_big_ten | 0.81 | 0.81 |
1 | seed | -0.68 | 0.68 |
50 | fta_against | 0.67 | 0.67 |
46 | 3pa_against | -0.49 | 0.49 |
45 | 3pa_for | 0.48 | 0.48 |
64 | fouls_against | -0.47 | 0.47 |
55 | assists_for | -0.46 | 0.46 |
68 | margin_of_victory | 0.45 | 0.45 |
48 | ft_against | -0.44 | 0.44 |
51 | off_rebounds_for | 0.41 | 0.41 |
62 | turnovers_against | -0.36 | 0.36 |
Outside of the conference dummies, win_% is the most important predictor. Although the negative sign looks strange, it's probably because schools that make it to the final four actually have a lower regular season win % than schools that don't (57% avg vs 72% avg), since they face tougher competition. This supports the argument that the conference you play in carries a lot of predictive power.
Again, the strongest conferences are big_twelve, acc, and big_ten. If you play in the acc, your modeled likelihood of reaching the final four is low, because you will likely be dominated by the top-tier schools in that conference.
In terms of regular season stats, 3PM for and against is important, as the game continues to move toward shooting more threes, which has proven to be an effective strategy.
Lastly, seed is not to be overlooked: it's a good composite score of a team's strength, capturing much of its regular season performance. Each one-step increase in seed roughly halves your odds of reaching the final four. By that rule, a 3 seed has only ~12.5% (0.5^3) odds of making the final four, which is consistent with FiveThirtyEight's forecasts: Purdue, Houston, & Texas Tech had a 10-14% chance of making the Final Four.
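For context on the 0.5x figure: exponentiating a logistic regression coefficient gives an odds multiplier, so with seed's coefficient of roughly -0.68 from the table above, each one-step increase in seed multiplies the odds (not quite the probability, though the two are close when probabilities are small) of a final four berth by about 0.51:

```python
import numpy as np

seed_coef = -0.68  # seed coefficient from the lasso model above
odds_multiplier = np.exp(seed_coef)
print(round(odds_multiplier, 2))  # ~0.51 per one-step increase in seed
```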
Similar to the NBA, margin of victory is a predictive variable, here with a positive coefficient. In the NBA there is more parity, since each team plays every other team at least twice a season, so there individual team stats matter more than seed & conference.
# Score the model - accuracy
print(f' Logreg train accuracy: {cross_val_score(logreg_lasso_1, X_train, y_train, cv=5).mean()}')
print("\n")
print(f' Logreg test accuracy: {cross_val_score(logreg_lasso_1, X_test, y_test, cv=5).mean()}')
print("\n")
# Area under the curve
print(f' Area under the curve: {roc_auc_score(y_test, logreg_lasso_1.predict_proba(X_test)[:,1])}')
Logreg train accuracy: 0.8981365433947868
Logreg test accuracy: 0.9203453453453454
Area under the curve: 0.8046269379844961
Metric | Baseline | Logistic Regression, Lasso = 1 |
---|---|---|
Train Accuracy | 94% | 90% |
Test Accuracy | 96% | 92% |
Area under the curve | 77% | 80% |
Even with fewer variables, this model is arguably as performant as the original. It does not overfit, and even improves the AUC score by 3%.
# create data frame of true values and predicted probabilities on test set
pred_proba = [i[1] for i in logreg_lasso_1.predict_proba(X_test)]
pred_df = pd.DataFrame({'true_values': y_test,
'pred_probs':pred_proba})
# Create figure.
plt.figure(figsize = (8,8))
# Create threshold values.
thresholds = np.linspace(0, 1, 200)
# Define function to calculate sensitivity (true positive rate).
def TPR(df, true_col, pred_prob_col, threshold):
    true_positive = df[(df[true_col] == 1) & (df[pred_prob_col] >= threshold)].shape[0]
    false_negative = df[(df[true_col] == 1) & (df[pred_prob_col] < threshold)].shape[0]
    return true_positive / (true_positive + false_negative)

# Define function to calculate 1 - specificity (false positive rate).
# Use the same >= threshold cutoff as TPR so the boundary case is handled consistently.
def FPR(df, true_col, pred_prob_col, threshold):
    true_negative = df[(df[true_col] == 0) & (df[pred_prob_col] < threshold)].shape[0]
    false_positive = df[(df[true_col] == 0) & (df[pred_prob_col] >= threshold)].shape[0]
    return 1 - (true_negative / (true_negative + false_positive))
# Calculate sensitivity & 1-specificity for each threshold between 0 and 1.
tpr_values = [TPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]
fpr_values = [FPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]
# Plot ROC curve.
plt.plot(fpr_values, # False Positive Rate on X-axis
tpr_values, # True Positive Rate on Y-axis
label='ROC Curve')
# Plot baseline. (Perfect overlap between the two populations.)
plt.plot(np.linspace(0, 1, 200),
np.linspace(0, 1, 200),
label='baseline',
linestyle='--')
# Label axes.
plt.title(f'ROC Curve with AUC = {round(roc_auc_score(pred_df["true_values"], pred_df["pred_probs"]),2)}', fontsize=20)
plt.ylabel('Sensitivity', fontsize=15)
plt.xlabel('1 - Specificity', fontsize=15)
# Create legend.
plt.legend(fontsize=16);
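The hand-rolled TPR/FPR sweep above can be cross-checked against `sklearn.metrics.roc_curve`, which computes the same pairs at every distinct predicted probability. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
fpr_values, tpr_values, thresholds = roc_curve(y_true, y_prob)
print(auc(fpr_values, tpr_values))  # 0.75 on this toy example
```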
cm = confusion_matrix(y_test, logreg_lasso_1.predict(X_test))
cm_df = pd.DataFrame(data=cm, columns=['predicted negative', 'predicted positive'], index=['actual negative', 'actual positive'])
HTML(cm_df.to_html(classes='table table-responsive table-striped'))
 | predicted negative | predicted positive |
---|---|---|
actual negative | 167 | 25 |
actual positive | 69 | 103 |
There are more false negatives than false positives, i.e. the model more often predicts no final four for a team that actually made it. The stepped shape of the ROC curve is due to the random oversampling.
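For reference, the matrix above converts directly into the rates the ROC curve plots. At the default 0.5 threshold, sensitivity is 103 / (103 + 69) ≈ 0.60 and the false positive rate is 25 / (167 + 25) ≈ 0.13:

```python
import numpy as np

cm = np.array([[167, 25],    # actual negative: [TN, FP]
               [69, 103]])   # actual positive: [FN, TP]
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)            # true positive rate
false_positive_rate = fp / (fp + tn)
print(round(sensitivity, 2), round(false_positive_rate, 2))  # 0.6 0.13
```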
Let's fine-tune the hyperparameters (penalty type and inverse regularization strength C) with GridSearchCV to see if we can improve the model further.
log_params = {
'penalty':["l1", "l2"],
'C':list(np.linspace(0.01, 5, 10))
}
log_gridsearch = GridSearchCV(LogisticRegression(random_state=8), log_params, cv=5, verbose=1, n_jobs=2)
log_gridsearch = log_gridsearch.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 36.1s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed: 1.4min finished
print(log_gridsearch.best_score_)
log_gridsearch.best_params_
0.9175603217158177
{'C': 4.445555555555556, 'penalty': 'l1'}
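To build intuition for why C matters: C is the inverse of the regularization strength, so a small C means a heavier L1 penalty and more coefficients driven to zero. A quick sketch on synthetic data (liblinear assumed, since it supports the L1 penalty):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=8)
nonzero = {}
for C in (0.01, 5.0):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_syn, y_syn)
    nonzero[C] = int(np.sum(clf.coef_ != 0))
print(nonzero)  # the smaller C keeps fewer (or equal) nonzero coefficients
```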
The L1 penalty outperforms the L2 penalty. Let's graph the relationship between C (the inverse of the regularization strength lambda) and test accuracy.
plt.figure(figsize = (8,8))
lst_of_c = [c["C"] for c in pd.DataFrame(log_gridsearch.cv_results_)["params"]]
mean_test_scores = pd.DataFrame(log_gridsearch.cv_results_)["mean_test_score"]
plt.plot(lst_of_c,
mean_test_scores)
plt.title(f'Penalty Strength (C) vs Test Accuracy, best: {round(log_gridsearch.best_score_,3)}', fontsize=15)
plt.ylabel('Mean CV accuracy', fontsize=10);
plt.xlabel('C', fontsize=10);
# Instantiate model with the best hyperparameters found (liblinear supports L1)
logreg_best = LogisticRegression(penalty='l1', C=log_gridsearch.best_params_["C"], solver='liblinear')
# Fit model.
logreg_best.fit(X_train, y_train)
# Generate prediction values from the tuned model
yhat_train = logreg_best.predict(X_train)
yhat_test = logreg_best.predict(X_test)
print("\n")
# Generate predictions/probabilities
yhat_train_proba = logreg_best.predict_proba(X_train)[:,1]
yhat_test_proba = logreg_best.predict_proba(X_test)[:,1]
# Score the model - accuracy
print(f' Logreg train accuracy: {cross_val_score(logreg_best, X_train, y_train, cv=5).mean()}')
print("\n")
print(f' Logreg test accuracy: {cross_val_score(logreg_best, X_test, y_test, cv=5).mean()}')
print("\n")
# Area under the curve
print(f' Area under the curve: {roc_auc_score(y_test, logreg_best.predict_proba(X_test)[:,1])}')
Logreg train accuracy: 0.9168948657714342
Logreg test accuracy: 0.9341591591591591
Area under the curve: 0.786125242248062
Metric | Baseline | Logistic Regression, Lasso = 1 | Logistic Regression, Lasso = Optimal |
---|---|---|---|
Train Accuracy | 94% | 90% | 92% |
Test Accuracy | 96% | 92% | 93% |
Area under the curve | 77% | 80% | 79% |
pred_15_17 = pd.DataFrame(
{"season": X_test["season"],
"team_id": X_test["team_id"],
"final_four": y_test,
"yhat": yhat_test
}
)
results = pd.merge(pred_15_17,teams, how="left",on=["season","team_id"]).drop_duplicates().reset_index()
results.sort_values(by="final_four",ascending=False).drop("index",axis=1)
HTML(results.query('final_four==1').to_html(classes='table table-striped table-responsive'))
 | index | season | team_id | final_four | yhat | team_name | conference_code |
---|---|---|---|---|---|---|---|
6 | 155 | 2015.00 | 1277.00 | 1.00 | 0.00 | Michigan St | big_ten |
17 | 421 | 2015.00 | 1181.00 | 1.00 | 1.00 | Duke | acc |
50 | 1077 | 2015.00 | 1458.00 | 1.00 | 1.00 | Wisconsin | big_ten |
67 | 1440 | 2016.00 | 1314.00 | 1.00 | 1.00 | North Carolina | acc |
94 | 2048 | 2016.00 | 1393.00 | 1.00 | 0.00 | Syracuse | acc |
102 | 2204 | 2016.00 | 1437.00 | 1.00 | 1.00 | Villanova | big_east |
119 | 2498 | 2016.00 | 1328.00 | 1.00 | 0.00 | Oklahoma | big_twelve |
141 | 2997 | 2017.00 | 1376.00 | 1.00 | 0.00 | South Carolina | sec |
153 | 3244 | 2017.00 | 1211.00 | 1.00 | 1.00 | Gonzaga | wcc |
171 | 3556 | 2017.00 | 1332.00 | 1.00 | 0.00 | Oregon | pac_twelve |
186 | 3857 | 2017.00 | 1314.00 | 1.00 | 1.00 | North Carolina | acc |
Overall, I was able to improve on the baseline logistic model by about 2% in AUC. What impresses me is that the best Lasso model predicts with fewer variables, while maintaining interpretability (no wild transformations) and almost the same test accuracy.
In summary:
* Win_% is the most important predictor outside of the conference dummies. Its negative sign reflects that final four teams face tougher competition and so post a lower regular season win % than teams that don't make it (57% avg vs 72% avg); conference membership carries a lot of predictive power.
* The strongest conferences are big_twelve, acc, and big_ten; most teams in those conferences are held back by the few top-tier schools.
* Among regular season stats, 3PM for and against matters, as the game continues to move toward shooting more threes.
* Seed is a good composite score of team strength: each one-step increase roughly halves the odds of a final four berth, consistent with FiveThirtyEight giving Purdue, Houston, & Texas Tech a 10-14% chance.
* As in the NBA, margin of victory is predictive, here with a positive coefficient; the NBA has more parity, so there individual team stats matter more than seed & conference.
My model predicts the probability of a team reaching the final four from its regular season data, seed, and conference. I would recommend coaches develop three-point shooting, which continues to dominate in the NBA and at the college level, and focus on outperforming in the regular season to improve seeding: a better seed means weaker opponents and ultimately a better chance of reaching the final four.
Assumptions
Research Links
This was a very fun exploration, and I may revisit this project in a future blog post.
Ideas: