18 April 2019

March Madness

How likely is a team to make the Final Four of the NCAA Tournament?

Introduction

Each year, close to $4 billion is wagered on the NCAA Division I men's basketball tournament. Most of that money is wagered on brackets, where the objective is to correctly predict the winner of each game, with emphasis on the last four teams remaining (the Final Four).

In this project, my motivation is the following:

Based on a college's regular season performance and seeding information, can I predict whether or not it will reach the Final Four?

Which variables are most predictive of a team making the Final Four? As a corollary, my model will also output each team's probability of making the Final Four. Can I outperform a naive model? As even the sports pundits will tell you, since 2008 at least two No. 1 seeds have made the Final Four 53% of the time. So just by picking two No. 1 seeds, you're halfway there.

As a trusted advisor to Coach Krzyzewski (Duke) or Coach Izzo (Michigan State), how would I recommend spending time developing the team?

Or at the very least, how do I improve my 2020 bracket to make some money?


Preprocessing

Import libraries

# data wrangling
import pandas as pd
import numpy as np

# plotting 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
from IPython.display import HTML

# preprocessing & feature engineering
from sklearn.preprocessing import StandardScaler, LabelBinarizer, PolynomialFeatures
from sklearn_pandas import DataFrameMapper, CategoricalImputer, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

# modelling & evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score, confusion_matrix

# scientific notation off
np.set_printoptions(suppress=True)
pd.options.display.float_format = '{:.2f}'.format

# suppress warnings
from sklearn.exceptions import DataConversionWarning
import warnings
warnings.filterwarnings(action='ignore')

Data

The data spans 2003 to 2017 and is compiled from sports-reference.com.

The data is spread across four files:

  • regular_season.csv - gamelogs for every regular season game
  • teams.csv - team_id, names, and conferences
  • march_madness.csv - gamelogs for each NCAA tournament game
  • march_madness_seeds.csv - entry seeds for each team (W, X, Y, Z indicate the region)

Understand the data

regular = pd.read_csv("./data/ncaa_data/regular_season.csv")
HTML(regular.tail(3).to_html(classes="table table-responsive table-striped table-bordered"))
season day_in_season winning_team_id winning_team_score losing_team_id losing_team_score winning_team_field_goals winning_team_field_goals_attempted winning_team_three_points winning_team_three_points_attempted winning_team_free_throws winning_team_free_throws_attempted winning_team_offensive_rebounds winning_team_defensive_rebounds winning_team_assists winning_team_turnovers winning_team_steals winning_team_blocks winning_team_personal_fouls losing_team_field_goals losing_team_field_goals_attempted losing_team_three_points losing_team_three_points_attempted losing_team_free_throws losing_team_free_throws_attempted losing_team_offensive_rebounds losing_team_defensive_rebounds losing_team_assists losing_team_turnovers losing_team_steals losing_team_blocks losing_team_personal_fouls
76633 2017 132 1348 70 1433 63 24 54 8 20 14 19 9 27 12 6 3 7 18 21 67 4 14 17 22 23 24 8 5 4 1 16
76634 2017 132 1374 71 1153 56 26 52 10 19 9 13 7 27 14 8 2 6 15 19 61 4 24 14 18 17 22 7 7 7 1 13
76635 2017 132 1407 59 1402 53 21 60 1 17 16 19 14 19 5 5 10 3 10 20 48 6 17 7 8 9 27 10 17 1 7 18

Let's interpret the last row: 1407 = Troy, 1402 = Texas State. Troy beat Texas State 59 to 53 on day 132 of the season, which is the final Sunday of the regular season in March 2017 (technically, the 2016-17 season). Notice that the regular season ends on day 132 in every year.
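
As a quick sanity check (a supplementary line, assuming regular is loaded as above), we can confirm that every season's regular season ends on day 132:

regular.groupby("season")["day_in_season"].max()  # expect 132 for every season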

From the regular season, there are:

  • 32 variables
  • games from 2003 to 2017
  • stats per each team, team id
  • no null values, all ints
teams = pd.read_csv("./data/ncaa_data/teams.csv")
HTML(teams.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season team_id team_name conference_code conference_name
0 2014 1101 Abilene Chr southland Southland Conference
1 2015 1101 Abilene Chr southland Southland Conference
2 2016 1101 Abilene Chr southland Southland Conference
teams = teams.drop(columns=["conference_name"])

From the teams data:

  • join team name, conference code to easily interpret
  • drop conference name, it's redundant
  • conference info from 2003 to 2018
  • no null values, int and object looks good

mm = pd.read_csv("./data/ncaa_data/march_madness.csv")
HTML(mm.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season day_in_season winning_team_id winning_team_score losing_team_id losing_team_score
0 1985 136 1116 63 1234 54
1 1985 136 1120 59 1345 58
2 1985 136 1207 68 1250 43

From the march madness data:

  • 1985 to 2017 data about game logs for NCAA tournament
  • scores, winning team & losing team
  • use 1985 -> 2017 data to predict 2018
  • days 134 and 135 are the First Four, associated with the 'a' and 'b' seeds, who play to finalize the field of 64
  • e.g. 11 Wake Forest 88, 11 Kansas State 95
  • first day of round of 64 is 136
  • the last game is on day 154, which is the national championship
  • e.g. UNC beat Gonzaga on Day 154 in 2017
  • Let's drop 1985->2002 data, since I don't have regular season, seed or conference information for those years.

mm = mm[mm["season"]>=2003]
seeds = pd.read_csv("./data/ncaa_data/march_madness_seeds.csv")
HTML(seeds.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season seed team_id
0 2003 W01 1328
1 2003 W02 1448
2 2003 W03 1393
  • seed information from 2003 to 2018
  • there are 68 seeds
  • Starting in 2011, the NCAA tournament begins with 68 teams (i.e. 68 seeds), then narrows to 64.
  • The 8 lowest-seeded teams play in the 'First Four', and the winners go on to be part of the field of 64.
  • merge to use seed as a variable, join on team_id
  • seed needs to be an int; let's strip the region letter, since what we care about is the raw seed number, and treat Y16a and Y16b both as 16
seeds["seed"] = seeds["seed"].apply(lambda x: int(x[1:3]))
seed_and_names = pd.merge(seeds, teams, how="left", on=["season","team_id"]).drop_duplicates()
HTML(seed_and_names.head(3).to_html(classes="table table-responsive table-striped table-bordered"))
season seed team_id team_name conference_code
0 2003 1 1328 Oklahoma big_twelve
22 2003 2 1448 Wake Forest acc
56 2003 3 1393 Syracuse big_east
  • I need a label to indicate if the team is in the Final Four
  • e.g. in 2017, the Final Four were Gonzaga, South Carolina, Oregon and North Carolina
  • 1211, 1376, 1322, 1314 respectively
  • So, we can see that on day 154, the national championship game was played, and the winner was North Carolina
  • Moreover, the Final Four matchups were North Carolina vs Oregon and Gonzaga vs South Carolina
  • These correspond to games played on day 152
HTML(mm[mm["season"]==2017].tail().to_html(classes="table table-responsive table-striped table-bordered"))
season day_in_season winning_team_id winning_team_score losing_team_id losing_team_score final_four
2112 2017 146 1314 75 1246 73 0
2113 2017 146 1376 77 1196 70 0
2114 2017 152 1211 77 1376 73 1
2115 2017 152 1314 77 1332 76 1
2116 2017 154 1314 71 1211 65 0

We can confirm the championship game was played on day 154 and the Final Four games on day 152! Perfect, now we can flag all of these teams with final_four = 1.

We'll query the winning_team_id and losing_team_id for each of those games to get the final four teams

# 15 seasons, we should get 2*15 = 30 final four games
mm["final_four"]=mm["day_in_season"].apply(lambda x: 1 if x == 152 else 0)

Define X and y

Let's define X and y so we can more easily perform EDA, feature engineering and modeling. First, we'll need a target vector y, with all the teams in each season and whether or not they made it to the Final Four.

Let's work off the seed_and_names data frame, as it contains all 996 participating college-season entries.

seed_and_names["final_four"]=np.zeros(len(seed_and_names)) # zeros

final_four_list = (list(zip(mm.query("final_four==1").season,mm.query("final_four==1").winning_team_id))+
     list(zip(mm.query("final_four==1").season,mm.query("final_four==1").losing_team_id)))

# fill in teams with 1 if final_four team
for season, team_id in final_four_list:
    seed_and_names["final_four"]+=np.where((seed_and_names.season==season) & (seed_and_names.team_id==team_id),1,0)
HTML(seed_and_names.query("final_four==1 & season==2017").to_html(classes="table table-responsive table-striped table-bordered"))
season seed team_id team_name conference_code final_four
23045 2017 7 1376 South Carolina sec 1.0
23292 2017 1 1211 Gonzaga wcc 1.0
23604 2017 3 1332 Oregon pac_twelve 1.0
23905 2017 1 1314 North Carolina acc 1.0

Great, now we have a final_four column, which will be our target y. Next we have to aggregate the stats for each team over all 76,636 games played.

e.g. for South Carolina, we now need aggregated stats: points for, points against, etc. We aggregate opponents' stats too, so we have a sense of both offensive and defensive ability, because defense wins championships!

  • Each time a specific team won, sum up all of its points
  • Each time a specific team lost, sum up all of its points
  • Sum up points from won and lost games
  • Repeat for all the variables of interest
def stat_for(seed_and_names, stat, name):
    # Sum a team's own stat across its wins and its losses, per season.
    # Note: a team with zero losses (or zero wins) in a season yields NaN here,
    # since df_w + df_l propagates missing values (see the Wichita St note below).
    df_w = regular.pivot_table(index="winning_team_id", columns="season", values=f'winning_team_{stat}', aggfunc=np.sum)
    df_l = regular.pivot_table(index="losing_team_id", columns="season", values=f'losing_team_{stat}', aggfunc=np.sum)
    df_for = df_w + df_l
    df_for.reset_index(inplace=True)
    df_for_melt = df_for.melt(id_vars=["winning_team_id"])
    df_for_melt.columns=["team_id", "season",name]
    return pd.merge(seed_and_names, df_for_melt, how="left", on=["season","team_id"])

seed_and_names = stat_for(seed_and_names,'score','points_for')
seed_and_names = stat_for(seed_and_names,'field_goals','fg_for')
seed_and_names = stat_for(seed_and_names,'field_goals_attempted','fga_for')
seed_and_names = stat_for(seed_and_names,'three_points','3pm_for')
seed_and_names = stat_for(seed_and_names,'three_points_attempted','3pa_for')
seed_and_names = stat_for(seed_and_names,'free_throws','ft_for')
seed_and_names = stat_for(seed_and_names,'free_throws_attempted','fta_for')
seed_and_names = stat_for(seed_and_names,'offensive_rebounds','off_rebounds_for')
seed_and_names = stat_for(seed_and_names,'defensive_rebounds','def_rebounds_for')
seed_and_names = stat_for(seed_and_names,'assists','assists_for')
seed_and_names = stat_for(seed_and_names,'steals','steals_for')
seed_and_names = stat_for(seed_and_names,'blocks','blocks_for')
seed_and_names = stat_for(seed_and_names,'turnovers','turnovers_for')
seed_and_names = stat_for(seed_and_names,'personal_fouls','fouls_for')
def stat_against(seed_and_names, stat, name):
    # Sum opponents' stats against a team: in wins the opponent is the losing
    # team, in losses the opponent is the winning team.
    df_w = regular.pivot_table(index="winning_team_id", columns="season", values=f'losing_team_{stat}', aggfunc=np.sum)
    df_l = regular.pivot_table(index="losing_team_id", columns="season", values=f'winning_team_{stat}', aggfunc=np.sum)
    df_for = df_w + df_l
    df_for.reset_index(inplace=True)
    df_for_melt = df_for.melt(id_vars=["winning_team_id"])
    df_for_melt.columns=["team_id", "season",name]
    return pd.merge(seed_and_names, df_for_melt, how="left", on=["season","team_id"])

seed_and_names = stat_against(seed_and_names,'score','points_against')
seed_and_names = stat_against(seed_and_names,'field_goals','fg_against')
seed_and_names = stat_against(seed_and_names,'field_goals_attempted','fga_against')
seed_and_names = stat_against(seed_and_names,'three_points','3pm_against')
seed_and_names = stat_against(seed_and_names,'three_points_attempted','3pa_against')
seed_and_names = stat_against(seed_and_names,'free_throws','ft_against')
seed_and_names = stat_against(seed_and_names,'free_throws_attempted','fta_against')
seed_and_names = stat_against(seed_and_names,'offensive_rebounds','off_rebounds_against')
seed_and_names = stat_against(seed_and_names,'defensive_rebounds','def_rebounds_against')
seed_and_names = stat_against(seed_and_names,'assists','assists_against')
seed_and_names = stat_against(seed_and_names,'steals','steals_against')
seed_and_names = stat_against(seed_and_names,'blocks','blocks_against')
seed_and_names = stat_against(seed_and_names,'turnovers','turnovers_against')
seed_and_names = stat_against(seed_and_names,'personal_fouls','fouls_against')
regular_season_total = seed_and_names.reset_index().drop(columns=["index"])

Train Test Split

  • As always, let's train/test split (80% / 20%). Since the data is time-indexed, we will split on years.
  • There are 15 years of data, so we'll train on 12 years and test on the final 3 years (2015, 2016, 2017)
y_test = regular_season_total[regular_season_total["season"]>=2015]["final_four"]
y_train = regular_season_total[regular_season_total["season"]<2015]["final_four"]

Note, I ran into an error with Wichita St: they went undefeated in the regular season, so their losing-team pivot was empty and their aggregated stats came out as NaN. We avoid indexing issues by imputing with 0.
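
To inspect which rows are affected before imputing (a supplementary check, not required for the pipeline):

regular_season_total[regular_season_total.isnull().any(axis=1)][["season", "team_name"]]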

regular_season_total.fillna(0,inplace=True)
X = regular_season_total[['season', 'seed', 'team_id', 'team_name', 'conference_code', 'points_for', 'points_against', 'fg_for', 'fg_against',
                '3pm_for', '3pm_against', 'fga_for', 'fga_against', '3pa_for',
                '3pa_against', 'ft_for', 'ft_against', 'fta_for', 'fta_against',
                'off_rebounds_for', 'off_rebounds_against', 'def_rebounds_for',
                'def_rebounds_against', 'assists_for', 'assists_against', 'steals_for',
                'steals_against', 'blocks_for', 'blocks_against', 'turnovers_for',
                'turnovers_against', 'fouls_for', 'fouls_against']]
X_train = X.query("season<2015")
X_test = X.query("season>=2015")

Imputation

  • Data is pretty complete and already the right types, which is great. We just need to binarize the categorical conference variable, since I suspect some teams play in more competitive conferences than others.
  • Let's label qualitative variables.
  • We'll standardize after splitting.
mapper = DataFrameMapper([
    (['season'],None),
    (['seed'],None),
    (['team_id'],None),
    (['conference_code'],LabelBinarizer()),    
    (['points_for'],None),
    (['points_against'],None),
    (['fg_for'],None),
    (['fg_against'],None),
    (['3pm_for'],None), 
    (['3pm_against'],None), 
    (['fga_for'],None),
    (['fga_against'],None),
    (['3pa_for'],None), 
    (['3pa_against'],None),
    (['ft_for'],None),  
    (['ft_against'],None),
    (['fta_for'],None),
    (['fta_against'],None),
    (['off_rebounds_for'],None),
    (['off_rebounds_against'],None),
    (['def_rebounds_for'],None),
    (['def_rebounds_against'],None),
    (['assists_for'],None),
    (['assists_against'],None),
    (['steals_for'],None),
    (['steals_against'],None),
    (['blocks_for'],None),
    (['blocks_against'],None),
    (['turnovers_for'],None),
    (['turnovers_against'],None),
    (['fouls_for'],None),
    (['fouls_against'],None)
],df_out=True)
Z_train = mapper.fit_transform(X_train)
Z_test = mapper.transform(X_test) 
# Remember: never fit the mapper on the test set. If the test set contains a
# category the training set never saw (e.g. an emoji), it would get its own
# binarized column, when in fact it should be treated as something unseen.
# Fix shape problems
# list(zip(Z_train.columns,Z_test.columns)) # there are fewer conferences in 2015 -> 2018
# mid_cont, pac_ten are in Z_train but not in Z_test
HTML(teams[teams["conference_code"]=="mid_cont"].drop_duplicates(subset="season").to_html(classes="table table-responsive table-striped table-bordered"))
# mid_cont only occurred 2003->2007
# Chicago St moved from mid_cont to gwc in 2010
season team_id team_name conference_code
14628 2004 1147 Centenary mid_cont
14629 2005 1147 Centenary mid_cont
14630 2006 1147 Centenary mid_cont
14631 2007 1147 Centenary mid_cont
16279 2003 1152 Chicago St mid_cont
HTML(teams[teams["conference_code"]=="pac_ten"].drop_duplicates(subset="season").to_html(classes="table table-responsive table-striped table-bordered"))
# pac_ten only occurred 2003->2011;
# in 2012, Arizona played in pac_twelve
season team_id team_name conference_code
3601 2003 1112 Arizona pac_ten
3602 2004 1112 Arizona pac_ten
3603 2005 1112 Arizona pac_ten
3604 2006 1112 Arizona pac_ten
3605 2007 1112 Arizona pac_ten
3606 2008 1112 Arizona pac_ten
3607 2009 1112 Arizona pac_ten
3608 2010 1112 Arizona pac_ten
3609 2011 1112 Arizona pac_ten
# Add the mid_cont and pac_ten columns to the test set, filled with 0,
# so train and test have the same columns
Z_test["conference_code_pac_ten"] = 0.0
Z_test["conference_code_mid_cont"] = 0.0

Establish Benchmarks

Let's quickly spin up a naive baseline/benchmark model and get it out the door. Then, in subsequent steps, we can feature engineer and model to improve our score!

From FiveThirtyEight, 1 seeds only had a 35%-52% chance of reaching the Final Four.

Let's see how our naive model's predictions fared.

# Baseline logistic 

# Step 1: Instantiate our model.
logreg_baseline = LogisticRegression(random_state=8)

# Step 2: Fit our model.
logreg_baseline.fit(Z_train,y_train)

# Step 3 (part 1): Generate prediction values
print(f'Number of teams making it to final four: {sum(logreg_baseline.predict(Z_train))}')

# Step 3 (part 2): Generate predictions/probabilities
logreg_baseline.predict_proba(Z_train)[:,1]

# Step 4: Score the model:
print(f' Logreg train accuracy: {logreg_baseline.score(Z_train,y_train)}')
print(f' Logreg test accuracy: {logreg_baseline.score(Z_test,y_test)}')
Number of teams making it to final four: 18.0
 Logreg train accuracy: 0.9444444444444444
 Logreg test accuracy: 0.9558823529411765

Imbalanced Classes

  • An accuracy of 94% looks good, right? However, we have imbalanced classes.
  • Over 15 years: 60 Final Four teams, 936 non-Final Four teams, 996 total.
  • 1 - 60/996 ≈ 94%
  • A naive model that always predicts "no Final Four" would be just as good as this baseline logistic regression, simply by riding the majority class.
  • This accuracy is misleadingly inflated. Let's compute an AUC score to evaluate this classifier properly and measure the effect of the imbalance.
roc_auc_score(y_test, logreg_baseline.predict_proba(Z_test)[:,1])
0.7300347222222222

Since 73% > 50%, the AUC tells us this classifier is better than a 'no information' classifier. However, there may be room for improvement.

To get a better evaluation of our baseline model, let's give the minority class (Final Four) more signal via over-sampling, i.e. randomly sampling with replacement from the available minority samples. I prefer this over SMOTE (Synthetic Minority Over-sampling Technique) because I would like to keep the number of games played meaningful, and to keep the years and seeds as ints.

# I added final_four to this mapper so I can split afterward, preserving order rather than splitting randomly
mapper_all = DataFrameMapper([
    (['season'], None),
    (['seed'], None),
    (['team_id'], None),
    (['conference_code'], LabelBinarizer()),    
    (['final_four'],None),
    (['points_for'],None),
    (['points_against'],None),
    (['fg_for'],None),
    (['fg_against'],None),
    (['3pm_for'],None), 
    (['3pm_against'],None), 
    (['fga_for'],None),
    (['fga_against'],None),
    (['3pa_for'],None), 
    (['3pa_against'],None),
    (['ft_for'],None),  
    (['ft_against'],None),
    (['fta_for'],None),
    (['fta_against'],None),
    (['off_rebounds_for'],None),
    (['off_rebounds_against'],None),
    (['def_rebounds_for'],None),
    (['def_rebounds_against'],None),
    (['assists_for'],None),
    (['assists_against'],None),
    (['steals_for'],None),
    (['steals_against'],None),
    (['blocks_for'],None),
    (['blocks_against'],None),
    (['turnovers_for'],None),
    (['turnovers_against'],None),
    (['fouls_for'],None),
    (['fouls_against'],None)
],df_out=True)
Z = mapper_all.fit_transform(regular_season_total)
y = Z["final_four"]
X = Z[['season', 'seed', 'team_id', 'conference_code_a_sun',
       'conference_code_a_ten', 'conference_code_aac', 'conference_code_acc',
       'conference_code_aec', 'conference_code_big_east',
       'conference_code_big_sky', 'conference_code_big_south',
       'conference_code_big_ten', 'conference_code_big_twelve',
       'conference_code_big_west', 'conference_code_caa',
       'conference_code_cusa', 'conference_code_horizon',
       'conference_code_ivy', 'conference_code_maac', 'conference_code_mac',
       'conference_code_meac', 'conference_code_mid_cont',
       'conference_code_mvc', 'conference_code_mwc', 'conference_code_nec',
       'conference_code_ovc', 'conference_code_pac_ten',
       'conference_code_pac_twelve', 'conference_code_patriot',
       'conference_code_sec', 'conference_code_southern',
       'conference_code_southland', 'conference_code_summit',
       'conference_code_sun_belt', 'conference_code_swac',
       'conference_code_wac', 'conference_code_wcc',
       'points_for', 'points_against', 'fg_for', 'fg_against', '3pm_for',
       '3pm_against', 'fga_for', 'fga_against', '3pa_for', '3pa_against',
       'ft_for', 'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
       'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
       'assists_for', 'assists_against', 'steals_for', 'steals_against',
       'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
       'fouls_for', 'fouls_against']]

random_sampler = RandomOverSampler(random_state=8)

X_resampled, y_resampled = random_sampler.fit_resample(X, y)
print(X_resampled.shape) 
print(y_resampled.shape) # there are 936 final four rows and 936 non-final four
print(sum(y_resampled))
(1872, 65)
(1872,)
936.0

Let's recalculate accuracy and AUC

df = pd.concat([pd.DataFrame(y_resampled),pd.DataFrame(X_resampled)],axis=1)
df.fillna(0,inplace=True)
df.columns = ['final_four','season', 'seed', 'team_id', 'conference_code_a_sun',
       'conference_code_a_ten', 'conference_code_aac', 'conference_code_acc',
       'conference_code_aec', 'conference_code_big_east',
       'conference_code_big_sky', 'conference_code_big_south',
       'conference_code_big_ten', 'conference_code_big_twelve',
       'conference_code_big_west', 'conference_code_caa',
       'conference_code_cusa', 'conference_code_horizon',
       'conference_code_ivy', 'conference_code_maac', 'conference_code_mac',
       'conference_code_meac', 'conference_code_mid_cont',
       'conference_code_mvc', 'conference_code_mwc', 'conference_code_nec',
       'conference_code_ovc', 'conference_code_pac_ten',
       'conference_code_pac_twelve', 'conference_code_patriot',
       'conference_code_sec', 'conference_code_southern',
       'conference_code_southland', 'conference_code_summit',
       'conference_code_sun_belt', 'conference_code_swac',
       'conference_code_wac', 'conference_code_wcc', 'points_for',
       'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
       'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
       'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
       'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
       'assists_for', 'assists_against', 'steals_for', 'steals_against',
       'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
       'fouls_for', 'fouls_against']

# Let's split on the years again
y_test = df[df["season"]>=2015]["final_four"]
y_train = df[df["season"]<2015]["final_four"]

# 379/1872, so about a 20%/80% split, which is fine; I prefer splitting on seasons for interpretability.
# We are using 2003 -> 2014 data to predict and test 2015, 2016, 2017.
X = df.loc[:,"season":"fouls_against"]
X_test = X[X["season"]>=2015]
X_train = X[X["season"]<2015]
print(f' X_train shape {X_train.shape}')
print(f' y_train shape {y_train.shape}')
print("\n")
print(f' X_test shape {X_test.shape}')
print(f' y_test shape {y_test.shape}')
 X_train shape (1493, 65)
 y_train shape (1493,)


 X_test shape (379, 65)
 y_test shape (379,)
# Baseline logistic with balanced classes

# Step 1: Instantiate our model.
logreg_baseline = LogisticRegression(random_state=8)

# Step 2: Fit our model.
logreg_baseline.fit(X_train,y_train)

# Step 3 (part 1): Generate prediction values
print(f'Number of teams making it to final four: {sum(logreg_baseline.predict(X_train))}')

# Step 3 (part 2): Generate predictions/probabilities
logreg_baseline.predict_proba(X_train)[:,1]

# Step 4: Score the model:
print(f' Logreg train accuracy: {cross_val_score(logreg_baseline, X_train, y_train, cv=5).mean()}')
print(f' Logreg test accuracy: {cross_val_score(logreg_baseline, X_test, y_test, cv=5).mean()}')
Number of teams making it to final four: 858.0
 Logreg train accuracy: 0.8961679222548786
 Logreg test accuracy: 0.9473593073593072
roc_auc_score(y_test, logreg_baseline.predict_proba(X_test)[:,1])
0.7704155525846702
print(f' seed coeff {np.exp(-0.562391)}')
print(f' acc coeff {np.exp(-1.32)}')
 seed coeff 0.5698449344437365
 acc coeff 0.26713530196585034

The seed is very important. Remember, we need to exponentiate the coefficient to get an odds ratio: if your seed increases by 1, your odds of reaching the Final Four are multiplied by ~0.57. The individual team stats are less important; I suspect the seed already captures a lot of that information. Season and team_id clearly do not matter, as shown by their ~0 coefficients.

And which conference you play in matters (acc, big_twelve or big_ten). If you play in the acc, the model multiplies your odds of reaching the Final Four by ~0.27; this probably reflects the many acc teams that are not top tier.
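Rather than exponentiating hand-picked coefficients, we can rank the odds ratios for every feature in one go (a quick sketch using the fitted baseline from above):

odds_ratios = pd.Series(np.exp(logreg_baseline.coef_[0]), index=X_train.columns)
odds_ratios.sort_values().head(10)  # smallest odds ratios = strongest negative effects
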

HTML(pd.DataFrame(list(zip(X.columns,logreg_baseline.coef_.T[:,0]))).head(3).to_html(classes="table table-responsive table-striped table-bordered"))
0 1
0 season -0.00
1 seed -0.56
2 team_id -0.00

The AUC scores are more representative of the model's true performance. This basic logistic model was accurate 90% of the time on training data (cross-validated) and 95% on test data. It's a good benchmark, with no strong indication of over-fitting, but let's try to improve through feature engineering (advanced basketball metrics!) and regularization.

Notice how the AUC score has improved. With this model (trained on balanced classes), we are materially better than a 'no information' classifier. The score is closer to 1, so it no longer shows symptoms of class imbalance or an overwhelming number of false positives vs false negatives. We'll revisit this later in model evaluation.


Exploratory Data Analysis

Let us perform EDA to better understand our data.

Summary Statistics

Correlation and Heatmap

  • final_four is negatively correlated with seed (a higher seed number -> less likely to reach the Final Four)
  • final_four is positively correlated with fg_for (more field goals made -> more likely to reach the Final Four)

    • The same holds for blocks, assists, rebounds, and points scored
  • Some of these independent variables are highly correlated with each other, so I may not want to use all of them, to reduce over-fitting.

    • For example, points_for is correlated with points_against, so I may want to combine them into a point differential variable, and an offensive efficiency metric could combine assists_for and fg_for
quant = df[['final_four', 'season', 'seed', 'team_id', 'points_for',
       'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
       'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
       'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
       'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
       'assists_for', 'assists_against', 'steals_for', 'steals_against',
       'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
       'fouls_for', 'fouls_against']]

stats = df[['points_for','points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
       'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
       'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
       'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
       'assists_for', 'assists_against', 'steals_for', 'steals_against',
       'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
       'fouls_for', 'fouls_against']]
fig, ax = plt.subplots(figsize=(8, 8))
corr = stats.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, xticklabels = True, yticklabels = True)

[Figure: correlation heatmap of regular season stats]

HTML(quant.describe().to_html(classes='table table-responsive table-striped'))
final_four season seed team_id points_for points_against fg_for fg_against 3pm_for 3pm_against fga_for fga_against 3pa_for 3pa_against ft_for ft_against fta_for fta_against off_rebounds_for off_rebounds_against def_rebounds_for def_rebounds_against assists_for assists_against steals_for steals_against blocks_for blocks_against turnovers_for turnovers_against fouls_for fouls_against
count 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00 1872.00
mean 0.50 2009.95 6.05 1290.26 2370.96 2049.25 836.08 729.79 213.76 192.89 1802.26 1780.15 588.58 585.63 485.04 396.78 686.40 577.32 379.41 351.31 788.69 688.99 469.75 376.67 224.75 193.06 133.78 101.09 403.02 446.56 553.43 605.75
std 0.50 4.34 4.79 99.57 310.72 254.79 115.35 95.10 47.18 35.78 233.19 229.30 116.29 103.06 81.73 74.40 111.37 106.74 73.81 60.89 109.08 94.69 78.42 58.24 49.50 34.69 50.44 23.60 63.21 76.71 79.72 78.79
min 0.00 2003.00 1.00 1102.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 0.00 2006.00 2.00 1210.00 2226.00 1957.00 779.00 695.00 184.00 171.00 1708.50 1680.00 525.00 534.00 446.00 357.75 637.00 518.75 337.00 319.00 733.00 646.00 428.00 347.75 191.00 173.00 101.00 87.00 376.00 404.00 522.00 574.00
50% 0.50 2010.00 4.00 1277.00 2403.00 2061.00 838.00 733.00 212.00 192.00 1824.00 1802.00 595.00 588.00 488.00 394.50 698.00 582.00 376.00 355.00 798.00 698.00 467.00 375.00 224.00 194.00 123.00 102.00 398.00 444.00 561.00 615.00
75% 1.00 2014.00 10.00 1386.00 2552.00 2164.00 901.00 776.00 243.00 213.00 1916.00 1903.00 662.00 646.00 525.00 442.00 746.00 641.00 425.00 386.25 851.00 742.00 512.00 406.00 252.00 211.00 159.25 114.00 440.00 493.00 599.00 647.00
max 1.00 2017.00 16.00 1463.00 3016.00 2657.00 1113.00 999.00 342.00 335.00 2245.00 2248.00 923.00 899.00 696.00 648.00 1020.00 921.00 555.00 510.00 1021.00 956.00 709.00 564.00 402.00 291.00 299.00 179.00 584.00 695.00 795.00 795.00
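
To put numbers on the correlation claims above, a quick supplementary ranking (using the quant frame defined earlier):

quant.corr()["final_four"].drop("final_four").sort_values()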

While the team's season totals are useful, it may be more predictive to look at performance on a per-game basis, to see how a team wins or loses its games.

The main categorical variable is conferences. It appears to also be predictive from the baseline, representing the strength of the conference. Some conferences are generally very strong and play stronger competition than others.

  • big_ten (Michigan State, Purdue, Michigan)
  • big_twelve (Texas Tech, Kansas State)
  • acc (Duke, UNC, Virginia)

Note that big_east, acc, big_ten, sec, and big_twelve all have 5+ teams that have made the Final Four. These are the conferences with the best teams.

HTML(pd.DataFrame(regular_season_total.groupby("conference_code")["final_four"].sum()).sort_values(by="final_four",ascending=False)[0:5].to_html(classes='table table-responsive table-striped'))
final_four
conference_code
big_east 11.00
acc 10.00
big_ten 10.00
sec 9.00
big_twelve 6.00

Some other observations:

  • No seed greater than 11 has made the Final Four
  • Teams that make the Final Four score more in the regular season, make more field goals, give up fewer free throws, rebound more, block more, and foul less... all the attributes of a great basketball team
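
The seed claim is easy to verify (a quick supplementary check):

regular_season_total.query("final_four==1")["seed"].max()  # expect 11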


Feature Engineering

'Standardize' features

Rather than standardizing, let's put the stat metrics on the same scale by converting them to a per-game basis; this keeps them interpretable. First we need to count the number of games each team played.

regular["game_won"] = np.ones((regular.shape[0], 1))
df_gw = regular.groupby(["season","winning_team_id"]).count().reset_index()[["season", "winning_team_id","game_won"]]
df_gw.columns = ["season", "team_id","game_won"]
regular["game_lost"] = np.ones((regular.shape[0], 1))
df_gl = regular.groupby(["season","losing_team_id"]).count().reset_index()[["season", "losing_team_id","game_lost"]]
df_gl.columns = ["season", "team_id","game_lost"]
df = pd.merge(df,df_gl,how="left",on=['season','team_id'])
df = pd.merge(df,df_gw,how="left",on=['season','team_id'])
df["game_total"] = df["game_won"] + df["game_lost"]
df.fillna(0,inplace=True)
# Effective field goal % (eFG%)
# This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
df["efg%"]=(df["fg_for"] + 0.5*df["3pm_for"]) / df["fga_for"]
df.loc[:,'points_for':'fouls_against'] = df[['points_for', 'points_against', 'fg_for', 'fg_against', '3pm_for', '3pm_against',
 'fga_for', 'fga_against', '3pa_for', '3pa_against', 'ft_for',
 'ft_against', 'fta_for', 'fta_against', 'off_rebounds_for',
   'off_rebounds_against', 'def_rebounds_for', 'def_rebounds_against',
   'assists_for', 'assists_against', 'steals_for', 'steals_against',
   'blocks_for', 'blocks_against', 'turnovers_for', 'turnovers_against',
   'fouls_for', 'fouls_against']].apply(lambda x: x / df.game_total)

Transform and add new features

Let's use some advanced metrics from the NBA. Note, opponent percentages are kept so we can gauge defense as well.

# Now, instead of games_won and games_lost, let's combine this into win_%
df['win_%']=df["game_won"]/df["game_total"]

# +/- point differential aka margin of victory
df["margin_of_victory"]=df["points_for"]-df["points_against"]

# Assist to turnover ratio: this measures your ability to take care of possessions and pass the ball, as a team.
df["ast_to_ratio"] = df["assists_for"]/df["turnovers_for"]

# Let's convert field goals, threes and fts into %. Will reduce collinearity and also simplify interpretation.
df['fg%_for'] = df['fg_for']/df['fga_for']
df['fg%_against'] = df['fg_against']/df['fga_against']

df['3p%_for'] = df['3pm_for']/df['3pa_for']
df['3p%_against'] = df['3pm_against']/df['3pa_against']

df['ft%_for'] = df['ft_for']/df['fta_for']  # keep numerator and denominator on the same per-game scale
df['ft%_against'] = df['ft_against']/df['fta_against']
# drop games won, lost; it's redundant with win_%
df = df.drop(["game_won","game_lost"],axis=1) 
# A few rows end up with NaNs after the merges and ratio features; let's just drop them for now.
df.dropna(inplace=True)
df.to_csv("./data/df.csv")
df.shape # 1856 observations
(1856, 77)

I will keep the original features and use regularization methods in the next section.


Modelling

Logistic Regression with Regularization

L1: Lasso

I prefer Lasso as a shrinkage method to narrow down the most important features, since our baseline model was already very accurate.

y_test = df[df["season"]>=2015]["final_four"]
y_train = df[df["season"]<2015]["final_four"]
X_train = df.query("season<2015").loc[:,"season":"ft%_against"]
X_test = df.query("season>=2015").loc[:,"season":"ft%_against"]

print(X_train.shape) #80%
print(X_test.shape) #20%

# Instantiate Model
logreg_lasso_1 = LogisticRegression(penalty='l1', C=1)

# Fit model.
logreg_lasso_1.fit(X_train, y_train)

print(f'Logistic Regression Intercept: {logreg_lasso_1.intercept_}')
print(f'Logistic Regression Coefficient: {logreg_lasso_1.coef_}')
print("\n")

# Generate prediction values
print(f'Number of teams making it to final four:{sum(logreg_lasso_1.predict(X_train))}')
print("\n")

# Generate predictions/probabilities
print(logreg_lasso_1.predict_proba(X_train)[:,1])
print("\n")
(1492, 76)
(364, 76)
Logistic Regression Intercept: [0.]
Logistic Regression Coefficient: [[-0.0012153  -0.6837633   0.00072333  0.         -2.44191693  2.48408599
  -0.964152    0.         -0.31075531  0.          0.          0.80899047
  -1.47521544  0.          6.28137961  0.93520549  5.1251736   0.
   0.          0.          0.          0.          1.91588091 -2.56345391
   0.          0.          0.         -2.5507449   0.          0.24535897
   0.          0.          0.          0.          0.          0.
  -3.40287112  0.09635222  0.03212444  0.15745544  0.         -1.68313285
   1.38608677 -0.33650906  0.20264111  0.48025645 -0.49238313  0.19572923
  -0.4448577  -0.24921486  0.67154919  0.40774297 -0.13740773 -0.16521916
   0.01779028 -0.45970706  0.19248742  0.21530272 -0.02207059 -0.01640808
   0.23174252  0.26819881 -0.35574917  0.         -0.47109025  0.24791855
   0.         -2.02168244  0.44660712  0.          0.          0.
   0.          0.          0.          0.        ]]


Number of teams making it to final four:832.0


[0.45132789 0.8439776  0.88002624 ... 0.97316549 0.7337679  0.79253303]

Interpretation of Coefficients

lasso_coef = pd.DataFrame(list(zip(X_train.columns,logreg_lasso_1.coef_.T[:,0])))
lasso_coef.columns = ["variable","coef"]
lasso_coef["coef_abs"]=abs(lasso_coef["coef"])
HTML(lasso_coef.sort_values(by="coef_abs",ascending=False)[0:25].to_html(classes='table table-responsive table-striped'))
variable coef coef_abs
14 conference_code_caa 6.28 6.28
16 conference_code_horizon 5.13 5.13
36 conference_code_wcc -3.40 3.40
23 conference_code_mwc -2.56 2.56
27 conference_code_pac_twelve -2.55 2.55
5 conference_code_aac 2.48 2.48
4 conference_code_a_ten -2.44 2.44
67 win_% -2.02 2.02
22 conference_code_mvc 1.92 1.92
41 3pm_for -1.68 1.68
12 conference_code_big_twelve -1.48 1.48
42 3pm_against 1.39 1.39
6 conference_code_acc -0.96 0.96
15 conference_code_cusa 0.94 0.94
11 conference_code_big_ten 0.81 0.81
1 seed -0.68 0.68
50 fta_against 0.67 0.67
46 3pa_against -0.49 0.49
45 3pa_for 0.48 0.48
64 fouls_against -0.47 0.47
55 assists_for -0.46 0.46
68 margin_of_victory 0.45 0.45
48 ft_against -0.44 0.44
51 off_rebounds_for 0.41 0.41
62 turnovers_against -0.36 0.36
  • Outside of conferences, win_% is the most important predictor. It's strange to see a negative sign; this is probably because schools that make it to the Final Four actually have a lower regular season win % than schools that do not, since they face tougher competition (57% avg vs 72% avg). This supports the argument that the conference you play in carries a lot of predictive power.

  • Again, the strongest conferences are the big_twelve, acc, and big_ten. If you play in the acc, your likelihood of reaching the Final Four is very low, because you will likely be dominated by the top tier schools in those conferences.

  • In terms of regular season stats, 3PM for and against is important, as the game continues to move toward shooting more threes; it's proven to be an effective strategy.

  • Lastly, seed is not to be overlooked: it's a good composite score of a team's strength, capturing a lot of regular season performance. As your seed increases by 1, your odds of reaching the Final Four are roughly halved (see the quick computation after this list). By that logic a 3 seed has only a ~12.5% (0.5^3) chance of making the Final Four! This is consistent with FiveThirtyEight's forecasts: Purdue, Houston, & Texas Tech had a 10-14% chance of making the Final Four.

  • Similar to the NBA, margin of victory is a predictive variable though with a positive correlation. In the NBA, there is more parity since each team plays one another at least twice in a season, so individual team stats are more important than seed & conference.
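
For the seed claim specifically, the multiplier comes straight from the Lasso coefficient (a quick sketch using the lasso_coef frame above):

seed_coef = lasso_coef.set_index("variable").loc["seed", "coef"]
print(np.exp(seed_coef))  # ~0.50: each additional seed line roughly halves the odds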


Model Evaluation

# Score the model - accuracy
print(f' Logreg train accuracy: {cross_val_score(logreg_lasso_1, X_train, y_train, cv=5).mean()}')
print("\n")

print(f' Logreg test accuracy: {cross_val_score(logreg_lasso_1, X_test, y_test, cv=5).mean()}')
print("\n")

# Area under the curve
print(f' Area under the curve: {roc_auc_score(y_test, logreg_lasso_1.predict_proba(X_test)[:,1])}')
 Logreg train accuracy: 0.8981365433947868


 Logreg test accuracy: 0.9203453453453454


 Area under the curve: 0.8046269379844961
Metric                 Baseline   Logistic Regression, Lasso = 1
Train Accuracy         94%        90%
Test Accuracy          96%        92%
Area under the curve   77%        80%

Even with fewer variables, this model is arguably as performant as the original. It does not overfit, and even improves the AUC score by 3%.

# create data frame of true values and predicted probabilities on test set
pred_proba = [i[1] for i in logreg_lasso_1.predict_proba(X_test)]

pred_df = pd.DataFrame({'true_values': y_test,
                        'pred_probs':pred_proba})
# Create figure.
plt.figure(figsize = (8,8))

# Create threshold values.
thresholds = np.linspace(0, 1, 200)

# Define function to calculate sensitivity. (True positive rate.)
def TPR(df, true_col, pred_prob_col, threshold):
    true_positive = df[(df[true_col] == 1) & (df[pred_prob_col] >= threshold)].shape[0]
    false_negative = df[(df[true_col] == 1) & (df[pred_prob_col] < threshold)].shape[0]
    return true_positive / (true_positive + false_negative)


# Define function to calculate 1 - specificity. (False positive rate.)
def FPR(df, true_col, pred_prob_col, threshold):
    true_negative = df[(df[true_col] == 0) & (df[pred_prob_col] <= threshold)].shape[0]
    false_positive = df[(df[true_col] == 0) & (df[pred_prob_col] > threshold)].shape[0]
    return 1 - (true_negative / (true_negative + false_positive))

# Calculate sensitivity & 1-specificity for each threshold between 0 and 1.
tpr_values = [TPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]
fpr_values = [FPR(pred_df, 'true_values', 'pred_probs', prob) for prob in thresholds]

# Plot ROC curve.
plt.plot(fpr_values, # False Positive Rate on X-axis
         tpr_values, # True Positive Rate on Y-axis
         label='ROC Curve')

# Plot baseline. (Perfect overlap between the two populations.)
plt.plot(np.linspace(0, 1, 200),
         np.linspace(0, 1, 200),
         label='baseline',
         linestyle='--')

# Label axes.
plt.title(f'ROC Curve with AUC = {round(roc_auc_score(pred_df["true_values"], pred_df["pred_probs"]),2)}', fontsize=20)
plt.ylabel('Sensitivity', fontsize=15)
plt.xlabel('1 - Specificity', fontsize=15)

# Create legend.
plt.legend(fontsize=16);

[Figure: ROC curve for the Lasso model, AUC = 0.80]
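
For reference, sklearn can compute the same curve points directly (equivalent to the manual TPR/FPR loop above, assuming pred_df as built earlier):

from sklearn.metrics import roc_curve
fpr, tpr, thresh = roc_curve(pred_df["true_values"], pred_df["pred_probs"])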

cm = confusion_matrix(y_test, logreg_lasso_1.predict(X_test))
cm_df = pd.DataFrame(data=cm, columns=['predicted negative', 'predicted positive'], index=['actual negative', 'actual positive'])
HTML(cm_df.to_html(classes='table table-responsive table-striped'))
predicted negative predicted positive
actual negative 167 25
actual positive 69 103

There are more false negatives than false positives (predicting a team misses the Final Four when it actually made it). The shape of the ROC curve is an artifact of the random over-sampling.
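
To break this trade-off down further, a quick supplementary look at per-class precision and recall:

from sklearn.metrics import classification_report
print(classification_report(y_test, logreg_lasso_1.predict(X_test)))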

Hyperparameter Selection

Let's fine-tune the regularization strength C with GridSearchCV to see if we can improve our model further.

log_params = {
    'penalty':["l1", "l2"],
    'C':list(np.linspace(0.01, 5, 10))
}

log_gridsearch = GridSearchCV(LogisticRegression(random_state=8), log_params, cv=5, verbose=1, n_jobs=2)

log_gridsearch = log_gridsearch.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   36.1s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:  1.4min finished
print(log_gridsearch.best_score_)
log_gridsearch.best_params_
0.9175603217158177





{'C': 4.445555555555556, 'penalty': 'l1'}

The L1 penalty outperforms the L2 penalty. Let's graph the relationship between C (the inverse of the regularization strength lambda) and cross-validated accuracy.

plt.figure(figsize = (8,8))

lst_of_c = [c["C"] for c in pd.DataFrame(log_gridsearch.cv_results_)["params"]]
mean_test_scores = pd.DataFrame(log_gridsearch.cv_results_)["mean_test_score"]
plt.plot(lst_of_c, 
         mean_test_scores)
plt.title(f'Penalty Strength vs Test Accuracy: {round(log_gridsearch.best_score_,3)}', fontsize=15)
plt.xlabel('C', fontsize=10);
plt.ylabel('Mean CV accuracy', fontsize=10);

[Figure: penalty strength C vs mean cross-validated accuracy]


Best Model

# Instantiate Model
logreg_best = LogisticRegression(penalty='l1', C=log_gridsearch.best_params_["C"])

# Fit model.
logreg_best.fit(X_train, y_train)

# Generate prediction values
yhat_train = logreg_best.predict(X_train)
yhat_test = logreg_best.predict(X_test)
print("\n")

# Generate predictions/probabilities
yhat_train_proba = logreg_best.predict_proba(X_train)[:,1]
yhat_test_proba = logreg_best.predict_proba(X_test)[:,1]

# Score the model - accuracy
print(f' Logreg train accuracy: {cross_val_score(logreg_best, X_train, y_train, cv=5).mean()}')
print("\n")

print(f' Logreg test accuracy: {cross_val_score(logreg_best, X_test, y_test, cv=5).mean()}')
print("\n")

# Area under the curve
print(f' Area under the curve: {roc_auc_score(y_test, logreg_best.predict_proba(X_test)[:,1])}')
 Logreg train accuracy: 0.9168948657714342


 Logreg test accuracy: 0.9341591591591591


 Area under the curve: 0.786125242248062
Metric                 Baseline   Logistic Regression, Lasso = 1   Logistic Regression, Lasso = Optimal
Train Accuracy         94%        90%                              92%
Test Accuracy          96%        92%                              93%
Area under the curve   77%        80%                              79%
pred_15_17 = pd.DataFrame(
    {"season": X_test["season"],
     "team_id": X_test["team_id"],
     "final_four": y_test,
     "yhat": yhat_test
    }
)
results = pd.merge(pred_15_17,teams, how="left",on=["season","team_id"]).drop_duplicates().reset_index()
results.sort_values(by="final_four",ascending=False).drop("index",axis=1)
HTML(results.query('final_four==1').to_html(classes='table table-striped table-responsive'))
index season team_id final_four yhat team_name conference_code
6 155 2015.00 1277.00 1.00 0.00 Michigan St big_ten
17 421 2015.00 1181.00 1.00 1.00 Duke acc
50 1077 2015.00 1458.00 1.00 1.00 Wisconsin big_ten
67 1440 2016.00 1314.00 1.00 1.00 North Carolina acc
94 2048 2016.00 1393.00 1.00 0.00 Syracuse acc
102 2204 2016.00 1437.00 1.00 1.00 Villanova big_east
119 2498 2016.00 1328.00 1.00 0.00 Oklahoma big_twelve
141 2997 2017.00 1376.00 1.00 0.00 South Carolina sec
153 3244 2017.00 1211.00 1.00 1.00 Gonzaga wcc
171 3556 2017.00 1332.00 1.00 0.00 Oregon pac_twelve
186 3857 2017.00 1314.00 1.00 1.00 North Carolina acc

Results

Overall, I was able to improve on the baseline logistic model's AUC by 2-3 percentage points. What impresses me is that the best model via Lasso predicts with fewer variables, while also maintaining interpretability (no wild transformations), at almost the same level of test accuracy.

In summary:

  • win_% is the most important predictor. It's strange to see a negative sign; this is probably because schools that make it to the Final Four actually have a lower regular season win % than schools that do not, since they face tougher competition (57% avg vs 72% avg). This supports the argument that the conference you play in carries a lot of predictive power.

  • Again, the strongest conferences are the big_twelve, acc, and big_ten. If you play in the acc, your likelihood of reaching the Final Four is very low, because you will likely be dominated by the top tier schools in those conferences.

  • In terms of regular season stats, 3PM for and against is important, as the game continues to move toward shooting more threes; it's proven to be an effective strategy.

  • Lastly, seed is not to be overlooked: it's a good composite score of a team's strength, capturing a lot of regular season performance. As your seed increases by 1, your odds of reaching the Final Four are roughly halved. This makes sense: a 3 seed has only a ~12.5% (0.5^3) chance of making the Final Four! This is consistent with FiveThirtyEight's forecasts: Purdue, Houston, & Texas Tech had a 10-14% chance of making the Final Four.

  • Similar to the NBA, margin of victory is a predictive variable though with a positive correlation. In the NBA, there is more parity since each team plays one another at least twice in a season, so individual team stats are more important than seed & conference.

My model predicts the probability of a team reaching the Final Four, primarily based on its regular season data, seed, and conference. I would recommend that coaches develop the three ball, as it continues to dominate in the NBA and at the college level, and that teams outperform in the regular season to improve their seeding, which ultimately improves their chances of reaching the Final Four by way of playing weaker opponents.

Assumptions

  • In this model, there are things we cannot predict given the absence of data: injuries, player-specific data, draft prospects, travel & rest between games (are players well rested?), coaches' track records, and the momentum of a team firing on all cylinders (last 10 games).
  • In this example, there isn't a penalty on false positives vs false negatives; wrong is wrong. Predicting a loser who won isn't worse than predicting a winner who lost. There may be implications in gambling, but that's out of scope.
Research Links

  • https://projects.fivethirtyeight.com/2019-march-madness-predictions/


Next Steps

This was a very fun exploration, and I plan to revisit this project in a future blog post.

Ideas:

  • Simulating the winner between two teams (or at the very least, the Final Four), using a Poisson model for points scored (see the sketch below)
  • Logistic vs Non Parametric (KNN, SVM, random forests)
  • Scrape 2018 data, run the model, and add graphs (with SMOTE, 2015->2017 data is obfuscated)
  • Feature engineer strength of schedule (they have this on sports reference)
  • ELO
  • No. of previous appearances in history
  • Confidence intervals of predicted probabilities
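
For the first idea, a minimal sketch of the Poisson simulation (the scoring rates below are hypothetical placeholders; a real version would estimate them from each team's per-game points_for and points_against):

np.random.seed(8)

def p_team_a_wins(rate_a, rate_b, n_sims=100000):
    # Draw independent Poisson-distributed final scores for each team
    a = np.random.poisson(rate_a, n_sims)
    b = np.random.poisson(rate_b, n_sims)
    # Count A's wins; split simulated ties evenly (college games have no ties)
    return np.mean(a > b) + 0.5 * np.mean(a == b)

p_team_a_wins(78.5, 71.0)  # hypothetical per-game scoring rates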