NCAA Basketball Tutorial

SportFlow Running Time: Approximately 15 minutes

[Figure: NCAAB ROC Curve]

In this tutorial, we use machine learning to predict whether or not an NCAA Men’s Basketball team will cover the spread. The spread is set by Las Vegas bookmakers to balance the betting; it is a way of giving points to the underdog to encourage bets on both sides.

SportFlow starts with the basic game data and derives time series features based on streaks and runs (not baseball runs). In the table below, the game data includes both line and over_under information consolidated from various sports websites. For example, a line of -9 means the home team is favored by 9 points, and a line of +3 means the away team is favored by 3 points; the line is always relative to the home team. The over_under is the predicted total score for the game, with a bet being placed on whether or not the final total will be over or under that amount.

NCAA Basketball Data

season  date        away.team  away.score  home.team  home.score  line   over_under
2015    2015-11-13  COLO       62          ISU        68          -10    151
2015    2015-11-13  SDAK       69          WRST       77          -6.5   136
2015    2015-11-13  WAG        57          SJU        66          -5.5   142
2015    2015-11-13  JVST       83          CMU        89          -18    142.5
2015    2015-11-13  NIAG       50          ODU        67          -18    132
2015    2015-11-13  ALBY       65          UK         78          -20    132.5
2015    2015-11-13  TEM        67          UNC        91          -9.5   145
2015    2015-11-13  NKU        61          WVU        107         -23.5  147.5
2015    2015-11-13  SIE        74          DUKE       92          -24    155
2015    2015-11-13  WCU        72          CIN        97          -20    132
2015    2015-11-13  MSM        56          MD         80          -21.5  140
2015    2015-11-13  CHAT       92          UGA        90          -10.5  136
2015    2015-11-13  SEMO       53          DAY        84          -19    140
2015    2015-11-13  DART       67          HALL       84          -11    136
2015    2015-11-13  CAN        85          HOF        96          -10.5  150.5
2015    2015-11-13  JMU        87          RICH       75          -9     137.5
2015    2015-11-13  EIU        49          IND        88          -25    150
2015    2015-11-13  FAU        55          MSU        82          -23.5  141
2015    2015-11-13  SAM        45          LOU        86          -23    142
2015    2015-11-13  MIOH       72          XAV        81          -15.5  144.5
2015    2015-11-13  PRIN       64          RID        56          1      137
2015    2015-11-13  IUPU       72          INST       70          -8     135.5
2015    2015-11-13  SAC        66          ASU        63          -18    144
2015    2015-11-13  AFA        75          SIU        77          -5.5   131
2015    2015-11-13  UNCO       72          KU         109         -29    147.5
2015    2015-11-13  BALL       53          BRAD       54          3      135
2015    2015-11-13  USD        45          USC        83          -12.5  140
2015    2015-11-13  UTM        57          OKST       91          -12    141.5
2015    2015-11-13  COR        81          GT         116         -17    130
2015    2015-11-13  MOST       65          ORU        80          -4.5   133.5
2015    2015-11-13  DREX       81          JOES       82          -9.5   127.5
2015    2015-11-13  WMRY       85          NCST       68          -12.5  149
2015    2015-11-13  SF         78          UIC        75          1.5    148.5
2015    2015-11-13  PEAY       41          VAN        80          -24.5  144
2015    2015-11-13  CSN        71          NIU        83          -9.5   134.5
2015    2015-11-13  UCSB       60          OMA        59          -2.5   157.5
2015    2015-11-13  UTSA       64          LOYI       76          -14    138.5
2015    2015-11-13  BRWN       65          SPU        77          -2     130.5
2015    2015-11-13  NAU        70          WSU        82          -10.5  145
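To make the line and over_under columns concrete, here is a minimal sketch in Python (not SportFlow code) of how the two betting outcomes for a single game follow from the table above; the won_on_spread target used later in the tutorial corresponds to the first value returned.

def game_outcomes(home_score, away_score, line, over_under):
    """Return (home_covered, went_over) for one game.

    The line is relative to the home team: a negative line means the home
    team is favored, and the home team covers when its winning margin
    exceeds the size of the line, i.e. when margin + line > 0.
    """
    margin = home_score - away_score
    home_covered = margin + line > 0
    went_over = (home_score + away_score) > over_under
    return home_covered, went_over

# First row above: COLO 62 at ISU 68, line -10, over_under 151.
# ISU won by 6 but was favored by 10, and the 130-point total stayed under 151.
print(game_outcomes(68, 62, -10, 151))   # (False, False)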

Step 1: From the examples directory, change to the NCAAB project directory:

cd NCAAB

Before running SportFlow, let’s briefly review the configuration files in the config directory:

sport.yml:
    The SportFlow configuration file

model.yml:
    The AlphaPy configuration file

In sport.yml, the points_max, points_min, and random_scoring items control random scoring, which we will not be doing here. With an empty seasons list, we build the model from all available seasons, and the rolling_window of 3 is used to calculate short-term streaks.

sport.yml
sport:
    league          : NCAAB
    points_max      : 100
    points_min      : 50
    random_scoring  : False
    seasons         : []
    rolling_window  : 3
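For intuition, here is a rough pandas sketch (not SportFlow's internal code) of how a short-term form feature over a rolling_window of 3 games could be computed; the covered series below is invented for illustration.

import pandas as pd

# Hypothetical game log for one team, ordered by date; True means the
# team covered the spread in that game.
covered = pd.Series([True, True, False, True, True, True], name="covered")

# Covers over the last 3 games -- the rolling_window from sport.yml --
# one simple way to express short-term form as a numeric feature.
covers_last_3 = covered.astype(int).rolling(window=3, min_periods=1).sum()
print(covers_last_3.tolist())   # [1.0, 2.0, 2.0, 2.0, 2.0, 3.0]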

In each of the tutorials, we experiment with different options in model.yml for running AlphaPy. Here, we will run a random forest classifier using Recursive Feature Elimination with Cross-Validation (RFECV), and then an XGBoost classifier (see the scikit-learn sketch after the model.yml listing below). We will also perform a random grid search, which increases the total running time to approximately 15 minutes. You can get in some two-ball dribbling while waiting for SportFlow to finish.

In the features section, we identify the factors generated by SportFlow. For example, we want to treat the various streaks as factors. Other options are interactions, standard scaling, and a threshold for removing low-variance features.

Our target variable is won_on_spread, a Boolean indicator of whether or not the home team covered the spread. This is what we are trying to predict.

model.yml
project:
    directory         : .
    file_extension    : csv
    submission_file   : 
    submit_probas     : False

data:
    drop              : ['Unnamed: 0', 'index', 'season', 'date', 'home.team', 'away.team',
                         'home.score', 'away.score', 'total_points', 'point_margin_game',
                         'won_on_points', 'lost_on_points', 'cover_margin_game',
                         'lost_on_spread', 'overunder_margin', 'over', 'under']
    features          : '*'
    sampling          :
        option        : False
        method        : under_random
        ratio         : 0.0
    sentinel          : -1
    separator         : ','
    shuffle           : False
    split             : 0.4
    target            : won_on_spread
    target_value      : True

model:
    algorithms        : ['RF', 'XGB']
    balance_classes   : False
    calibration       :
        option        : False
        type          : isotonic
    cv_folds          : 3
    estimators        : 201
    feature_selection :
        option        : False
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : True
        iterations    : 50
        random        : True
        subsample     : False
        sampling_pct  : 0.25
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 5
    scoring_function  : 'roc_auc'
    type              : classification

features:
    clustering        :
        option        : False
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : False
    encoding          :
        rounding      : 3
        type          : factorize
    factors           : ['line', 'delta.wins', 'delta.losses', 'delta.ties',
                         'delta.point_win_streak', 'delta.point_loss_streak',
                         'delta.cover_win_streak', 'delta.cover_loss_streak',
                         'delta.over_streak', 'delta.under_streak']
    interactions      :
        option        : True
        poly_degree   : 2
        sampling_pct  : 5
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : False
    pca               :
        option        : False
        increment     : 3
        maximum       : 15
        minimum       : 3
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 1
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1

pipeline:
    number_jobs       : -1
    seed              : 13201
    verbosity         : 0

plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

xgboost:
    stopping_rounds   : 30
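The rfe and grid_search options above roughly map to the following scikit-learn building blocks. This is only a hedged sketch of the underlying machinery on synthetic data, not the AlphaPy pipeline itself; the hyperparameter grid is invented for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the SportFlow feature matrix and won_on_spread target.
X, y = make_classification(n_samples=500, n_features=20, random_state=13201)

# Recursive Feature Elimination with Cross-Validation, mirroring
# rfe: step 5 and cv_folds: 3 from model.yml.
selector = RFECV(
    RandomForestClassifier(n_estimators=201, random_state=13201),
    step=5, cv=3, scoring="roc_auc",
)
X_selected = selector.fit_transform(X, y)

# Random grid search (grid_search: random in model.yml uses 50 iterations;
# the grid here is made up and kept small so the sketch runs quickly).
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=13201),
    param_distributions={"n_estimators": [101, 201], "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, scoring="roc_auc", random_state=13201,
)
search.fit(X_selected, y)
print(search.best_params_, round(search.best_score_, 3))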

Step 2: Now, let’s run SportFlow:

sflow --pdate 2016-03-01

As sflow runs, you will see the progress of the workflow, and the logging output is saved in sport_flow.log. When the workflow completes, your project structure will look like this, with a different datestamp:

NCAAB
├── sport_flow.log
├── config
│   ├── algos.yml
│   ├── sport.yml
│   └── model.yml
├── data
│   └── ncaab_game_scores_1g.csv
├── input
│   ├── test.csv
│   └── train.csv
├── model
│   ├── feature_map_20170427.pkl
│   └── model_20170427.pkl
├── output
│   ├── predictions_20170427.csv
│   ├── probabilities_20170427.csv
│   └── rankings_20170427.csv
└── plots
    ├── calibration_test.png
    ├── calibration_train.png
    ├── confusion_test_RF.png
    ├── confusion_test_XGB.png
    ├── confusion_train_RF.png
    ├── confusion_train_XGB.png
    ├── feature_importance_train_RF.png
    ├── feature_importance_train_XGB.png
    ├── learning_curve_train_RF.png
    ├── learning_curve_train_XGB.png
    ├── roc_curve_test.png
    └── roc_curve_train.png

Depending upon the model parameters and the prediction date, the AUC of the ROC Curve will vary between 0.54 and 0.58. This model is barely passable, but we are getting a slight edge even with our basic data. We will need more game samples to have any confidence in our predictions.

[Figure: ROC Curve]
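If you want to check the AUC yourself, something like the following works, assuming input/test.csv still carries the won_on_spread column and the last column of the probabilities file holds the positive-class probability; both are assumptions about the file layout, so adjust them to what you actually see in your project.

import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed file layouts -- adjust the column names/positions to match the
# actual SportFlow output in your project.
y_true = pd.read_csv("input/test.csv")["won_on_spread"]
y_prob = pd.read_csv("output/probabilities_20170427.csv").iloc[:, -1]

# AUC of the ROC curve on the test set; the tutorial reports roughly
# 0.54-0.58 depending on model parameters and prediction date.
print(round(roc_auc_score(y_true, y_prob), 3))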

After a model is created, we can run sflow in predict mode. Just specify the prediction date with --pdate, and SportFlow will make predictions for all rows in the predict.csv file dated on or after that date. Note that predict.csv is generated on the fly in predict mode and stored in the input directory.

Step 3: Now, let’s run SportFlow in predict mode, where all results will be stored in the output directory:

sflow --predict --pdate 2016-03-15
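For intuition only, the date filter that predict mode applies is roughly the following; predict.csv is generated by SportFlow at run time, and the 'date' column name here is an assumption based on the game data shown earlier.

import pandas as pd

# Sketch of the --pdate filter: score every game dated on or after the
# prediction date. The 'date' column name is assumed from the game data.
predict = pd.read_csv("input/predict.csv", parse_dates=["date"])
to_score = predict[predict["date"] >= "2016-03-15"]
print(len(to_score), "games to predict")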

Conclusion

Even with just one season of NCAA Men's Basketball data, our model predicts the spread with 52-54% accuracy. To attain better accuracy, we need more historical data, both in the number of games and in other types of information such as individual player statistics. If you want to become a professional bettor, then you need at least 56% winners to break the bank.
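As a quick sanity check on those percentages, here is the break-even win rate under standard -110 pricing; the -110 price is an assumption, since the tutorial does not specify the bookmaker's vig.

# Break-even win rate at -110 odds (risk 110 to win 100) -- an assumed
# price, since the tutorial does not state the vig.
break_even = 110 / 210                                  # ~0.5238
win_rate = 0.56                                         # the threshold cited above
edge_per_bet = win_rate * 100 - (1 - win_rate) * 110    # expected profit per 110 risked

print(round(break_even, 4))    # 0.5238
print(round(edge_per_bet, 2))  # 7.6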