SportFlow

Sports Pipeline

SportFlow applies machine learning algorithms to predict game outcomes for matches in any team sport. We created binary features (for classification) to determine whether or not a team will win the game or even more importantly, cover the spread. We also try to predict whether or not a game’s total points will exceed the over/under.

Of course, there are practical matters to predicting a game’s outcome. The strength of supervised learning is to improve an algorithm’s performance with lots of data. While major-league baseball has a total of 2,430 games per year, pro football has only 256 games per year. College football and basketball are somewhere in the middle of this range.

The other complication is determining whether or not a model for one sport can be used for another. The advantage is that combining sports gives us more data. The disadvantage is that each sport has unique characteristics that could make a unified model infeasible. Still, we can combine the game data to test an overall model.

Data Sources

SportFlow starts with minimal game data (lines and scores) and expands these data into temporal features such as runs and streaks for all of the features. Currently, we do not incorporate player data or other external factors, but there are some excellent open-source packages such as BurntSushi’s nflgame Python code. For its initial version, SportFlow game data must be in the format below:

NCAA Basketball Data

season

date

away.team

away.score

home.team

home.score

line

over_under

2015

2015-11-13

COLO

62

ISU

68

-10

151

2015

2015-11-13

SDAK

69

WRST

77

-6.5

136

2015

2015-11-13

WAG

57

SJU

66

-5.5

142

2015

2015-11-13

JVST

83

CMU

89

-18

142.5

2015

2015-11-13

NIAG

50

ODU

67

-18

132

2015

2015-11-13

ALBY

65

UK

78

-20

132.5

2015

2015-11-13

TEM

67

UNC

91

-9.5

145

2015

2015-11-13

NKU

61

WVU

107

-23.5

147.5

2015

2015-11-13

SIE

74

DUKE

92

-24

155

2015

2015-11-13

WCU

72

CIN

97

-20

132

2015

2015-11-13

MSM

56

MD

80

-21.5

140

2015

2015-11-13

CHAT

92

UGA

90

-10.5

136

2015

2015-11-13

SEMO

53

DAY

84

-19

140

2015

2015-11-13

DART

67

HALL

84

-11

136

2015

2015-11-13

CAN

85

HOF

96

-10.5

150.5

2015

2015-11-13

JMU

87

RICH

75

-9

137.5

2015

2015-11-13

EIU

49

IND

88

-25

150

2015

2015-11-13

FAU

55

MSU

82

-23.5

141

2015

2015-11-13

SAM

45

LOU

86

-23

142

2015

2015-11-13

MIOH

72

XAV

81

-15.5

144.5

2015

2015-11-13

PRIN

64

RID

56

1

137

2015

2015-11-13

IUPU

72

INST

70

-8

135.5

2015

2015-11-13

SAC

66

ASU

63

-18

144

2015

2015-11-13

AFA

75

SIU

77

-5.5

131

2015

2015-11-13

UNCO

72

KU

109

-29

147.5

2015

2015-11-13

BALL

53

BRAD

54

3

135

2015

2015-11-13

USD

45

USC

83

-12.5

140

2015

2015-11-13

UTM

57

OKST

91

-12

141.5

2015

2015-11-13

COR

81

GT

116

-17

130

2015

2015-11-13

MOST

65

ORU

80

-4.5

133.5

2015

2015-11-13

DREX

81

JOES

82

-9.5

127.5

2015

2015-11-13

WMRY

85

NCST

68

-12.5

149

2015

2015-11-13

SF

78

UIC

75

1.5

148.5

2015

2015-11-13

PEAY

41

VAN

80

-24.5

144

2015

2015-11-13

CSN

71

NIU

83

-9.5

134.5

2015

2015-11-13

UCSB

60

OMA

59

-2.5

157.5

2015

2015-11-13

UTSA

64

LOYI

76

-14

138.5

2015

2015-11-13

BRWN

65

SPU

77

-2

130.5

2015

2015-11-13

NAU

70

WSU

82

-10.5

145

The SportFlow logic is split-apply-combine, as the data are first split along team lines, then team statistics are calculated and applied, and finally the team data are inserted into the overall model frame.

Domain Configuration

The SportFlow configuration file is minimal. You can simulate random scoring to compare with a real model. Further, you can experiment with the rolling window for run and streak calculations.

sport.yml
sport:
    points_max      : 100
    points_min      : 50
    random_scoring  : False
    seasons         : []
    rolling_window  : 3
points_max:

Maximum number of simulated points to assign to any single team.

points_min:

Minimum number of simulated points to assign to any single team.

random_scoring:

If True, assign random point values to games [Default: False].

seasons:

The yearly list of seasons to evaluate.

rolling_window:

The period over which streaks are calculated.

Model Configuration

SportFlow runs on top of AlphaPy, so the model.yml file has the same format.

model.yml
project:
    directory         : .
    file_extension    : csv
    submission_file   : 
    submit_probas     : False

data:
    drop              : ['Unnamed: 0', 'index', 'season', 'date', 'home.team', 'away.team',
                         'home.score', 'away.score', 'total_points', 'point_margin_game',
                         'won_on_points', 'lost_on_points', 'cover_margin_game',
                         'lost_on_spread', 'overunder_margin', 'over', 'under']
    features          : '*'
    sampling          :
        option        : False
        method        : under_random
        ratio         : 0.0
    sentinel          : -1
    separator         : ','
    shuffle           : False
    split             : 0.4
    target            : won_on_spread
    target_value      : True

model:
    algorithms        : ['RF', 'XGB']
    balance_classes   : False
    calibration       :
        option        : False
        type          : isotonic
    cv_folds          : 3
    estimators        : 201
    feature_selection :
        option        : False
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : True
        iterations    : 50
        random        : True
        subsample     : False
        sampling_pct  : 0.25
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 5
    scoring_function  : 'roc_auc'
    type              : classification

features:
    clustering        :
        option        : False
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : False
    encoding          :
        rounding      : 3
        type          : factorize
    factors           : ['line', 'delta.wins', 'delta.losses', 'delta.ties',
                         'delta.point_win_streak', 'delta.point_loss_streak',
                         'delta.cover_win_streak', 'delta.cover_loss_streak',
                         'delta.over_streak', 'delta.under_streak']
    interactions      :
        option        : True
        poly_degree   : 2
        sampling_pct  : 5
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : False
    pca               :
        option        : False
        increment     : 3
        maximum       : 15
        minimum       : 3
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 1
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1

pipeline:
    number_jobs       : -1
    seed              : 13201
    verbosity         : 0

plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

xgboost:
    stopping_rounds   : 30

Creating the Model

First, change the directory to your project location, where you have already followed the Project Structure specifications:

cd path/to/project

Run this command to train a model:

sflow

Usage:

sflow [--train | --predict] [--tdate yyyy-mm-dd] [--pdate yyyy-mm-dd]
--train

Train a new model and make predictions (Default)

--predict

Make predictions from a saved model

--tdate

The training date in format YYYY-MM-DD (Default: Earliest Date in the Data)

--pdate

The prediction date in format YYYY-MM-DD (Default: Today’s Date)

Running the Model

In the project location, run sflow with the predict flag. SportFlow will automatically create the predict.csv file using the pdate option:

sflow --predict [--pdate yyyy-mm-dd]