SportFlow

SportFlow applies machine learning algorithms to predict game outcomes in any team sport. We create binary outcome variables (for classification) that indicate whether a team will win the game or, more importantly, cover the spread. We also try to predict whether a game’s total points will exceed the over/under.

Of course, predicting a game’s outcome raises practical problems. Supervised learning performs better as the amount of data grows, and sports differ widely in how much data they produce: major-league baseball has a total of 2,430 games per year, while pro football has only 256. College football and basketball are somewhere in the middle of this range.

The other complication is determining whether or not a model for one sport can be used for another. The advantage is that combining sports gives us more data. The disadvantage is that each sport has unique characteristics that could make a unified model infeasible. Still, we can combine the game data to test an overall model.

Data Sources

SportFlow starts with minimal game data (lines and scores) and expands these data into temporal features, such as runs and streaks, for each outcome (wins, covers, and overs). Currently, we do not incorporate player data or other external factors, although some excellent open-source packages exist for that purpose, such as BurntSushi’s nflgame Python package. For its initial version, SportFlow game data must be in the format below:

NCAA Basketball Data
season	date	away.team	away.score	home.team	home.score	line	over_under
2015	2015-11-13	COLO	62	ISU	68	-10	151
2015	2015-11-13	SDAK	69	WRST	77	-6.5	136
2015	2015-11-13	WAG	57	SJU	66	-5.5	142
2015	2015-11-13	JVST	83	CMU	89	-18	142.5
2015	2015-11-13	NIAG	50	ODU	67	-18	132
2015	2015-11-13	ALBY	65	UK	78	-20	132.5
2015	2015-11-13	TEM	67	UNC	91	-9.5	145
2015	2015-11-13	NKU	61	WVU	107	-23.5	147.5
2015	2015-11-13	SIE	74	DUKE	92	-24	155
2015	2015-11-13	WCU	72	CIN	97	-20	132
2015	2015-11-13	MSM	56	MD	80	-21.5	140
2015	2015-11-13	CHAT	92	UGA	90	-10.5	136
2015	2015-11-13	SEMO	53	DAY	84	-19	140
2015	2015-11-13	DART	67	HALL	84	-11	136
2015	2015-11-13	CAN	85	HOF	96	-10.5	150.5
2015	2015-11-13	JMU	87	RICH	75	-9	137.5
2015	2015-11-13	EIU	49	IND	88	-25	150
2015	2015-11-13	FAU	55	MSU	82	-23.5	141
2015	2015-11-13	SAM	45	LOU	86	-23	142
2015	2015-11-13	MIOH	72	XAV	81	-15.5	144.5
2015	2015-11-13	PRIN	64	RID	56	1	137
2015	2015-11-13	IUPU	72	INST	70	-8	135.5
2015	2015-11-13	SAC	66	ASU	63	-18	144
2015	2015-11-13	AFA	75	SIU	77	-5.5	131
2015	2015-11-13	UNCO	72	KU	109	-29	147.5
2015	2015-11-13	BALL	53	BRAD	54	3	135
2015	2015-11-13	USD	45	USC	83	-12.5	140
2015	2015-11-13	UTM	57	OKST	91	-12	141.5
2015	2015-11-13	COR	81	GT	116	-17	130
2015	2015-11-13	MOST	65	ORU	80	-4.5	133.5
2015	2015-11-13	DREX	81	JOES	82	-9.5	127.5
2015	2015-11-13	WMRY	85	NCST	68	-12.5	149
2015	2015-11-13	SF	78	UIC	75	1.5	148.5
2015	2015-11-13	PEAY	41	VAN	80	-24.5	144
2015	2015-11-13	CSN	71	NIU	83	-9.5	134.5
2015	2015-11-13	UCSB	60	OMA	59	-2.5	157.5
2015	2015-11-13	UTSA	64	LOYI	76	-14	138.5
2015	2015-11-13	BRWN	65	SPU	77	-2	130.5
2015	2015-11-13	NAU	70	WSU	82	-10.5	145
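
As an illustration of how rows in this format can be turned into binary outcomes, here is a minimal sketch using only the Python standard library. The column names come from the sample above and from the drop list in model.yml. The tab delimiter, the formulas, and the sign convention (a negative line means the home team is favored) are assumptions made for this example, not a description of SportFlow's internal code.

```python
import csv
import io

# Three rows copied from the sample above. The tab delimiter is an assumption;
# adjust it to match your file.
SAMPLE = (
    "season\tdate\taway.team\taway.score\thome.team\thome.score\tline\tover_under\n"
    "2015\t2015-11-13\tCOLO\t62\tISU\t68\t-10\t151\n"
    "2015\t2015-11-13\tCHAT\t92\tUGA\t90\t-10.5\t136\n"
    "2015\t2015-11-13\tNKU\t61\tWVU\t107\t-23.5\t147.5\n"
)


def add_outcomes(row):
    """Return a copy of one game row with derived outcome columns."""
    home = int(row["home.score"])
    away = int(row["away.score"])
    line = float(row["line"])
    over_under = float(row["over_under"])
    out = dict(row)
    out["total_points"] = home + away
    out["point_margin_game"] = home - away
    out["won_on_points"] = home > away
    # Assumption: the line is quoted for the home team, so a negative
    # line means the home team is favored by that many points.
    out["cover_margin_game"] = (home - away) + line
    out["won_on_spread"] = out["cover_margin_game"] > 0
    out["overunder_margin"] = (home + away) - over_under
    out["over"] = out["overunder_margin"] > 0
    return out


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
games = [add_outcomes(row) for row in reader]
for game in games:
    print(game["home.team"], game["won_on_points"],
          game["won_on_spread"], game["over"])
```

Under these assumptions, ISU wins the game but does not cover a 10-point line, which is why the spread outcome is modeled separately from the win outcome.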

SportFlow follows the split-apply-combine pattern: the data are first split by team, then statistics are calculated for each team, and finally the team data are merged into the overall model frame.
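
The three steps can be sketched in plain Python as follows. The schedule, the team names, and the exact definitions of the streak and window features are hypothetical and chosen only for illustration; SportFlow's own calculations may differ. Each feature uses only games played before the current one, so the result of a game never leaks into its own features.

```python
from collections import defaultdict

ROLLING_WINDOW = 3  # mirrors rolling_window in sport.yml

# Hypothetical schedule: (date, away team, away score, home team, home score)
games = [
    ("2015-11-13", "AAA", 60, "BBB", 70),
    ("2015-11-15", "BBB", 80, "CCC", 75),
    ("2015-11-17", "CCC", 66, "BBB", 71),
    ("2015-11-19", "BBB", 59, "AAA", 64),
]

# Split: build one chronological list of (game index, won?) per team.
by_team = defaultdict(list)
for idx, (_, away, away_pts, home, home_pts) in enumerate(games):
    by_team[home].append((idx, home_pts > away_pts))
    by_team[away].append((idx, away_pts > home_pts))

# Apply: compute each team's features from its earlier games only.
features = {}  # (game index, team) -> feature dict
for team, results in by_team.items():
    history = []  # earlier results for this team, oldest first
    for idx, won in results:
        streak = 0
        for past_win in reversed(history):
            if not past_win:
                break
            streak += 1
        features[(idx, team)] = {
            "point_win_streak": streak,
            "wins_in_window": sum(history[-ROLLING_WINDOW:]),
        }
        history.append(won)

# Combine: attach home and away features to each game in the model frame.
frame = []
for idx, (date, away, _, home, _) in enumerate(games):
    row = {"date": date, "home.team": home, "away.team": away}
    for side, team in (("home", home), ("away", away)):
        for name, value in features[(idx, team)].items():
            row[f"{side}.{name}"] = value
    row["delta.point_win_streak"] = (
        row["home.point_win_streak"] - row["away.point_win_streak"]
    )
    frame.append(row)

for row in frame:
    print(row)
```

The delta column is the home value minus the away value, which is one plausible reading of the delta.* factors listed in model.yml.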

Domain Configuration

The SportFlow configuration file is minimal. You can simulate random scoring to provide a baseline for comparison with a real model. You can also experiment with the rolling window used for run and streak calculations.

sport.yml

sport:
    points_max      : 100
    points_min      : 50
    random_scoring  : False
    seasons         : []
    rolling_window  : 3

points_max: Maximum number of simulated points to assign to any single team.
points_min: Minimum number of simulated points to assign to any single team.
random_scoring: If True, assign random point values to games [Default: False].
seasons: The list of seasons (years) to evaluate.
rolling_window: The rolling window over which runs and streaks are calculated.
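
To show what random scoring might look like, the sketch below draws each team's score from the range given by points_min and points_max. The uniform distribution, the inclusive bounds, the seed argument, and the simulate_scores helper are all assumptions made for this example; the source does not specify how SportFlow generates the random values.

```python
import random

# Values from the sport.yml example above, with random_scoring switched on.
config = {"points_max": 100, "points_min": 50, "random_scoring": True}


def simulate_scores(n_games, cfg, seed=0):
    """Return (away, home) score pairs drawn uniformly from the configured range."""
    rng = random.Random(seed)  # seeded so the baseline is repeatable
    low, high = cfg["points_min"], cfg["points_max"]
    # randint includes both endpoints, so points_min and points_max can occur.
    return [(rng.randint(low, high), rng.randint(low, high))
            for _ in range(n_games)]


if config["random_scoring"]:
    scores = simulate_scores(5, config)
    print(scores)
```

A model trained on features built from random scores should have no predictive value, so it serves as a floor: a real model that scores about the same has probably found no signal.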

Model Configuration

SportFlow runs on top of AlphaPy, so the model.yml file has the same format.

model.yml

project:
    directory         : .
    file_extension    : csv
    submission_file   : 
    submit_probas     : False

data:
    drop              : ['Unnamed: 0', 'index', 'season', 'date', 'home.team', 'away.team',
                         'home.score', 'away.score', 'total_points', 'point_margin_game',
                         'won_on_points', 'lost_on_points', 'cover_margin_game',
                         'lost_on_spread', 'overunder_margin', 'over', 'under']
    features          : '*'
    sampling          :
        option        : False
        method        : under_random
        ratio         : 0.0
    sentinel          : -1
    separator         : ','
    shuffle           : False
    split             : 0.4
    target            : won_on_spread
    target_value      : True

model:
    algorithms        : ['RF', 'XGB']
    balance_classes   : False
    calibration       :
        option        : False
        type          : isotonic
    cv_folds          : 3
    estimators        : 201
    feature_selection :
        option        : False
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : True
        iterations    : 50
        random        : True
        subsample     : False
        sampling_pct  : 0.25
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 5
    scoring_function  : 'roc_auc'
    type              : classification

features:
    clustering        :
        option        : False
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : False
    encoding          :
        rounding      : 3
        type          : factorize
    factors           : ['line', 'delta.wins', 'delta.losses', 'delta.ties',
                         'delta.point_win_streak', 'delta.point_loss_streak',
                         'delta.cover_win_streak', 'delta.cover_loss_streak',
                         'delta.over_streak', 'delta.under_streak']
    interactions      :
        option        : True
        poly_degree   : 2
        sampling_pct  : 5
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : False
    pca               :
        option        : False
        increment     : 3
        maximum       : 15
        minimum       : 3
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 1
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1

pipeline:
    number_jobs       : -1
    seed              : 13201
    verbosity         : 0

plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

xgboost:
    stopping_rounds   : 30

Creating the Model

First, change to your project directory, which must already follow the Project Structure specifications:

cd path/to/project

Run this command to train a model:

sflow

Usage:

sflow [--train | --predict] [--tdate yyyy-mm-dd] [--pdate yyyy-mm-dd]

--train: Train a new model and make predictions (Default)
--predict: Make predictions from a saved model
--tdate: The training date in format YYYY-MM-DD (Default: Earliest Date in the Data)
--pdate: The prediction date in format YYYY-MM-DD (Default: Today’s Date)
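
The usage line can be modeled with Python's argparse module, as in the sketch below. This is not SportFlow's actual source: the iso_date helper, the predict_mode name, and the use of None to stand for the documented defaults are assumptions made for illustration.

```python
import argparse
from datetime import date, datetime


def iso_date(text):
    """Parse a YYYY-MM-DD string, reporting bad input as an argparse error."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="sflow")
    # --train and --predict are alternatives, so they share one destination.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--train", dest="predict_mode", action="store_false",
                      help="train a new model and make predictions (default)")
    mode.add_argument("--predict", dest="predict_mode", action="store_true",
                      help="make predictions from a saved model")
    parser.set_defaults(predict_mode=False)
    parser.add_argument("--tdate", type=iso_date, default=None,
                        help="training date; None means the earliest date in the data")
    parser.add_argument("--pdate", type=iso_date, default=None,
                        help="prediction date; None means today's date")
    return parser


args = build_parser().parse_args(["--predict", "--pdate", "2015-11-13"])
pdate = args.pdate or date.today()  # fall back to today's date
print(args.predict_mode, pdate)
```

Validating the dates when the arguments are parsed means a mistyped date is rejected before any training or prediction work begins.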

Running the Model

From the project directory, run sflow with the --predict flag. SportFlow automatically creates the predict.csv file for the date given by the --pdate option:

sflow --predict [--pdate yyyy-mm-dd]