Project Structure¶
Setup¶
Your initial configuration must have the following directories and files. The directories config, data, and input store input, and the directories model, output, and plots store output:
project
├── config
│   ├── model.yml
│   └── algos.yml
├── data
├── input
│   ├── train.csv
│   └── test.csv
├── model
├── output
└── plots
The top-level directory is the main project directory with a unique name. There are six required subdirectories:
config : This directory contains all of the YAML files. At a minimum, it must contain model.yml and algos.yml.

data : If required, any data for the domain pipeline is stored here. Data from this directory will be transformed into train.csv and test.csv in the input directory.

input : The training file train.csv and the testing file test.csv are stored here. Note that these files can be given any names, as configured in the model.yml file.

model : The final model is dumped here as a pickle file in the format model_[yyyymmdd].pkl.

output : This directory contains predictions, probabilities, rankings, and any submission files:

    predictions_[yyyymmdd].csv
    probabilities_[yyyymmdd].csv
    rankings_[yyyymmdd].csv
    submission_[yyyymmdd].csv

plots : All generated plots are stored here. Each file name has the following elements:

    plot name
    'train' or 'test'
    algorithm abbreviation
    format suffix

For example, a calibration plot for the testing data for all algorithms will be named calibration_test.png. The file name for a confusion matrix for XGBoost training data will be confusion_train_XGB.png.
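To bootstrap this layout, here is a minimal sketch using the standard library; make_project_skeleton is a hypothetical helper, not part of AlphaPy:

# Create the project root and its six required subdirectories.
# make_project_skeleton is an illustrative helper, not an AlphaPy API.
from pathlib import Path

def make_project_skeleton(root='project'):
    for subdir in ('config', 'data', 'input', 'model', 'output', 'plots'):
        Path(root, subdir).mkdir(parents=True, exist_ok=True)

make_project_skeleton()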
Model Configuration¶
Here is an example of a model configuration file. It is written in YAML and is divided into logical sections reflecting the stages of the pipeline. Within each section, you can adjust different aspects of the pipeline to experiment with model results. Please refer to the following sections for more detail.
project:
    directory : .
    file_extension : csv
    submission_file : 'gender_submission'
    submit_probas : False

data:
    drop : ['PassengerId']
    features : '*'
    sampling :
        option : False
        method : under_random
        ratio : 0.5
    sentinel : -1
    separator : ','
    shuffle : False
    split : 0.4
    target : Survived
    target_value : 1

model:
    algorithms : ['RF', 'XGB']
    balance_classes : True
    calibration :
        option : False
        type : sigmoid
    cv_folds : 3
    estimators : 51
    feature_selection :
        option : False
        percentage : 50
        uni_grid : [5, 10, 15, 20, 25]
        score_func : f_classif
    grid_search :
        option : True
        iterations : 50
        random : True
        subsample : False
        sampling_pct : 0.2
    pvalue_level : 0.01
    rfe :
        option : True
        step : 3
    scoring_function : roc_auc
    type : classification

features:
    clustering :
        option : True
        increment : 3
        maximum : 30
        minimum : 3
    counts :
        option : True
    encoding :
        rounding : 2
        type : factorize
    factors : []
    interactions :
        option : True
        poly_degree : 5
        sampling_pct : 10
    isomap :
        option : False
        components : 2
        neighbors : 5
    logtransform :
        option : False
    numpy :
        option : True
    pca :
        option : False
        increment : 1
        maximum : 10
        minimum : 2
        whiten : False
    scaling :
        option : True
        type : standard
    scipy :
        option : False
    text :
        ngrams : 3
        vectorize : False
    tsne :
        option : False
        components : 2
        learning_rate : 1000.0
        perplexity : 30.0
    variance :
        option : True
        threshold : 0.1

pipeline:
    number_jobs : -1
    seed : 42
    verbosity : 0

plots:
    calibration : True
    confusion_matrix : True
    importances : True
    learning_curve : True
    roc_curve : True

xgboost:
    stopping_rounds : 20
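Since the file is plain YAML, you can load and inspect it with PyYAML (an assumption about the parser; any YAML loader works). The path below assumes the layout shown earlier:

# Read model.yml from the project's config directory.
import yaml

with open('project/config/model.yml', 'r') as f:
    specs = yaml.safe_load(f)

print(specs['model']['algorithms'])    # ['RF', 'XGB']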
Project Section¶
The project section has the following keys:
directory : The full specification of the project location

file_extension : The extension is usually csv, but it could also be tsv or another type using a different delimiter between values

submission_file : The file name of the submission template, which is usually provided in Kaggle competitions

submit_probas : Set the value to True if submitting probabilities, or set it to False if the predictions are the actual labels or real values.
project:
    directory : .
    file_extension : csv
    submission_file : 'gender_submission'
    submit_probas : False
Warning

If you do not supply a value on the right-hand side of the colon [:], then Python will interpret that key as having a None value, which is correct. Do not spell out None; otherwise, the value will be interpreted as the string 'None'.
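A quick check with PyYAML (an assumption about the parser; this is standard YAML behavior) shows the difference:

# An empty value parses to Python's None; a spelled-out None becomes a string.
import yaml

parsed = yaml.safe_load("submission_file :\nsubmit_probas : None\n")
print(parsed['submission_file'])        # None
print(repr(parsed['submit_probas']))    # 'None'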
Data Section¶
The data section has the following keys:
drop : A list of features to be dropped from the data frame

features : A list of features for training. '*' means all features will be used in training.

sampling : Resample imbalanced classes with one of the sampling methods in alphapy.data.SamplingMethod

sentinel : The designated value to replace any missing values

separator : The delimiter separating values in the training and test files

shuffle : If True, randomly shuffle the data.

split : The proportion of data to include in training, which is a fraction between 0 and 1

target : The name of the feature that designates the label to predict

target_value : The value of the target label to predict
data:
    drop : ['PassengerId']
    features : '*'
    sampling :
        option : False
        method : under_random
        ratio : 0.5
    sentinel : -1
    separator : ','
    shuffle : False
    split : 0.4
    target : Survived
    target_value : 1
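For instance, under_random with a ratio of 0.5 behaves like random undersampling to a 1:2 minority:majority ratio. A minimal sketch, assuming the imbalanced-learn package backs the sampling methods (the exact mapping is an assumption, shown only for illustration):

# Toy example of random undersampling at a 0.5 minority:majority ratio.
# RandomUnderSampler is from imbalanced-learn, not AlphaPy itself.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
sampler = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)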
Model Section¶
The model section has the following keys:
algorithms : The list of algorithms to test for model selection. Refer to Algorithms Configuration for the abbreviation codes.

balance_classes : If True, calculate sample weights to offset the majority class when training a model.

calibration : Calibrate final probabilities for a classification. Refer to the scikit-learn documentation for Calibration.

cv_folds : The number of folds for cross-validation

estimators : The number of estimators to be used in the machine learning algorithm, e.g., the number of trees in a random forest

feature_selection : Perform univariate feature selection based on percentile. Refer to the scikit-learn documentation for FeatureSelection.

grid_search : The grid search is either random with a fixed number of iterations, or it is a full grid search. Refer to the scikit-learn documentation for GridSearch.

pvalue_level : The p-value threshold to determine whether or not a numerical feature is normally distributed

rfe : Perform Recursive Feature Elimination (RFE). Refer to the scikit-learn documentation for RecursiveFeatureElimination.

scoring_function : The scoring function is an objective function for model evaluation. Use one of the values in ScoringFunction.

type : The model type is either classification or regression.
model:
    algorithms : ['RF', 'XGB']
    balance_classes : True
    calibration :
        option : False
        type : sigmoid
    cv_folds : 3
    estimators : 51
    feature_selection :
        option : False
        percentage : 50
        uni_grid : [5, 10, 15, 20, 25]
        score_func : f_classif
    grid_search :
        option : True
        iterations : 50
        random : True
        subsample : False
        sampling_pct : 0.2
    pvalue_level : 0.01
    rfe :
        option : True
        step : 3
    scoring_function : roc_auc
    type : classification
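For reference, sigmoid calibration over 3 folds corresponds to scikit-learn's CalibratedClassifierCV. A minimal sketch on toy data (the base estimator and data set here are illustrative only):

# Sigmoid (Platt) calibration of a random forest's probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)
base = RandomForestClassifier(n_estimators=51, random_state=42)
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=3)
calibrated.fit(X, y)
probas = calibrated.predict_proba(X)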
Features Section¶
The features section has the following keys:
clustering : For clustering, specify the minimum and maximum number of clusters and the increment from min to max.

counts : Create features that record counts of the NA values, zero values, and the digits 1-9 in each row.

encoding : Encode factors from features, selecting an encoding type and any rounding if necessary. Refer to alphapy.features.Encoders for the encoding type.

factors : The list of features that are factors

interactions : Calculate polynomial interactions of a given degree, and select the percentage of interactions included in the feature set.

isomap : Use isomap embedding. Refer to isomap.

logtransform : For numerical features that do not fit a normal distribution, perform a log transformation.

numpy : Calculate the total, mean, standard deviation, and variance of each row.

pca : For Principal Component Analysis, specify the minimum and maximum number of components, the increment from min to max, and whether or not whitening is applied.

scaling : To scale features, specify standard or minmax.

scipy : Calculate skew and kurtosis for row distributions.

text : If there are text features, then apply vectorization and TF-IDF. If vectorization does not work, then apply factorization.

tsne : Perform t-distributed Stochastic Neighbor Embedding (t-SNE), which can be very memory-intensive. Refer to TSNE.

variance : Remove low-variance features using a specified threshold. Refer to VAR.
features:
    clustering :
        option : True
        increment : 3
        maximum : 30
        minimum : 3
    counts :
        option : True
    encoding :
        rounding : 2
        type : factorize
    factors : []
    interactions :
        option : True
        poly_degree : 5
        sampling_pct : 10
    isomap :
        option : False
        components : 2
        neighbors : 5
    logtransform :
        option : False
    numpy :
        option : True
    pca :
        option : False
        increment : 1
        maximum : 10
        minimum : 2
        whiten : False
    scaling :
        option : True
        type : standard
    scipy :
        option : False
    text :
        ngrams : 3
        vectorize : False
    tsne :
        option : False
        components : 2
        learning_rate : 1000.0
        perplexity : 30.0
    variance :
        option : True
        threshold : 0.1
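Several of these options have close scikit-learn analogues. A rough sketch under that assumption, using the standard transformers (the polynomial degree is lowered here for brevity; the configuration above uses poly_degree 5):

# Illustrative stand-ins for the scaling, interactions, and variance options.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.random.RandomState(42).rand(100, 4)

# scaling with type standard
X_scaled = StandardScaler().fit_transform(X)

# polynomial interactions (degree 2 here for brevity)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

# remove low-variance features at threshold 0.1
X_final = VarianceThreshold(threshold=0.1).fit_transform(X_poly)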
Treatments Section¶
Treatments are special functions for feature extraction. In the treatments section below, we are applying treatments to two features, doji and hc. Within the Python list, we are calling the runs_test function of the module alphapy.features. The module name is always the first element of the list, and the function name is always the second element. The remaining elements of the list are the actual parameters to the function.
treatments:
    doji : ['alphapy.features', 'runs_test', ['all'], 18]
    hc : ['alphapy.features', 'runs_test', ['all'], 18]
Here is the code for the runs_test function, which calculates runs for Boolean features. For a treatment function, the first and second arguments are always the same. The first argument f is the data frame, and the second argument c is the column (or feature) to which we are going to apply the treatment. The remaining function arguments correspond to the actual parameters that were specified in the configuration file, in this case wfuncs and window.
def runs_test(f, c, wfuncs, window):
    fc = f[c]
    all_funcs = {'runs' : runs,
                 'streak' : streak,
                 'rtotal' : rtotal,
                 'zscore' : zscore}
    # use all functions
    if 'all' in wfuncs:
        wfuncs = all_funcs.keys()
    # apply each of the runs functions
    new_features = pd.DataFrame()
    for w in wfuncs:
        if w in all_funcs:
            new_feature = fc.rolling(window=window).apply(all_funcs[w])
            new_feature.fillna(0, inplace=True)
            frames = [new_features, new_feature]
            new_features = pd.concat(frames, axis=1)
        else:
            logger.info("Runs Function %s not found", w)
    return new_features
When the runs_test function is invoked, a new data frame is created, as multiple feature columns may be generated from a single treatment function. These new features are returned and appended to the original data frame.
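To write your own treatment, follow the same calling convention: accept the data frame and column first, then your own parameters, and return a data frame of new features. A minimal sketch with a hypothetical mean_delta function:

# mean_delta is a hypothetical example, not an AlphaPy function.
import pandas as pd

def mean_delta(f, c, window):
    # Distance of each value of column c from its rolling mean.
    fc = f[c]
    new_feature = fc - fc.rolling(window=window).mean()
    return pd.DataFrame({c + '_mean_delta' : new_feature.fillna(0)})

The corresponding model.yml entry would then name the module, the function, and the window (the feature name close and module mymodule are placeholders):

treatments:
    close : ['mymodule', 'mean_delta', 10]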
Pipeline Section¶
The pipeline section has the following keys:
number_jobs : The number of jobs to run in parallel [-1 for all cores]

seed : A random seed integer to ensure reproducible results

verbosity : The logging level, from 0 (no logging) to 10 (highest)
pipeline:
    number_jobs : -1
    seed : 42
    verbosity : 0
Plots Section¶
To turn on the automatic generation of any plot in the plots section, simply set the corresponding value to True.
plots:
    calibration : True
    confusion_matrix : True
    importances : True
    learning_curve : True
    roc_curve : True
XGBoost Section¶
The xgboost section has the following keys:

stopping_rounds : The number of early stopping rounds for XGBoost
xgboost:
    stopping_rounds : 20
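A sketch of what stopping_rounds controls, using the xgboost scikit-learn API on toy data. Note that recent xgboost versions take early_stopping_rounds in the constructor, while older ones accepted it in fit(); AlphaPy's own wiring may differ:

# Early stopping halts boosting after 20 rounds without improvement
# on the validation set. Toy data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

clf = XGBClassifier(n_estimators=500, early_stopping_rounds=20)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)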
Algorithms Configuration¶
Each algorithm has its own section in the algos.yml file, e.g., AB or RF. The following elements are required for every algorithm entry in the YAML file:
model_type : Specify classification or regression

params : The initial parameters for the first fitting

grid : The grid search dictionary for hyperparameter tuning of an estimator. If you are using randomized grid search, then make sure that the total number of grid combinations exceeds the number of random iterations.

scoring : Set to True if a specific scoring function will be applied.
Note

The parameters n_estimators, n_jobs, seed, and verbosity are informed by the model.yml file. When the estimators are created, the proper values for these parameters are automatically substituted in the algos.yml file on a global basis.
#
# Algorithms
#

AB:
    # AdaBoost
    model_type : classification
    params : {"n_estimators" : n_estimators,
              "random_state" : seed}
    grid : {"n_estimators" : [10, 50, 100, 150, 200],
            "learning_rate" : [0.2, 0.5, 0.7, 1.0, 1.5, 2.0],
            "algorithm" : ['SAMME', 'SAMME.R']}
    scoring : True

GB:
    # Gradient Boosting
    model_type : classification
    params : {"n_estimators" : n_estimators,
              "max_depth" : 3,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {"loss" : ['deviance', 'exponential'],
            "learning_rate" : [0.05, 0.1, 0.15],
            "n_estimators" : [50, 100, 200],
            "max_depth" : [3, 5, 10],
            "min_samples_split" : [2, 3],
            "min_samples_leaf" : [1, 2]}
    scoring : True

GBR:
    # Gradient Boosting Regression
    model_type : regression
    params : {"n_estimators" : n_estimators,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {}
    scoring : False

KNN:
    # K-Nearest Neighbors
    model_type : classification
    params : {"n_jobs" : n_jobs}
    grid : {"n_neighbors" : [3, 5, 7, 10],
            "weights" : ['uniform', 'distance'],
            "algorithm" : ['ball_tree', 'kd_tree', 'brute', 'auto'],
            "leaf_size" : [10, 20, 30, 40, 50]}
    scoring : False

KNR:
    # K-Nearest Neighbor Regression
    model_type : regression
    params : {"n_jobs" : n_jobs}
    grid : {}
    scoring : False

LOGR:
    # Logistic Regression
    model_type : classification
    params : {"random_state" : seed,
              "n_jobs" : n_jobs,
              "verbose" : verbosity}
    grid : {"penalty" : ['l2'],
            "C" : [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7],
            "fit_intercept" : [True, False],
            "solver" : ['newton-cg', 'lbfgs', 'liblinear', 'sag']}
    scoring : True

LR:
    # Linear Regression
    model_type : regression
    params : {"n_jobs" : n_jobs}
    grid : {"fit_intercept" : [True, False],
            "normalize" : [True, False],
            "copy_X" : [True, False]}
    scoring : False
LSVC:
    # Linear Support Vector Classification
    model_type : classification
    params : {"C" : 0.01,
              "max_iter" : 2000,
              "penalty" : 'l1',
              "dual" : False,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {"C" : np.logspace(-2, 10, 13),
            "penalty" : ['l1', 'l2'],
            "dual" : [True, False],
            "tol" : [0.0005, 0.001, 0.005],
            "max_iter" : [500, 1000, 2000]}
    scoring : False

LSVM:
    # Linear Support Vector Machine
    model_type : classification
    params : {"kernel" : 'linear',
              "probability" : True,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {"C" : np.logspace(-2, 10, 13),
            "gamma" : np.logspace(-9, 3, 13),
            "shrinking" : [True, False],
            "tol" : [0.0005, 0.001, 0.005],
            "decision_function_shape" : ['ovo', 'ovr']}
    scoring : False

NB:
    # Naive Bayes
    model_type : classification
    params : {}
    grid : {"alpha" : [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 5.0, 10.0],
            "fit_prior" : [True, False]}
    scoring : True

RBF:
    # Radial Basis Function
    model_type : classification
    params : {"kernel" : 'rbf',
              "probability" : True,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {"C" : np.logspace(-2, 10, 13),
            "gamma" : np.logspace(-9, 3, 13),
            "shrinking" : [True, False],
            "tol" : [0.0005, 0.001, 0.005],
            "decision_function_shape" : ['ovo', 'ovr']}
    scoring : False

RF:
    # Random Forest
    model_type : classification
    params : {"n_estimators" : n_estimators,
              "max_depth" : 10,
              "min_samples_split" : 5,
              "min_samples_leaf" : 3,
              "bootstrap" : True,
              "criterion" : 'entropy',
              "random_state" : seed,
              "n_jobs" : n_jobs,
              "verbose" : verbosity}
    grid : {"n_estimators" : [21, 51, 101, 201, 501],
            "max_depth" : [5, 7, 10, 20],
            "min_samples_split" : [2, 3, 5, 10],
            "min_samples_leaf" : [1, 2, 3],
            "bootstrap" : [True, False],
            "criterion" : ['gini', 'entropy']}
    scoring : True

RFR:
    # Random Forest Regression
    model_type : regression
    params : {"n_estimators" : n_estimators,
              "random_state" : seed,
              "n_jobs" : n_jobs,
              "verbose" : verbosity}
    grid : {}
    scoring : False
SVM:
    # Support Vector Machine
    model_type : classification
    params : {"probability" : True,
              "random_state" : seed,
              "verbose" : verbosity}
    grid : {"C" : np.logspace(-2, 10, 13),
            "gamma" : np.logspace(-9, 3, 13),
            "shrinking" : [True, False],
            "tol" : [0.0005, 0.001, 0.005],
            "decision_function_shape" : ['ovo', 'ovr']}
    scoring : False

TF_DNN:
    # Google TensorFlow Deep Neural Network
    model_type : classification
    params : {"feature_columns" : [tf.contrib.layers.real_valued_column("", dimension=4)],
              "n_classes" : 2,
              "hidden_units" : [20, 40, 20]}
    grid : {}
    scoring : False

XGB:
    # XGBoost Binary
    model_type : classification
    params : {"objective" : 'binary:logistic',
              "n_estimators" : n_estimators,
              "seed" : seed,
              "max_depth" : 6,
              "learning_rate" : 0.1,
              "min_child_weight" : 1.1,
              "subsample" : 0.9,
              "colsample_bytree" : 0.9,
              "nthread" : n_jobs,
              "silent" : True}
    grid : {"n_estimators" : [21, 51, 101, 201, 501],
            "max_depth" : [5, 6, 7, 8, 9, 10, 12, 15, 20],
            "learning_rate" : [0.01, 0.02, 0.05, 0.1, 0.2],
            "min_child_weight" : [1.0, 1.1],
            "subsample" : [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            "colsample_bytree" : [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
    scoring : False

XGBM:
    # XGBoost Multiclass
    model_type : multiclass
    params : {"objective" : 'multi:softmax',
              "n_estimators" : n_estimators,
              "seed" : seed,
              "max_depth" : 10,
              "learning_rate" : 0.1,
              "min_child_weight" : 1.1,
              "subsample" : 0.9,
              "colsample_bytree" : 0.9,
              "nthread" : n_jobs,
              "silent" : True}
    grid : {}
    scoring : False
XGBR:
    # XGBoost Regression
    model_type : regression
    params : {"objective" : 'reg:linear',
              "n_estimators" : n_estimators,
              "seed" : seed,
              "max_depth" : 10,
              "learning_rate" : 0.1,
              "min_child_weight" : 1.1,
              "subsample" : 0.9,
              "colsample_bytree" : 0.9,
              "nthread" : n_jobs,
              "silent" : True}
    grid : {}
    scoring : False
XT:
    # Extra Trees
    model_type : classification
    params : {"n_estimators" : n_estimators,
              "random_state" : seed,
              "n_jobs" : n_jobs,
              "verbose" : verbosity}
    grid : {"n_estimators" : [21, 51, 101, 201, 501, 1001, 2001],
            "max_features" : ['auto', 'sqrt', 'log2'],
            "max_depth" : [3, 5, 7, 10, 20, 30],
            "min_samples_split" : [2, 3],
            "min_samples_leaf" : [1, 2],
            "bootstrap" : [True, False],
            "warm_start" : [True, False]}
    scoring : True

XTR:
    # Extra Trees Regression
    model_type : regression
    params : {"n_estimators" : n_estimators,
              "random_state" : seed,
              "n_jobs" : n_jobs,
              "verbose" : verbosity}
    grid : {}
    scoring : False
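As a rough sketch of how such an entry might be wired up (AlphaPy's own machinery is more involved), the RF section above translates to an estimator plus a randomized search. Note that the RF grid yields 960 combinations, comfortably above the 50 random iterations configured earlier:

# Hand-built equivalent of the RF entry with the substituted global
# parameters (n_estimators=51, seed=42, n_jobs=-1). Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)

estimator = RandomForestClassifier(n_estimators=51, max_depth=10,
                                   min_samples_split=5, min_samples_leaf=3,
                                   bootstrap=True, criterion='entropy',
                                   random_state=42, n_jobs=-1)
grid = {"n_estimators" : [21, 51, 101, 201, 501],
        "max_depth" : [5, 7, 10, 20],
        "min_samples_split" : [2, 3, 5, 10],
        "min_samples_leaf" : [1, 2, 3],
        "bootstrap" : [True, False],
        "criterion" : ['gini', 'entropy']}
search = RandomizedSearchCV(estimator, grid, n_iter=50,
                            scoring='roc_auc', random_state=42)
search.fit(X, y)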
Final Output¶
This is an example of your file structure after running the pipeline:
project
├── alphapy.log
├── config
│   ├── algos.yml
│   └── model.yml
├── data
├── input
│   ├── test.csv
│   └── train.csv
├── model
│   ├── feature_map_20170325.pkl
│   └── model_20170325.pkl
├── output
│   ├── predictions_20170325.csv
│   ├── probabilities_20170325.csv
│   ├── rankings_20170325.csv
│   └── submission_20170325.csv
└── plots
    ├── calibration_train.png
    ├── confusion_train_RF.png
    ├── confusion_train_XGB.png
    ├── feature_importance_train_RF.png
    ├── feature_importance_train_XGB.png
    ├── learning_curve_train_RF.png
    ├── learning_curve_train_XGB.png
    └── roc_curve_train.png
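Because the artifact names embed the date, a simple sort recovers the most recent model. A minimal sketch, assuming the standard-library pickle format shown above:

# Load the newest model pickle; lexicographic sort works because the
# file names embed the date as yyyymmdd.
import glob
import pickle

latest = sorted(glob.glob('project/model/model_*.pkl'))[-1]
with open(latest, 'rb') as f:
    model = pickle.load(f)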