Project Structure

Setup

Your initial configuration must have the following directories and files. The directories config, data, and input store input, and the directories model, output, and plots store output:

project
├── config
│   ├── model.yml
│   └── algos.yml
├── data
├── input
│   ├── train.csv
│   └── test.csv
├── model
├── output
└── plots

The top-level directory is the main project directory with a unique name. There are six required subdirectories:

config:
This directory contains all of the YAML files. At a minimum, it must contain model.yml and algos.yml.
data:
If required, any data for the domain pipeline is stored here. Data from this directory will be transformed into train.csv and test.csv in the input directory.
input:
The training file train.csv and the testing file test.csv are stored here. Note that these files can be given any names, as configured in the model.yml file.
model:
The final model is dumped here as a pickle file in the format model_[yyyymmdd].pkl.
output:

This directory contains predictions, probabilities, rankings, and any submission files:

  • predictions_[yyyymmdd].csv
  • probabilities_[yyyymmdd].csv
  • rankings_[yyyymmdd].csv
  • submission_[yyyymmdd].csv
plots:

All generated plots are stored here. The file name has the following elements:

  • plot name
  • ‘train’ or ‘test’
  • algorithm abbreviation
  • format suffix

For example, a calibration plot for the testing data for all algorithms will be named calibration_test.png. The file name for a confusion matrix for XGBoost training data will be confusion_train_XGB.png.
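
The naming conventions above can be sketched in a few lines of Python. This is only an illustration of the patterns described in this section; the helper names dated_filename and plot_filename are hypothetical and are not part of AlphaPy.

```python
from datetime import date


def dated_filename(prefix, extension, day=None):
    """Build a name such as model_20170325.pkl (hypothetical helper)."""
    day = day or date.today()
    return f"{prefix}_{day:%Y%m%d}.{extension}"


def plot_filename(plot, partition, algo=None, suffix="png"):
    """Join plot name, 'train' or 'test', optional algorithm code, and suffix."""
    parts = [plot, partition] + ([algo] if algo else [])
    return "_".join(parts) + "." + suffix


model_file = dated_filename("model", "pkl", date(2017, 3, 25))
all_algos_plot = plot_filename("calibration", "test")
xgb_plot = plot_filename("confusion", "train", "XGB")
```

The algorithm abbreviation is optional because plots that cover all algorithms, such as the calibration plot, omit it.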

Model Configuration

Here is an example of a model configuration file. It is written in YAML and is divided into logical sections reflecting the stages of the pipeline. Within each section, you can control different aspects for experimenting with model results. Please refer to the following sections for more detail.

model.yml
project:
    directory         : .
    file_extension    : csv
    submission_file   : 'gender_submission'
    submit_probas     : False

data:
    drop              : ['PassengerId']
    features          : '*'
    sampling          :
        option        : False
        method        : under_random
        ratio         : 0.5
    sentinel          : -1
    separator         : ','
    shuffle           : False
    split             : 0.4
    target            : Survived
    target_value      : 1

model:
    algorithms        : ['RF', 'XGB']
    balance_classes   : True
    calibration       :
        option        : False
        type          : sigmoid
    cv_folds          : 3
    estimators        : 51
    feature_selection :
        option        : False
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : True
        iterations    : 50
        random        : True
        subsample     : False
        sampling_pct  : 0.2
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 3
    scoring_function  : roc_auc
    type              : classification

features:
    clustering        :
        option        : True
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : True
    encoding          :
        rounding      : 2
        type          : factorize
    factors           : []
    interactions      :
        option        : True
        poly_degree   : 5
        sampling_pct  : 10
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : True
    pca               :
        option        : False
        increment     : 1
        maximum       : 10
        minimum       : 2
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 3
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1

pipeline:
    number_jobs       : -1
    seed              : 42
    verbosity         : 0

plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

xgboost:
    stopping_rounds   : 20

Project Section

The project section has the following keys:

directory:
The full specification of the project location
file_extension:
The extension is usually csv but could also be tsv or other types using different delimiters between values
submission_file:
The file name of the submission template, which is usually provided in Kaggle competitions
submit_probas:
Set the value to True if submitting probabilities, or set to False if the predictions are the actual labels or real values.
model.yml
project:
    directory         : .
    file_extension    : csv
    submission_file   : 'gender_submission'
    submit_probas     : False

Warning

If you do not supply a value on the right-hand side of the colon [:], then Python will interpret that key as having a None value, which is correct. Do not spell out None; otherwise, the value will be interpreted as the string ‘None’.
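
The difference between an empty value and a spelled-out None can be checked directly. The sketch below assumes that the configuration is read with the PyYAML package (yaml.safe_load); how AlphaPy itself loads the file is not shown here, and the values are chosen only for illustration.

```python
import yaml  # PyYAML, assumed to be installed

text = """
project:
    submission_file :
    submit_probas   : None
"""

project = yaml.safe_load(text)["project"]

# An empty right-hand side becomes the Python value None.
empty_value = project["submission_file"]

# A spelled-out None is read as an ordinary string.
spelled_out = project["submit_probas"]

print(repr(empty_value), repr(spelled_out))
```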

Data Section

The data section has the following keys:

drop:
A list of features to be dropped from the data frame
features:
A list of features for training. '*' means all features will be used in training.
sampling:
Resample imbalanced classes with one of the sampling methods in alphapy.data.SamplingMethod
sentinel:
The designated value to replace any missing values
separator:
The delimiter separating values in the training and test files
shuffle:
If True, randomly shuffle the data.
split:
The proportion of data to include in training, which is a fraction between 0 and 1
target:
The name of the feature that designates the label to predict
target_value:
The value of the target label to predict
model.yml
data:
    drop              : ['PassengerId']
    features          : '*'
    sampling          :
        option        : False
        method        : under_random
        ratio         : 0.5
    sentinel          : -1
    separator         : ','
    shuffle           : False
    split             : 0.4
    target            : Survived
    target_value      : 1
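
To make the roles of drop, sentinel, and separator concrete, here is a minimal sketch that uses only the Python standard library. The function load_rows is a hypothetical helper written for this illustration; it is not AlphaPy's data loader, and it leaves all values as strings.

```python
import csv
import io


def load_rows(text, separator=",", drop=(), sentinel=-1):
    """Read delimited text, drop unwanted columns, and fill missing values."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    rows = []
    for row in reader:
        cleaned = {}
        for name, value in row.items():
            if name in drop:
                continue  # column listed under the drop key
            cleaned[name] = sentinel if value == "" else value
        rows.append(cleaned)
    return rows


sample = "PassengerId,Age,Survived\n1,22,0\n2,,1\n"
rows = load_rows(sample, separator=",", drop=["PassengerId"], sentinel=-1)
```

In this sample, the second row has no Age, so the sentinel -1 takes its place, and PassengerId is absent from every row.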

Model Section

The model section has the following keys:

algorithms:
The list of algorithms to test for model selection. Refer to Algorithms Configuration for the abbreviation codes.
balance_classes:
If True, calculate sample weights to offset the majority class when training a model.
calibration:
Calibrate final probabilities for a classification. Refer to the scikit-learn documentation for Calibration.
cv_folds:
The number of folds for cross-validation
estimators:
The number of estimators to be used in the machine learning algorithm, e.g., the number of trees in a random forest
feature_selection:
Perform univariate feature selection based on percentile. Refer to the scikit-learn documentation for FeatureSelection.
grid_search:
The grid search is either random with a fixed number of iterations, or it is a full grid search. Refer to the scikit-learn documentation for GridSearch.
pvalue_level:
The p-value threshold to determine whether or not a numerical feature is normally distributed.
rfe:
Perform Recursive Feature Elimination (RFE). Refer to the scikit-learn documentation for RecursiveFeatureElimination.
scoring_function:
The scoring function is an objective function for model evaluation. Use one of the values in ScoringFunction.
type:
The model type is either classification or regression.
model.yml
model:
    algorithms        : ['RF', 'XGB']
    balance_classes   : True
    calibration       :
        option        : False
        type          : sigmoid
    cv_folds          : 3
    estimators        : 51
    feature_selection :
        option        : False
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : True
        iterations    : 50
        random        : True
        subsample     : False
        sampling_pct  : 0.2
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 3
    scoring_function  : roc_auc
    type              : classification
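
As an illustration of what balance_classes refers to, the sketch below computes sample weights that offset the majority class. It uses the "balanced" heuristic known from scikit-learn, n_samples / (n_classes * class_count); whether AlphaPy uses exactly this formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample inversely to the frequency of its class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Three samples of the majority class and one of the minority class.
weights = balanced_sample_weights([0, 0, 0, 1])
```

Each majority sample receives a weight below 1 and the single minority sample a weight above 1, so both classes contribute equally in total.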

Features Section

The features section has the following keys:

clustering:
For clustering, specify the minimum and maximum number of clusters and the increment from min-to-max.
counts:
Create features that record counts of the NA values, zero values, and the digits 1-9 in each row.
encoding:
Encode factors from features, selecting an encoding type and any rounding if necessary. Refer to alphapy.features.Encoders for the encoding type.
factors:
The list of features that are factors.
interactions:
Calculate polynomial interactions of a given degree, and select the percentage of interactions included in the feature set.
isomap:
Use isomap embedding. Refer to the scikit-learn documentation for Isomap.
logtransform:
For numerical features that do not fit a normal distribution, perform a log transformation.
numpy:
Calculate the total, mean, standard deviation, and variance of each row.
pca:
For Principal Component Analysis, specify the minimum and maximum number of components, the increment from min-to-max, and whether or not whitening is applied.
scaling:
To scale features, specify standard or minmax.
scipy:
Calculate skew and kurtosis for row distributions.
text:
If there are text features, then apply vectorization and TF-IDF. If vectorization does not work, then apply factorization.
tsne:
Perform t-distributed Stochastic Neighbor Embedding (TSNE), which can be very memory-intensive. Refer to TSNE.
variance:
Remove low-variance features using a specified threshold. Refer to the scikit-learn documentation for VarianceThreshold.
model.yml
features:
    clustering        :
        option        : True
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : True
    encoding          :
        rounding      : 2
        type          : factorize
    factors           : []
    interactions      :
        option        : True
        poly_degree   : 5
        sampling_pct  : 10
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : True
    pca               :
        option        : False
        increment     : 1
        maximum       : 10
        minimum       : 2
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 3
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1
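
The row-wise features described under counts and numpy can be sketched with the standard library alone. This is a rough illustration, not AlphaPy's implementation: the function row_features is hypothetical, NaN values are assumed to be excluded from the statistics, population formulas are assumed for the standard deviation and variance, and the digit counts are omitted.

```python
import math
import statistics


def row_features(row):
    """Count NaN and zero values, then summarize the remaining values."""
    values = [v for v in row if not (isinstance(v, float) and math.isnan(v))]
    return {
        "nan_count": len(row) - len(values),
        "zero_count": sum(1 for v in values if v == 0),
        "total": sum(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "var": statistics.pvariance(values),
    }


features = row_features([3.0, 0.0, float("nan"), 5.0, 0.0])
```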

Treatments Section

Treatments are special functions for feature extraction. In the treatments section below, we are applying treatments to two features, doji and hc. Within the Python list, we are calling the runs_test function of the module alphapy.features. The module name is always the first element of the list, and the function name is always the second element of the list. The remaining elements of the list are the actual parameters to the function.

model.yml
 treatments:
     doji : ['alphapy.features', 'runs_test', ['all'], 18]
     hc   : ['alphapy.features', 'runs_test', ['all'], 18]

Here is the code for the runs_test function, which calculates runs for Boolean features. For a treatment function, the first and second arguments are always the same. The first argument f is the data frame, and the second argument c is the column (or feature) to which we are going to apply the treatment. The remaining function arguments correspond to the actual parameters that were specified in the configuration file, in this case wfuncs and window.

features.py
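 import logging

 import pandas as pd

 logger = logging.getLogger(__name__)

 # runs, streak, rtotal, and zscore are helper functions that must be
 # available in the module namespace of features.py

 def runs_test(f, c, wfuncs, window):
     # select the column (feature) to be treated
     fc = f[c]
     all_funcs = {'runs'   : runs,
                  'streak' : streak,
                  'rtotal' : rtotal,
                  'zscore' : zscore}
     # use all functions
     if 'all' in wfuncs:
         wfuncs = all_funcs.keys()
     # apply each of the runs functions over a rolling window
     new_features = pd.DataFrame()
     for w in wfuncs:
         if w in all_funcs:
             new_feature = fc.rolling(window=window).apply(all_funcs[w])
             new_feature.fillna(0, inplace=True)
             frames = [new_features, new_feature]
             new_features = pd.concat(frames, axis=1)
         else:
             logger.info("Runs Function %s not found", w)
     # one column per applied function
     return new_features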

When the runs_test function is invoked, a new data frame is created, as multiple feature columns may be generated from a single treatment function. These new features are returned and appended to the original data frame.
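
The way a treatment entry maps to a function call can be sketched as follows. This is a minimal illustration of the convention described above (module name, function name, then parameters), not AlphaPy's own dispatch code. The module demo_treatments and the function scale are hypothetical stand-ins, and a plain dictionary takes the place of the data frame.

```python
import importlib
import sys
import types


def apply_treatment(frame, column, spec):
    """Resolve [module, function, *params] and call function(frame, column, *params)."""
    module_name, func_name, *params = spec
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    return func(frame, column, *params)


# Register a hypothetical module so that the example is self-contained.
demo = types.ModuleType("demo_treatments")


def scale(f, c, factor):
    # f is the frame, c is the column, factor comes from the configuration
    return [value * factor for value in f[c]]


demo.scale = scale
sys.modules["demo_treatments"] = demo

frame = {"doji": [1, 0, 1]}
result = apply_treatment(frame, "doji", ["demo_treatments", "scale", 10])
```

The first two arguments are supplied by the caller, which is why a treatment function always begins with the frame and the column.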

Pipeline Section

The pipeline section has the following keys:

number_jobs:
Number of jobs to run in parallel [-1 for all cores]
seed:
A random seed integer to ensure reproducible results
verbosity:
The logging level from 0 (no logging) to 10 (highest)
model.yml
pipeline:
    number_jobs       : -1
    seed              : 42
    verbosity         : 0

Plots Section

To turn on the automatic generation of any plot in the plots section, simply set the corresponding value to True.

model.yml
plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

XGBoost Section

The xgboost section has the following keys:

stopping_rounds:
The number of early stopping rounds for XGBoost
model.yml
xgboost:
    stopping_rounds   : 20
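
Early stopping halts training once the validation metric has not improved for the given number of rounds. The sketch below illustrates that rule in general terms; it is not XGBoost's code, the function best_round is hypothetical, and a lower score is assumed to be better.

```python
def best_round(scores, stopping_rounds):
    """Return the index of the best round seen before early stopping triggers."""
    best_index, best_score = 0, float("inf")
    for i, score in enumerate(scores):
        if score < best_score:
            best_index, best_score = i, score
        elif i - best_index >= stopping_rounds:
            break  # no improvement for stopping_rounds consecutive rounds
    return best_index


validation_loss = [0.9, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
patient = best_round(validation_loss, stopping_rounds=5)
impatient = best_round(validation_loss, stopping_rounds=3)
```

With a window of 3 rounds, training stops before the late improvement in the final round is reached; with a window of 5 rounds, it is found.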

Algorithms Configuration

Each algorithm has its own section in the algos.yml file, e.g., AB or RF. The following elements are required for every algorithm entry in the YAML file:

model_type:
Specify classification or regression
params:
The initial parameters for the first fitting
grid:
The grid search dictionary for hyperparameter tuning of an estimator. If you are using randomized grid search, then make sure that the total number of grid combinations exceeds the number of random iterations.
scoring:
Set to True if a specific scoring function will be applied.
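
To check the advice on randomized grid search, count the grid combinations and compare the total with the number of iterations. The sketch below does this for the RF grid from the algos.yml listing in this section and the iterations value of 50 from model.yml; the function grid_size is a hypothetical helper.

```python
from math import prod


def grid_size(grid):
    """Total number of combinations in a grid search dictionary."""
    return prod(len(options) for options in grid.values())


rf_grid = {"n_estimators": [21, 51, 101, 201, 501],
           "max_depth": [5, 7, 10, 20],
           "min_samples_split": [2, 3, 5, 10],
           "min_samples_leaf": [1, 2, 3],
           "bootstrap": [True, False],
           "criterion": ["gini", "entropy"]}

iterations = 50
total = grid_size(rf_grid)
enough_combinations = total > iterations
```

Here the grid has 5 * 4 * 4 * 3 * 2 * 2 = 960 combinations, comfortably more than 50 random iterations.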

Note

The parameters n_estimators, n_jobs, seed, and verbosity take their values from the model.yml file. When the estimators are created, the proper values for these parameters are automatically substituted in the algos.yml file on a global basis.
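
One way to picture this substitution is a lookup that replaces each placeholder name with its configured value. The sketch below is an assumption about the general idea only; AlphaPy's actual mechanism may differ, and the function substitute is hypothetical. The values come from the model.yml example (estimators 51, number_jobs -1, seed 42, verbosity 0).

```python
def substitute(params, values):
    """Replace placeholder strings such as 'seed' with their configured values."""
    return {key: values.get(val, val) if isinstance(val, str) else val
            for key, val in params.items()}


model_values = {"n_estimators": 51, "n_jobs": -1, "seed": 42, "verbosity": 0}

rf_params = {"n_estimators": "n_estimators",
             "max_depth": 10,
             "criterion": "entropy",
             "random_state": "seed",
             "n_jobs": "n_jobs",
             "verbose": "verbosity"}

resolved = substitute(rf_params, model_values)
```

Strings that are not placeholders, such as 'entropy', pass through unchanged.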

algos.yml
#
# Algorithms
#

AB:
    # AdaBoost
    model_type : classification
    params     : {"n_estimators" : n_estimators,
                  "random_state" : seed}
    grid       : {"n_estimators" : [10, 50, 100, 150, 200],
                  "learning_rate" : [0.2, 0.5, 0.7, 1.0, 1.5, 2.0],
                  "algorithm" : ['SAMME', 'SAMME.R']}
    scoring    : True

GB:
    # Gradient Boosting
    model_type : classification
    params     : {"n_estimators" : n_estimators,
                  "max_depth" : 3,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {"loss" : ['deviance', 'exponential'],
                  "learning_rate" : [0.05, 0.1, 0.15],
                  "n_estimators" : [50, 100, 200],
                  "max_depth" : [3, 5, 10],
                  "min_samples_split" : [2, 3],
                  "min_samples_leaf" : [1, 2]}
    scoring    : True

GBR:
    # Gradient Boosting Regression
    model_type : regression
    params     : {"n_estimators" : n_estimators,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {}
    scoring    : False

KNN:
    # K-Nearest Neighbors
    model_type : classification
    params     : {"n_jobs" : n_jobs}
    grid       : {"n_neighbors" : [3, 5, 7, 10],
                  "weights" : ['uniform', 'distance'],
                  "algorithm" : ['ball_tree', 'kd_tree', 'brute', 'auto'],
                  "leaf_size" : [10, 20, 30, 40, 50]}
    scoring    : False

KNR:
    # K-Nearest Neighbor Regression
    model_type : regression
    params     : {"n_jobs" : n_jobs}
    grid       : {}
    scoring    : False

LOGR:
    # Logistic Regression
    model_type : classification
    params     : {"random_state" : seed,
                  "n_jobs" : n_jobs,
                  "verbose" : verbosity}
    grid       : {"penalty" : ['l2'],
                  "C" : [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7],
                  "fit_intercept" : [True, False],
                  "solver" : ['newton-cg', 'lbfgs', 'liblinear', 'sag']}
    scoring    : True

LR:
    # Linear Regression
    model_type : regression
    params     : {"n_jobs" : n_jobs}
    grid       : {"fit_intercept" : [True, False],
                  "normalize" : [True, False],
                  "copy_X" : [True, False]}
    scoring    : False

LSVC:
    # Linear Support Vector Classification
    model_type : classification
    params     : {"C" : 0.01,
                  "max_iter" : 2000,
                  "penalty" : 'l1',
                  "dual" : False,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {"C" : np.logspace(-2, 10, 13),
                  "penalty" : ['l1', 'l2'],
                  "dual" : [True, False],
                  "tol" : [0.0005, 0.001, 0.005],
                  "max_iter" : [500, 1000, 2000]}
    scoring    : False

LSVM:
    # Linear Support Vector Machine
    model_type : classification
    params     : {"kernel" : 'linear',
                  "probability" : True,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {"C" : np.logspace(-2, 10, 13),
                  "gamma" : np.logspace(-9, 3, 13),
                  "shrinking" : [True, False],
                  "tol" : [0.0005, 0.001, 0.005],
                  "decision_function_shape" : ['ovo', 'ovr']}
    scoring    : False

NB:
    # Naive Bayes
    model_type : classification
    params     : {}
    grid       : {"alpha" : [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 5.0, 10.0],
                  "fit_prior" : [True, False]}
    scoring    : True

RBF:
    # Radial Basis Function
    model_type : classification
    params     : {"kernel" : 'rbf',
                  "probability" : True,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {"C" : np.logspace(-2, 10, 13),
                  "gamma" : np.logspace(-9, 3, 13),
                  "shrinking" : [True, False],
                  "tol" : [0.0005, 0.001, 0.005],
                  "decision_function_shape" : ['ovo', 'ovr']}
    scoring    : False

RF:
    # Random Forest
    model_type : classification
    params     : {"n_estimators" : n_estimators,
                  "max_depth" : 10,
                  "min_samples_split" : 5,
                  "min_samples_leaf" : 3,
                  "bootstrap" : True,
                  "criterion" : 'entropy',
                  "random_state" : seed,
                  "n_jobs" : n_jobs,
                  "verbose" : verbosity}
    grid       : {"n_estimators" : [21, 51, 101, 201, 501],
                  "max_depth" : [5, 7, 10, 20],
                  "min_samples_split" : [2, 3, 5, 10],
                  "min_samples_leaf" : [1, 2, 3],
                  "bootstrap" : [True, False],
                  "criterion" : ['gini', 'entropy']}
    scoring    : True

RFR:
    # Random Forest Regression
    model_type : regression
    params     : {"n_estimators" : n_estimators,
                  "random_state" : seed,
                  "n_jobs" : n_jobs,
                  "verbose" : verbosity}
    grid       : {}
    scoring    : False

SVM:
    # Support Vector Machine
    model_type : classification
    params     : {"probability" : True,
                  "random_state" : seed,
                  "verbose" : verbosity}
    grid       : {"C" : np.logspace(-2, 10, 13),
                  "gamma" : np.logspace(-9, 3, 13),
                  "shrinking" : [True, False],
                  "tol" : [0.0005, 0.001, 0.005],
                  "decision_function_shape" : ['ovo', 'ovr']}
    scoring    : False

TF_DNN:
    # Google TensorFlow Deep Neural Network
    model_type : classification
    params     : {"feature_columns" : [tf.contrib.layers.real_valued_column("", dimension=4)],
                  "n_classes" : 2,
                  "hidden_units" : [20, 40, 20]}
    grid       : {}
    scoring    : False

XGB:
    # XGBoost Binary
    model_type : classification
    params     : {"objective" : 'binary:logistic',
                  "n_estimators" : n_estimators,
                  "seed" : seed,
                  "max_depth" : 6,
                  "learning_rate" : 0.1,
                  "min_child_weight" : 1.1,
                  "subsample" : 0.9,
                  "colsample_bytree" : 0.9,
                  "nthread" : n_jobs,
                  "silent" : True}
    grid       : {"n_estimators" : [21, 51, 101, 201, 501],
                  "max_depth" : [5, 6, 7, 8, 9, 10, 12, 15, 20],
                  "learning_rate" : [0.01, 0.02, 0.05, 0.1, 0.2],
                  "min_child_weight" : [1.0, 1.1],
                  "subsample" : [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                  "colsample_bytree" : [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
    scoring    : False

XGBM:
    # XGBoost Multiclass
    model_type : multiclass
    params     : {"objective" : 'multi:softmax',
                  "n_estimators" : n_estimators,
                  "seed" : seed,
                  "max_depth" : 10,
                  "learning_rate" : 0.1,
                  "min_child_weight" : 1.1,
                  "subsample" : 0.9,
                  "colsample_bytree" : 0.9,
                  "nthread" : n_jobs,
                  "silent" : True}
    grid       : {}
    scoring    : False

XGBR:
    # XGBoost Regression
    model_type : regression
    params     : {"objective" : 'reg:linear',
                  "n_estimators" : n_estimators,
                  "seed" : seed,
                  "max_depth" : 10,
                  "learning_rate" : 0.1,
                  "min_child_weight" : 1.1,
                  "subsample" : 0.9,
                  "colsample_bytree" : 0.9,
                  "nthread" : n_jobs,
                  "silent" : True}
    grid       : {}
    scoring    : False

XT:
    # Extra Trees
    model_type : classification
    params     : {"n_estimators" : n_estimators,
                  "random_state" : seed,
                  "n_jobs" : n_jobs,
                  "verbose" : verbosity}
    grid       : {"n_estimators" : [21, 51, 101, 201, 501, 1001, 2001],
                  "max_features" : ['auto', 'sqrt', 'log2'],
                  "max_depth" : [3, 5, 7, 10, 20, 30],
                  "min_samples_split" : [2, 3],
                  "min_samples_leaf" : [1, 2],
                  "bootstrap" : [True, False],
                  "warm_start" : [True, False]}
    scoring    : True

XTR:
    # Extra Trees Regression
    model_type : regression
    params     : {"n_estimators" : n_estimators,
                  "random_state" : seed,
                  "n_jobs" : n_jobs,
                  "verbose" : verbosity}
    grid       : {}
    scoring    : False

Final Output

This is an example of your file structure after running the pipeline:

project
├── alphapy.log
├── config
│   ├── algos.yml
│   └── model.yml
├── data
├── input
│   ├── test.csv
│   └── train.csv
├── model
│   ├── feature_map_20170325.pkl
│   └── model_20170325.pkl
├── output
│   ├── predictions_20170325.csv
│   ├── probabilities_20170325.csv
│   ├── rankings_20170325.csv
│   └── submission_20170325.csv
└── plots
    ├── calibration_train.png
    ├── confusion_train_RF.png
    ├── confusion_train_XGB.png
    ├── feature_importance_train_RF.png
    ├── feature_importance_train_XGB.png
    ├── learning_curve_train_RF.png
    ├── learning_curve_train_XGB.png
    └── roc_curve_train.png