AlphaPy

AlphaPy Model Pipeline

Model Object Creation

AlphaPy first reads the model.yml file and then displays the model parameters as confirmation that the file was read successfully. As shown in the example below, the Random Forest (RF) and XGBoost (XGB) algorithms are used to build the model. From the model specifications, a Model object will be created.

All of the model parameters are listed in alphabetical order. At a minimum, scan algorithms, features, model_type, and target to verify that you are running the right model. The verbosity parameter controls how much output you see when running the pipeline.

alphapy.log
[12/30/17 23:17:49] INFO	********************************************************************************
[12/30/17 23:17:49] INFO	AlphaPy Start
[12/30/17 23:17:49] INFO	********************************************************************************
[12/30/17 23:17:49] INFO	Model Configuration
[12/30/17 23:17:49] INFO	No Treatments Found
[12/30/17 23:17:49] INFO	MODEL PARAMETERS:
[12/30/17 23:17:49] INFO	algorithms        = ['RF', 'XGB']
[12/30/17 23:17:49] INFO	balance_classes   = True
[12/30/17 23:17:49] INFO	calibration       = False
[12/30/17 23:17:49] INFO	cal_type          = sigmoid
[12/30/17 23:17:49] INFO	calibration_plot  = False
[12/30/17 23:17:49] INFO	clustering        = True
[12/30/17 23:17:49] INFO	cluster_inc       = 3
[12/30/17 23:17:49] INFO	cluster_max       = 30
[12/30/17 23:17:49] INFO	cluster_min       = 3
[12/30/17 23:17:49] INFO	confusion_matrix  = True
[12/30/17 23:17:49] INFO	counts            = True
[12/30/17 23:17:49] INFO	cv_folds          = 3
[12/30/17 23:17:49] INFO	directory         = .
[12/30/17 23:17:49] INFO	extension         = csv
[12/30/17 23:17:49] INFO	drop              = ['PassengerId']
[12/30/17 23:17:49] INFO	encoder           = <Encoders.factorize: 3>
[12/30/17 23:17:50] INFO	esr               = 20
[12/30/17 23:17:50] INFO	factors           = []
[12/30/17 23:17:50] INFO	features [X]      = *
[12/30/17 23:17:50] INFO	feature_selection = False
[12/30/17 23:17:50] INFO	fs_percentage     = 50
[12/30/17 23:17:50] INFO	fs_score_func     = <function f_classif at 0x112f1dd90>
[12/30/17 23:17:50] INFO	fs_uni_grid       = [5, 10, 15, 20, 25]
[12/30/17 23:17:50] INFO	grid_search       = True
[12/30/17 23:17:50] INFO	gs_iters          = 50
[12/30/17 23:17:50] INFO	gs_random         = True
[12/30/17 23:17:50] INFO	gs_sample         = False
[12/30/17 23:17:50] INFO	gs_sample_pct     = 0.200000
[12/30/17 23:17:50] INFO	importances       = True
[12/30/17 23:17:50] INFO	interactions      = True
[12/30/17 23:17:50] INFO	isomap            = False
[12/30/17 23:17:50] INFO	iso_components    = 2
[12/30/17 23:17:50] INFO	iso_neighbors     = 5
[12/30/17 23:17:50] INFO	isample_pct       = 10
[12/30/17 23:17:50] INFO	learning_curve    = True
[12/30/17 23:17:50] INFO	logtransform      = False
[12/30/17 23:17:50] INFO	lv_remove         = True
[12/30/17 23:17:50] INFO	lv_threshold      = 0.100000
[12/30/17 23:17:50] INFO	model_type        = <ModelType.classification: 1>
[12/30/17 23:17:50] INFO	n_estimators      = 51
[12/30/17 23:17:50] INFO	n_jobs            = -1
[12/30/17 23:17:50] INFO	ngrams_max        = 3
[12/30/17 23:17:50] INFO	numpy             = True
[12/30/17 23:17:50] INFO	pca               = False
[12/30/17 23:17:50] INFO	pca_inc           = 1
[12/30/17 23:17:50] INFO	pca_max           = 10
[12/30/17 23:17:50] INFO	pca_min           = 2
[12/30/17 23:17:50] INFO	pca_whiten        = False
[12/30/17 23:17:50] INFO	poly_degree       = 5
[12/30/17 23:17:50] INFO	pvalue_level      = 0.010000
[12/30/17 23:17:50] INFO	rfe               = True
[12/30/17 23:17:50] INFO	rfe_step          = 3
[12/30/17 23:17:50] INFO	roc_curve         = True
[12/30/17 23:17:50] INFO	rounding          = 2
[12/30/17 23:17:50] INFO	sampling          = False
[12/30/17 23:17:50] INFO	sampling_method   = <SamplingMethod.under_random: 12>
[12/30/17 23:17:50] INFO	sampling_ratio    = 0.500000
[12/30/17 23:17:50] INFO	scaler_option     = True
[12/30/17 23:17:50] INFO	scaler_type       = <Scalers.standard: 2>
[12/30/17 23:17:50] INFO	scipy             = False
[12/30/17 23:17:50] INFO	scorer            = roc_auc
[12/30/17 23:17:50] INFO	seed              = 42
[12/30/17 23:17:50] INFO	sentinel          = -1
[12/30/17 23:17:50] INFO	separator         = ,
[12/30/17 23:17:50] INFO	shuffle           = False
[12/30/17 23:17:50] INFO	split             = 0.400000
[12/30/17 23:17:50] INFO	submission_file   = gender_submission
[12/30/17 23:17:50] INFO	submit_probas     = False
[12/30/17 23:17:50] INFO	target [y]        = Survived
[12/30/17 23:17:50] INFO	target_value      = 1
[12/30/17 23:17:50] INFO	treatments        = None
[12/30/17 23:17:50] INFO	tsne              = False
[12/30/17 23:17:50] INFO	tsne_components   = 2
[12/30/17 23:17:50] INFO	tsne_learn_rate   = 1000.000000
[12/30/17 23:17:50] INFO	tsne_perplexity   = 30.000000
[12/30/17 23:17:50] INFO	vectorize         = False
[12/30/17 23:17:50] INFO	verbosity         = 0
[12/30/17 23:17:50] INFO	Creating directory ./data

Data Ingestion

Data are loaded from both the training file and the test file. Any features that you wish to remove from the data are then dropped. Statistics about the shape of the data and the target variable proportions are logged.
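The load-and-drop step can be sketched with the standard library. This is an illustrative sketch, not AlphaPy's internal loader (which works on pandas data frames); the dropped PassengerId column mirrors the drop setting shown in the parameters above, and the inline CSV data are made up.

```python
import csv
import io

def load_frame(text, drop=("PassengerId",)):
    """Read CSV text into a list of dicts, dropping unwanted features."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

train_csv = "PassengerId,Survived,Sex\n1,0,male\n2,1,female\n"
train = load_frame(train_csv)
# The target column survives; the identifier is dropped.
print(sorted(train[0]))   # ['Sex', 'Survived']
```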

alphapy.log
[12/30/17 23:17:50] INFO	Creating directory ./model
[12/30/17 23:17:50] INFO	Creating directory ./output
[12/30/17 23:17:50] INFO	Creating directory ./plots
[12/30/17 23:17:50] INFO	Creating Model
[12/30/17 23:17:50] INFO	Calling Pipeline
[12/30/17 23:17:50] INFO	Training Pipeline
[12/30/17 23:17:50] INFO	Loading Data
[12/30/17 23:17:50] INFO	Loading data from ./input/train.csv
[12/30/17 23:17:50] INFO	Found target Survived in data frame
[12/30/17 23:17:50] INFO	Labels (y) found for Partition.train
[12/30/17 23:17:50] INFO	Loading Data
[12/30/17 23:17:50] INFO	Loading data from ./input/test.csv
[12/30/17 23:17:50] INFO	Target Survived not found in Partition.test
[12/30/17 23:17:50] INFO	Saving New Features in Model
[12/30/17 23:17:50] INFO	Original Feature Statistics
[12/30/17 23:17:50] INFO	Number of Training Rows    : 891
[12/30/17 23:17:50] INFO	Number of Training Columns : 11
[12/30/17 23:17:50] INFO	Unique Training Values for Survived : [0 1]
[12/30/17 23:17:50] INFO	Unique Training Counts for Survived : [549 342]

Feature Processing

There are two stages to feature processing. First, you may want to transform a column of a dataframe into a different format or break up a feature into its respective components. This is known as a treatment, and it is a one-to-many transformation. For example, a date feature can be extracted into day, month, and year.
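The one-to-many shape of a treatment can be sketched as follows. The extract_date helper and its date format are hypothetical, standing in for whatever treatment function you configure; the point is that one input column yields several output columns.

```python
from datetime import datetime

def extract_date(values, fmt="%Y-%m-%d"):
    """One-to-many treatment sketch: split a date feature into components."""
    parts = {"day": [], "month": [], "year": []}
    for v in values:
        d = datetime.strptime(v, fmt)
        parts["day"].append(d.day)
        parts["month"].append(d.month)
        parts["year"].append(d.year)
    return parts

# One date column in, three numerical columns out.
new_cols = extract_date(["1912-04-15", "1912-04-10"])
print(new_cols["year"])   # [1912, 1912]
```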

The next stage is feature type determination, which applies to all features, whether or not a treatment has been applied. The number of unique values determines whether a feature is treated as a factor. If it is a factor, then a specific type of encoding is applied. Otherwise, the feature is generally either text or a number.
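The type-determination logic can be sketched like this. The threshold and the integer encoding are simplifying assumptions for illustration; AlphaPy chooses the actual encoder from the model.yml settings.

```python
def classify_feature(values, factor_threshold=10):
    """Sketch of type determination: few unique values => factor."""
    uniques = sorted(set(values), key=str)
    if len(uniques) <= factor_threshold:
        # Factor: encode each distinct value as an integer code.
        codes = {v: i for i, v in enumerate(uniques)}
        return "factor", [codes[v] for v in values]
    if all(isinstance(v, (int, float)) for v in values):
        return "number", list(values)
    return "text", list(values)

kind, encoded = classify_feature(["male", "female", "male"])
print(kind, encoded)   # factor [1, 0, 1]
```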

Feature Flowchart

In the example below, each feature’s type is identified along with its number of unique values. For factors, a specific type of encoding is selected, as specified in the model.yml file. For text, you can choose either count vectorization with TF-IDF or plain factorization. Numerical features have both imputation and log-transformation options.

alphapy.log
[12/30/17 23:17:50] INFO	Number of Testing Rows     : 418
[12/30/17 23:17:50] INFO	Number of Testing Columns  : 11
[12/30/17 23:17:50] INFO	Original Features : Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
[12/30/17 23:17:50] INFO	Feature Count     : 11
[12/30/17 23:17:50] INFO	Applying Treatments
[12/30/17 23:17:50] INFO	No Treatments Specified
[12/30/17 23:17:50] INFO	New Feature Count : 11
[12/30/17 23:17:50] INFO	Dropping Features: ['PassengerId']
[12/30/17 23:17:50] INFO	Original Feature Count : 11
[12/30/17 23:17:50] INFO	Reduced Feature Count  : 10
[12/30/17 23:17:50] INFO	Writing data frame to ./input/train_20171230.csv
[12/30/17 23:17:50] INFO	Writing data frame to ./input/test_20171230.csv
[12/30/17 23:17:50] INFO	Creating Cross-Tabulations
[12/30/17 23:17:50] INFO	Original Features : Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')
[12/30/17 23:17:50] INFO	Feature Count     : 10
[12/30/17 23:17:50] INFO	Creating Count Features
[12/30/17 23:17:50] INFO	NA Counts
[12/30/17 23:17:50] INFO	Number Counts
[12/30/17 23:17:50] INFO	New Feature Count : 21
[12/30/17 23:17:50] INFO	Creating Base Features
[12/30/17 23:17:50] INFO	Feature 1: Pclass is a numerical feature of type int64 with 3 unique values
[12/30/17 23:17:50] INFO	Feature 2: Name is a text feature [12:82] with 1307 unique values
[12/30/17 23:17:50] INFO	Feature 2: Name => Factorization
[12/30/17 23:17:50] INFO	Feature 3: Sex is a text feature [4:6] with 2 unique values
[12/30/17 23:17:50] INFO	Feature 3: Sex => Factorization
[12/30/17 23:17:50] INFO	Feature 4: Age is a numerical feature of type float64 with 99 unique values
[12/30/17 23:17:50] INFO	Feature 5: SibSp is a numerical feature of type int64 with 7 unique values
[12/30/17 23:17:50] INFO	Feature 6: Parch is a numerical feature of type int64 with 8 unique values
[12/30/17 23:17:50] INFO	Feature 7: Ticket is a text feature [3:18] with 929 unique values
[12/30/17 23:17:50] INFO	Feature 7: Ticket => Factorization
[12/30/17 23:17:50] INFO	Feature 8: Fare is a numerical feature of type float64 with 282 unique values
[12/30/17 23:17:50] INFO	Feature 9: Cabin is a text feature [1:15] with 187 unique values
[12/30/17 23:17:50] INFO	Feature 9: Cabin => Factorization
[12/30/17 23:17:50] INFO	Feature 10: Embarked is a text feature [1:1] with 4 unique values
[12/30/17 23:17:50] INFO	Feature 10: Embarked => Factorization
[12/30/17 23:17:50] INFO	Feature 11: nan_count is a numerical feature of type int64 with 3 unique values
[12/30/17 23:17:50] INFO	Feature 12: count_0 is a numerical feature of type int64 with 4 unique values
[12/30/17 23:17:50] INFO	Feature 13: count_1 is a numerical feature of type int64 with 4 unique values
[12/30/17 23:17:50] INFO	Feature 14: count_2 is a numerical feature of type int64 with 4 unique values
[12/30/17 23:17:50] INFO	Feature 15: count_3 is a numerical feature of type int64 with 4 unique values
[12/30/17 23:17:50] INFO	Feature 16: count_4 is a numerical feature of type int64 with 3 unique values
[12/30/17 23:17:50] INFO	Feature 17: count_5 is a numerical feature of type int64 with 2 unique values
[12/30/17 23:17:50] INFO	Feature 18: count_6 is a numerical feature of type int64 with 2 unique values
[12/30/17 23:17:50] INFO	Feature 19: count_7 is a numerical feature of type int64 with 2 unique values
[12/30/17 23:17:50] INFO	Feature 20: count_8 is a numerical feature of type int64 with 3 unique values
[12/30/17 23:17:50] INFO	Feature 21: count_9 is a numerical feature of type int64 with 3 unique values
[12/30/17 23:17:50] INFO	New Feature Count : 21
[12/30/17 23:17:50] INFO	Scaling Base Features
[12/30/17 23:17:50] INFO	Creating NumPy Features
[12/30/17 23:17:50] INFO	NumPy Feature: sum
[12/30/17 23:17:50] INFO	NumPy Feature: mean
[12/30/17 23:17:50] INFO	NumPy Feature: standard deviation
[12/30/17 23:17:50] INFO	NumPy Feature: variance
[12/30/17 23:17:50] INFO	NumPy Feature Count : 4
[12/30/17 23:17:50] INFO	New Feature Count : 25
[12/30/17 23:17:50] INFO	Creating Clustering Features
[12/30/17 23:17:50] INFO	Cluster Minimum   : 3
[12/30/17 23:17:50] INFO	Cluster Maximum   : 30
[12/30/17 23:17:50] INFO	Cluster Increment : 3
[12/30/17 23:17:50] INFO	k = 3
[12/30/17 23:17:50] INFO	k = 6
[12/30/17 23:17:50] INFO	k = 9
[12/30/17 23:17:50] INFO	k = 12
[12/30/17 23:17:50] INFO	k = 15
[12/30/17 23:17:50] INFO	k = 18

As AlphaPy runs, you can see the number of new features generated along the way, depending on which options you selected in the features section of the model.yml file. For interactions, you specify the polynomial degree and the percentage of the interactions to retain in the model. Be careful with the polynomial degree, as the number of interaction terms grows combinatorially with it.
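To see how quickly a full polynomial expansion blows up, count its terms (the standard binomial count for all monomials up to a given degree, excluding the bias term). With the 35 features and degree 5 from the log above, a full expansion would be enormous, which is why only a percentage is kept.

```python
from math import comb

def poly_feature_count(n_features, degree):
    """Number of terms in a full polynomial expansion up to `degree`,
    excluding the constant bias term."""
    return comb(n_features + degree, degree) - 1

print(poly_feature_count(2, 2))    # 5  (x, y, x^2, xy, y^2)
print(poly_feature_count(35, 5))   # 658007
```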

alphapy.log
[12/30/17 23:17:50] INFO	k = 21
[12/30/17 23:17:50] INFO	k = 24
[12/30/17 23:17:50] INFO	k = 27
[12/30/17 23:17:51] INFO	k = 30
[12/30/17 23:17:51] INFO	Clustering Feature Count : 10
[12/30/17 23:17:51] INFO	New Feature Count : 35
[12/30/17 23:17:51] INFO	Saving New Features in Model
[12/30/17 23:17:51] INFO	Creating Interactions
[12/30/17 23:17:51] INFO	Initial Feature Count  : 35
[12/30/17 23:17:51] INFO	Generating Polynomial Features
[12/30/17 23:17:51] INFO	Interaction Percentage : 10
[12/30/17 23:17:51] INFO	Polynomial Degree      : 5
[12/30/17 23:17:51] INFO	Polynomial Feature Count : 15

Feature Selection

There are two types of feature selection:

  • Univariate Selection
  • Recursive Feature Elimination (RFE)

Univariate selection finds the most informative features based on a percentile of the highest scores, using a scoring function such as the ANOVA F-statistic or the chi-squared statistic. There are scoring functions for both classification and regression.

RFE is more time-consuming, but it has cross-validation with a configurable scoring function and step size. We also recommend setting a seed for reproducible results, as the resulting support (the selected features and their rankings) can vary dramatically across runs.
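The univariate scoring idea can be shown with a small pure-Python sketch of the one-way ANOVA F-statistic for a binary target (the same quantity scikit-learn's f_classif computes): features whose class-conditional means differ relative to their within-class spread score higher. The toy data here are made up.

```python
def f_classif_score(feature, target):
    """One-way ANOVA F-statistic sketch for a binary (0/1) target."""
    groups = [[x for x, y in zip(feature, target) if y == label]
              for label in (0, 1)]
    mean = lambda v: sum(v) / len(v)
    grand = mean(feature)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = len(groups) - 1, len(feature) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

y = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]   # separates the classes
noise = [1.0, 3.0, 2.0, 1.1, 2.9, 2.1]          # does not
print(f_classif_score(informative, y) > f_classif_score(noise, y))   # True
```

Univariate selection then keeps the features in the top percentile of these scores.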

alphapy.log
[12/30/17 23:17:51] INFO	Getting Class Weights
[12/30/17 23:17:51] INFO	Class Weight for target Survived [1]: 1.605263
[12/30/17 23:17:51] INFO	Getting All Estimators
[12/30/17 23:17:51] INFO	Algorithm Configuration

Model Estimation

A classification model is highly dependent on the class proportions. If you’re trying to predict a rare pattern with high accuracy, then training for accuracy will be useless: a naive classifier that always predicts the majority class will be right most of the time. As a result, AlphaPy gives data scientists the ability to undersample majority classes or oversample minority classes, along with techniques that combine the two, e.g., SMOTE with undersampling, or ensemble sampling.
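Random undersampling of the majority class can be sketched as follows. The ratio parameter mirrors the sampling_ratio setting above, but this is a simplified stand-in, not AlphaPy's sampler (which wraps imbalanced-learn style methods).

```python
import random

def undersample(rows, labels, ratio=0.5, seed=42):
    """Shrink the majority class until minority/majority >= ratio."""
    random.seed(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(len(minority) / ratio))
    keep = random.sample(majority, n_keep)
    idx = sorted(minority + keep)
    return [rows[i] for i in idx], [labels[i] for i in idx]

X = list(range(10))
y = [0] * 8 + [1] * 2          # 8 majority, 2 minority
X_bal, y_bal = undersample(X, y)
print(y_bal.count(0), y_bal.count(1))   # 4 2
```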

Before estimation, we apply any sampling and shuffling options to improve cross-validation. For example, time series data are ordered, and you may want to eliminate that dependency.

At the beginning of the estimation phase, we read in all of the algorithms from the algos.yml file and then select those algorithms used in this particular model. The process is iterative for each algorithm: initial fit, feature selection, grid search, and final fit.
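The randomized grid search in the log below (gs_iters random candidates, keep the best cross-validated score) can be sketched in a few lines. The parameter grid values echo the log; the scoring lambda is a hypothetical stand-in for cross-validated roc_auc.

```python
import random

def random_search(param_grid, score_fn, n_iter=10, seed=42):
    """Randomized-search sketch: sample settings, keep the best score."""
    random.seed(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {k: random.choice(v) for k, v in param_grid.items()}
        s = score_fn(params)
        if s > best_score:
            best_score, best_params = s, params
    return best_score, best_params

grid = {"max_depth": [5, 7, 12], "n_estimators": [21, 51, 101]}
# Hypothetical scorer; in AlphaPy this is cross-validation with the scorer
# named in model.yml.
score, params = random_search(grid, lambda p: p["max_depth"] / p["n_estimators"])
# The winning setting always comes from the grid itself.
```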

alphapy.log
[12/30/17 23:17:51] INFO	New Total Feature Count  : 50
[12/30/17 23:17:51] INFO	Saving New Features in Model
[12/30/17 23:17:51] INFO	Removing Low-Variance Features
[12/30/17 23:17:51] INFO	Low-Variance Threshold  : 0.10
[12/30/17 23:17:51] INFO	Original Feature Count  : 50
[12/30/17 23:17:51] INFO	Reduced Feature Count   : 50
[12/30/17 23:17:51] INFO	Saving New Features in Model
[12/30/17 23:17:51] INFO	Skipping Shuffling
[12/30/17 23:17:51] INFO	Skipping Sampling
[12/30/17 23:17:51] INFO	Getting Class Weights
[12/30/17 23:17:51] INFO	Class Weight for target Survived [1]: 1.605263
[12/30/17 23:17:51] INFO	Getting All Estimators
[12/30/17 23:17:51] INFO	Algorithm Configuration
[12/30/17 23:17:51] INFO	Selecting Models
[12/30/17 23:17:51] INFO	Algorithm: RF
[12/30/17 23:17:51] INFO	Fitting Initial Model
[12/30/17 23:17:51] INFO	Recursive Feature Elimination with CV
[12/30/17 23:18:14] INFO	RFECV took 22.72 seconds for step 3 and 3 folds
[12/30/17 23:18:14] INFO	Algorithm: RF, Selected Features: 20, Ranking: [ 2  1  1  1  5  9  1  1  2  6  8  7  8  6  7 10 11 11 11 10 10  1  1  1  1
  9  6  9  5  1  5  4  2  4  1  1  1  3  7  1  1  8  1  1  4  1  1  3  3  1]
[12/30/17 23:18:14] INFO	Randomized Grid Search
[12/30/17 23:19:08] INFO	Grid Search took 54.03 seconds for 50 candidate parameter settings.
[12/30/17 23:19:08] INFO	Model with rank: 1
[12/30/17 23:19:08] INFO	Mean validation score: 0.863 (std: 0.014)
[12/30/17 23:19:08] INFO	Parameters: {'est__n_estimators': 501, 'est__min_samples_split': 5, 'est__min_samples_leaf': 3, 'est__max_depth': 7, 'est__criterion': 'entropy', 'est__bootstrap': True}
[12/30/17 23:19:08] INFO	Model with rank: 2
[12/30/17 23:19:08] INFO	Mean validation score: 0.862 (std: 0.015)
[12/30/17 23:19:08] INFO	Parameters: {'est__n_estimators': 201, 'est__min_samples_split': 10, 'est__min_samples_leaf': 2, 'est__max_depth': 7, 'est__criterion': 'entropy', 'est__bootstrap': True}
[12/30/17 23:19:08] INFO	Model with rank: 3
[12/30/17 23:19:08] INFO	Mean validation score: 0.861 (std: 0.014)
[12/30/17 23:19:08] INFO	Parameters: {'est__n_estimators': 101, 'est__min_samples_split': 2, 'est__min_samples_leaf': 3, 'est__max_depth': 7, 'est__criterion': 'entropy', 'est__bootstrap': True}
[12/30/17 23:19:08] INFO	Algorithm: RF, Best Score: 0.8627, Best Parameters: {'est__n_estimators': 501, 'est__min_samples_split': 5, 'est__min_samples_leaf': 3, 'est__max_depth': 7, 'est__criterion': 'entropy', 'est__bootstrap': True}
[12/30/17 23:19:08] INFO	Final Model Predictions for RF
[12/30/17 23:19:08] INFO	Skipping Calibration
[12/30/17 23:19:08] INFO	Making Predictions
[12/30/17 23:19:09] INFO	Predictions Complete
[12/30/17 23:19:09] INFO	Algorithm: XGB
[12/30/17 23:19:09] INFO	Fitting Initial Model
[12/30/17 23:19:09] INFO	No RFE Available for XGB
[12/30/17 23:19:09] INFO	Randomized Grid Search
[12/30/17 23:19:32] INFO	Grid Search took 23.44 seconds for 50 candidate parameter settings.
[12/30/17 23:19:32] INFO	Model with rank: 1
[12/30/17 23:19:32] INFO	Mean validation score: 0.863 (std: 0.020)
[12/30/17 23:19:32] INFO	Parameters: {'est__subsample': 0.6, 'est__n_estimators': 21, 'est__min_child_weight': 1.1, 'est__max_depth': 12, 'est__learning_rate': 0.1, 'est__colsample_bytree': 0.7}
[12/30/17 23:19:32] INFO	Model with rank: 2
[12/30/17 23:19:32] INFO	Mean validation score: 0.856 (std: 0.014)
[12/30/17 23:19:32] INFO	Parameters: {'est__subsample': 0.5, 'est__n_estimators': 51, 'est__min_child_weight': 1.0, 'est__max_depth': 8, 'est__learning_rate': 0.01, 'est__colsample_bytree': 0.7}
[12/30/17 23:19:32] INFO	Model with rank: 3

Model Evaluation

Each model is evaluated using all of the metrics available in scikit-learn to give you a sense of how other scoring functions compare. Metrics are calculated on the training data for every algorithm. If test labels are present, then metrics are also calculated for the test data.
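A few of the classification metrics in the log can be spelled out in pure Python (the real pipeline uses scikit-learn's implementations; the toy labels here are made up):

```python
def binary_metrics(y_true, y_pred):
    """Sketch of a few of the metrics computed for each algorithm."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }

m = binary_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(m["accuracy"], m["recall"])   # 0.75 1.0
```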

alphapy.log
[12/30/17 23:19:32] INFO	Final Model Predictions for XGB
[12/30/17 23:19:32] INFO	Skipping Calibration
[12/30/17 23:19:32] INFO	Making Predictions
[12/30/17 23:19:32] INFO	Predictions Complete
[12/30/17 23:19:32] INFO	Blending Models
[12/30/17 23:19:32] INFO	Blending Start: 2017-12-30 23:19:32.734086
[12/30/17 23:19:32] INFO	Blending Complete: 0:00:00.010781
[12/30/17 23:19:32] INFO	================================================================================
[12/30/17 23:19:32] INFO	Metrics for: Partition.train
[12/30/17 23:19:32] INFO	--------------------------------------------------------------------------------
[12/30/17 23:19:32] INFO	Algorithm: RF
[12/30/17 23:19:32] INFO	accuracy: 0.895622895623
[12/30/17 23:19:32] INFO	adjusted_rand_score: 0.623145355109
[12/30/17 23:19:32] INFO	average_precision: 0.939127507197
[12/30/17 23:19:32] INFO	confusion_matrix: [[530  19]
 [ 74 268]]
[12/30/17 23:19:32] INFO	explained_variance: 0.574782432706
[12/30/17 23:19:32] INFO	f1: 0.852146263911
[12/30/17 23:19:32] INFO	mean_absolute_error: 0.104377104377
[12/30/17 23:19:32] INFO	median_absolute_error: 0.0
[12/30/17 23:19:32] INFO	neg_log_loss: 0.299193724458
[12/30/17 23:19:32] INFO	neg_mean_squared_error: 0.104377104377
[12/30/17 23:19:32] INFO	precision: 0.933797909408
[12/30/17 23:19:32] INFO	r2: 0.558671268335
[12/30/17 23:19:32] INFO	recall: 0.783625730994
[12/30/17 23:19:32] INFO	roc_auc: 0.954665047561
[12/30/17 23:19:32] INFO	--------------------------------------------------------------------------------
[12/30/17 23:19:32] INFO	Algorithm: XGB
[12/30/17 23:19:32] INFO	accuracy: 0.915824915825
[12/30/17 23:19:32] INFO	adjusted_rand_score: 0.689583531462
[12/30/17 23:19:32] INFO	average_precision: 0.958117084885
[12/30/17 23:19:32] INFO	confusion_matrix: [[532  17]
 [ 58 284]]
[12/30/17 23:19:32] INFO	explained_variance: 0.653042746514
[12/30/17 23:19:32] INFO	f1: 0.883359253499
[12/30/17 23:19:32] INFO	mean_absolute_error: 0.0841750841751
[12/30/17 23:19:32] INFO	median_absolute_error: 0.0
[12/30/17 23:19:32] INFO	neg_log_loss: 0.295991903662
[12/30/17 23:19:32] INFO	neg_mean_squared_error: 0.0841750841751
[12/30/17 23:19:32] INFO	precision: 0.943521594684

Model Selection

Blended Model

Blended Model Creation
alphapy.log
[12/30/17 23:19:32] INFO	Mean validation score: 0.855 (std: 0.023)
[12/30/17 23:19:32] INFO	Parameters: {'est__subsample': 0.5, 'est__n_estimators': 21, 'est__min_child_weight': 1.0, 'est__max_depth': 7, 'est__learning_rate': 0.05, 'est__colsample_bytree': 0.6}
[12/30/17 23:19:32] INFO	Algorithm: XGB, Best Score: 0.8627, Best Parameters: {'est__subsample': 0.6, 'est__n_estimators': 21, 'est__min_child_weight': 1.1, 'est__max_depth': 12, 'est__learning_rate': 0.1, 'est__colsample_bytree': 0.7}

Best Model

The best model is selected by comparing the scores of:

  • one model per algorithm, and
  • the blended model

Depending on the scoring function, best model selection is based on whether the score must be minimized or maximized. For example, the Area Under the Curve (AUC) must be maximized, while log loss must be minimized.
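The direction-aware comparison can be sketched as follows. The MINIMIZE set is an illustrative subset (the real mapping lives in AlphaPy's scorer tables), and the scores echo the roc_auc values from the log below.

```python
# Scorers where a smaller value is better; everything else is maximized.
MINIMIZE = {"log_loss", "mean_absolute_error", "mean_squared_error"}

def best_model(scores, scorer="roc_auc"):
    """Pick the winning model name according to the scorer's direction."""
    pick = min if scorer in MINIMIZE else max
    return pick(scores, key=scores.get)

scores = {"RF": 0.9547, "XGB": 0.9711, "BLEND": 0.9683}
print(best_model(scores))                                  # XGB
print(best_model({"RF": 0.31, "XGB": 0.29}, "log_loss"))   # XGB
```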

Best Model Selection

When more than one algorithm is scored in the estimation stage, the final step is to combine the predictions of all of them into a blended model, i.e., the predictions of the independent models become training features. For classification, AlphaPy uses logistic regression; for regression, it uses ridge regression.
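A minimal sketch of that stacking step, fitting a tiny logistic regression by gradient descent on the per-model probabilities (AlphaPy uses scikit-learn's LogisticRegression for this; the probability columns here are made up):

```python
import math

def blend(prob_columns, y, epochs=2000, lr=0.5):
    """Fit a tiny logistic regression on stacked model probabilities."""
    w = [0.0] * len(prob_columns)
    b = 0.0
    rows = list(zip(*prob_columns))        # one row of probabilities per sample
    for _ in range(epochs):
        for x, t in zip(rows, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - t                      # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: 1.0 / (1.0 + math.exp(-(b + sum(wi * xi
                                                     for wi, xi in zip(w, x)))))

rf  = [0.9, 0.8, 0.2, 0.1]   # hypothetical RF probabilities
xgb = [0.8, 0.9, 0.1, 0.2]   # hypothetical XGB probabilities
blended = blend([rf, xgb], [1, 1, 0, 0])
print(blended((0.85, 0.9)) > 0.5)   # True
```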

alphapy.log
[12/30/17 23:19:32] INFO	precision: 0.943521594684
[12/30/17 23:19:32] INFO	r2: 0.644089732528
[12/30/17 23:19:32] INFO	recall: 0.830409356725
[12/30/17 23:19:32] INFO	roc_auc: 0.971127728246
[12/30/17 23:19:32] INFO	================================================================================
[12/30/17 23:19:32] INFO	Metrics for: Partition.test
[12/30/17 23:19:32] INFO	No labels for generating Partition.test metrics
[12/30/17 23:19:32] INFO	================================================================================
[12/30/17 23:19:32] INFO	Selecting Best Model
[12/30/17 23:19:32] INFO	Scoring for: Partition.train

Plot Generation

The user has the option of generating the following plots:

  • Calibration Plot
  • Confusion Matrix
  • Feature Importances
  • Learning Curve
  • ROC Curve

All plots are saved to the plots directory of your project.

alphapy.log
[12/30/17 23:19:32] INFO	Scoring for: Partition.train
[12/30/17 23:19:32] INFO	Best Model Selection Start: 2017-12-30 23:19:32.779281
[12/30/17 23:19:32] INFO	Scoring RF Model
[12/30/17 23:19:32] INFO	Scoring XGB Model
[12/30/17 23:19:32] INFO	Scoring BLEND Model
[12/30/17 23:19:32] INFO	Best Model is XGB with a roc_auc score of 0.9711
[12/30/17 23:19:32] INFO	Best Model Selection Complete: 0:00:00.000801
[12/30/17 23:19:32] INFO	================================================================================
[12/30/17 23:19:32] INFO	Generating Plots for partition: train
[12/30/17 23:19:32] INFO	Generating Calibration Plot
[12/30/17 23:19:32] INFO	Calibration for Algorithm: RF
[12/30/17 23:19:32] INFO	Calibration for Algorithm: XGB
[12/30/17 23:19:32] INFO	Writing plot to ./plots/calibration_train.png
[12/30/17 23:19:33] INFO	Generating Confusion Matrices
[12/30/17 23:19:33] INFO	Confusion Matrix for Algorithm: RF
[12/30/17 23:19:33] INFO	Confusion Matrix:
[12/30/17 23:19:33] INFO	[[530  19]
 [ 74 268]]
[12/30/17 23:19:33] INFO	Writing plot to ./plots/confusion_train_RF.png
[12/30/17 23:19:33] INFO	Confusion Matrix for Algorithm: XGB
[12/30/17 23:19:33] INFO	Confusion Matrix:
[12/30/17 23:19:33] INFO	[[532  17]
 [ 58 284]]
[12/30/17 23:19:33] INFO	Writing plot to ./plots/confusion_train_XGB.png
[12/30/17 23:19:33] INFO	Generating ROC Curves
[12/30/17 23:19:33] INFO	ROC Curve for Algorithm: RF
[12/30/17 23:19:33] INFO	ROC Curve for Algorithm: XGB
[12/30/17 23:19:33] INFO	Writing plot to ./plots/roc_curve_train.png
[12/30/17 23:19:33] INFO	Generating Learning Curves
[12/30/17 23:19:33] INFO	Algorithm Configuration
[12/30/17 23:19:33] INFO	Learning Curve for Algorithm: RF
[12/30/17 23:19:35] INFO	Writing plot to ./plots/learning_curve_train_RF.png
[12/30/17 23:19:35] INFO	Learning Curve for Algorithm: XGB
[12/30/17 23:19:36] INFO	Writing plot to ./plots/learning_curve_train_XGB.png
[12/30/17 23:19:36] INFO	Generating Feature Importance Plots
[12/30/17 23:19:36] INFO	Feature Importances for Algorithm: RF
[12/30/17 23:19:36] INFO	Feature Ranking:
[12/30/17 23:19:36] INFO	1. Feature 2 (0.106345)
[12/30/17 23:19:36] INFO	2. Feature 36 (0.074083)
[12/30/17 23:19:36] INFO	3. Feature 39 (0.055330)
[12/30/17 23:19:36] INFO	4. Feature 3 (0.050339)
[12/30/17 23:19:36] INFO	5. Feature 7 (0.049228)
[12/30/17 23:19:36] INFO	6. Feature 23 (0.044868)
[12/30/17 23:19:36] INFO	7. Feature 22 (0.042925)
[12/30/17 23:19:36] INFO	8. Feature 42 (0.042850)
[12/30/17 23:19:36] INFO	9. Feature 21 (0.039954)
[12/30/17 23:19:36] INFO	10. Feature 24 (0.038563)
[12/30/17 23:19:36] INFO	Writing plot to ./plots/feature_importance_train_RF.png
[12/30/17 23:19:36] INFO	Feature Importances for Algorithm: XGB
[12/30/17 23:19:36] INFO	Feature Ranking:
[12/30/17 23:19:36] INFO	1. Feature 2 (0.142857)
[12/30/17 23:19:36] INFO	2. Feature 3 (0.120301)
[12/30/17 23:19:36] INFO	3. Feature 0 (0.116541)
[12/30/17 23:19:36] INFO	4. Feature 7 (0.109023)
[12/30/17 23:19:36] INFO	5. Feature 31 (0.105263)

Calibration Plot

Confusion Matrix

Feature Importances

Learning Curve

ROC Curve

Final Results

  • The model object is stored in Pickle (.pkl) format in the model directory of the project. The model is loaded later in prediction mode.
  • The feature map is also stored in Pickle (.pkl) format in the model directory and is restored for prediction mode.
  • Predictions are stored in the project’s output directory.
  • Sorted rankings of predictions are stored in output.
  • Any submission files are stored in output.
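The save/restore round trip for these artifacts is plain pickle. In this sketch a dict stands in for the fitted estimator, and the file name echoes the log below; prediction mode performs the load half.

```python
import os
import pickle
import tempfile

# A dict stands in for the fitted model predictor object.
predictor = {"algorithm": "XGB", "score": 0.9711}

path = os.path.join(tempfile.mkdtemp(), "model_20171230.pkl")
with open(path, "wb") as f:
    pickle.dump(predictor, f)

# Later, prediction mode restores the object before scoring new data.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored["algorithm"])   # XGB
```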
alphapy.log
[12/30/17 23:19:36] INFO	6. Feature 23 (0.075188)
[12/30/17 23:19:36] INFO	7. Feature 40 (0.071429)
[12/30/17 23:19:36] INFO	8. Feature 8 (0.033835)
[12/30/17 23:19:36] INFO	9. Feature 4 (0.030075)
[12/30/17 23:19:36] INFO	10. Feature 9 (0.030075)
[12/30/17 23:19:36] INFO	Writing plot to ./plots/feature_importance_train_XGB.png
[12/30/17 23:19:36] INFO	================================================================================
[12/30/17 23:19:36] INFO	Saving Model Predictor
[12/30/17 23:19:36] INFO	Writing model predictor to ./model/model_20171230.pkl
[12/30/17 23:19:36] INFO	Saving Feature Map
[12/30/17 23:19:36] INFO	Writing feature map to ./model/feature_map_20171230.pkl
[12/30/17 23:19:36] INFO	Loading data from ./input/test.csv