Kaggle Tutorial
AlphaPy Running Time: Approximately 2 minutes

The most popular introductory project on Kaggle is Titanic, in which you apply machine learning to predict which passengers were most likely to survive the sinking of the famous ship. In this tutorial, we will run AlphaPy to train a model, generate predictions, and create a submission file so you can see where you land on the Kaggle leaderboard.
Note
AlphaPy is a good starting point for most Kaggle competitions. We also use it for other competitions, such as the crowd-sourced hedge fund Numerai.
Step 1: From the examples directory, change to the Kaggle project directory:

cd Kaggle
Before running AlphaPy, let's briefly review the model.yml file in the config directory. We will submit the actual predictions (1 vs. 0) instead of the probabilities, so submit_probas is set to False. All features will be included except for PassengerId. The target variable is Survived, the label we are trying to predict accurately.

We'll compare random forests and XGBoost, run recursive feature elimination and a grid search, and select the best model. Note that a blended model of all the algorithms is also a candidate for best model. The details of each algorithm are located in the algos.yml file.
project:
    directory : .
    file_extension : csv
    submission_file : 'gender_submission'
    submit_probas : False

data:
    drop : ['PassengerId']
    features : '*'
    sampling :
        option : False
        method : under_random
        ratio : 0.5
    sentinel : -1
    separator : ','
    shuffle : False
    split : 0.4
    target : Survived
    target_value : 1

model:
    algorithms : ['RF', 'XGB']
    balance_classes : True
    calibration :
        option : False
        type : sigmoid
    cv_folds : 3
    estimators : 51
    feature_selection :
        option : False
        percentage : 50
        uni_grid : [5, 10, 15, 20, 25]
        score_func : f_classif
    grid_search :
        option : True
        iterations : 50
        random : True
        subsample : False
        sampling_pct : 0.2
    pvalue_level : 0.01
    rfe :
        option : True
        step : 3
    scoring_function : roc_auc
    type : classification

features:
    clustering :
        option : True
        increment : 3
        maximum : 30
        minimum : 3
    counts :
        option : True
    encoding :
        rounding : 2
        type : factorize
    factors : []
    interactions :
        option : True
        poly_degree : 5
        sampling_pct : 10
    isomap :
        option : False
        components : 2
        neighbors : 5
    logtransform :
        option : False
    numpy :
        option : True
    pca :
        option : False
        increment : 1
        maximum : 10
        minimum : 2
        whiten : False
    scaling :
        option : True
        type : standard
    scipy :
        option : False
    text :
        ngrams : 3
        vectorize : False
    tsne :
        option : False
        components : 2
        learning_rate : 1000.0
        perplexity : 30.0
    variance :
        option : True
        threshold : 0.1

pipeline:
    number_jobs : -1
    seed : 42
    verbosity : 0

plots:
    calibration : True
    confusion_matrix : True
    importances : True
    learning_curve : True
    roc_curve : True

xgboost:
    stopping_rounds : 20
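If you edit model.yml by hand, a malformed file will stop the pipeline early. One optional way to sanity-check it before launching AlphaPy is to parse it yourself; the sketch below (not part of AlphaPy) assumes PyYAML is installed and is shown running on a fragment of the configuration above.

```python
# Optional sanity check for model.yml: parse it with PyYAML and confirm
# a few key settings before running the pipeline. Hypothetical helper,
# not part of AlphaPy itself.
import yaml

def check_model_config(text):
    """Parse model.yml text and return (algorithms, target, submit_probas)."""
    cfg = yaml.safe_load(text)
    return (cfg['model']['algorithms'],
            cfg['data']['target'],
            cfg['project']['submit_probas'])

# Demo on a fragment of the configuration above; in practice you would
# read the real file with open('config/model.yml').read().
fragment = """
project:
    submit_probas : False
data:
    target : Survived
model:
    algorithms : ['RF', 'XGB']
"""
print(check_model_config(fragment))   # (['RF', 'XGB'], 'Survived', False)
```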
Step 2: Now, we are ready to run AlphaPy. Enter the following command:

alphapy

As alphapy runs, you will see the progress of the workflow, and the logging output is saved in alphapy.log. When the workflow completes, your project structure will look like this, with a different datestamp:
Kaggle
├── alphapy.log
├── config
│   ├── algos.yml
│   └── model.yml
├── data
├── input
│   ├── test.csv
│   └── train.csv
├── model
│   ├── feature_map_20170420.pkl
│   └── model_20170420.pkl
├── output
│   ├── predictions_20170420.csv
│   ├── probabilities_20170420.csv
│   ├── rankings_20170420.csv
│   └── submission_20170420.csv
└── plots
    ├── calibration_train.png
    ├── confusion_train_RF.png
    ├── confusion_train_XGB.png
    ├── feature_importance_train_RF.png
    ├── feature_importance_train_XGB.png
    ├── learning_curve_train_RF.png
    ├── learning_curve_train_XGB.png
    └── roc_curve_train.png
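Because every run writes datestamped files, repeated runs accumulate artifacts in the output directory. A small helper can pick out the newest one by its filename datestamp; this is a hypothetical convenience function, not part of AlphaPy, and it relies on the YYYYMMDD datestamp sorting lexicographically.

```python
import glob
import os

def latest_artifact(directory, prefix, ext='csv'):
    """Return the newest datestamped file matching prefix, e.g. the most
    recent submission_YYYYMMDD.csv, or None if nothing matches.

    Lexicographic sort works here because YYYYMMDD datestamps sort
    chronologically as strings.
    """
    pattern = os.path.join(directory, '%s_*.%s' % (prefix, ext))
    paths = sorted(glob.glob(pattern))
    return paths[-1] if paths else None

# Example: latest_artifact('output', 'submission') would return
# 'output/submission_20170420.csv' for the run shown above.
```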
Step 3: To see how your model ranks on the Kaggle leaderboard, upload the submission file from the output directory to the website https://www.kaggle.com/c/titanic/submit.
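Kaggle rejects malformed submissions, so it can save a wasted upload to check the file first. For Titanic, a submission has a PassengerId column and a Survived column of 0/1 values (which is why submit_probas is False above). The sketch below is an optional, stdlib-only check of that two-column format, not an official Kaggle validator.

```python
import csv

def looks_like_titanic_submission(path):
    """Return True if the CSV has the PassengerId,Survived header and
    every Survived value is the string '0' or '1'."""
    with open(path, newline='') as f:
        rows = csv.reader(f)
        if next(rows) != ['PassengerId', 'Survived']:
            return False
        return all(row[1] in ('0', '1') for row in rows)
```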
