Market Prediction Tutorial

MarketFlow Running Time: Approximately 6 minutes

Machine learning subsumes technical analysis because collectively, technical analysis is just a set of features for market prediction. We can use machine learning as a feature blender for moving averages, indicators such as RSI and ADX, and even representations of chart formations such as double tops and head-and-shoulder patterns.

We are not directly predicting net return in our models, although that is the ultimate goal. By characterizing the market with models, we can increase the Return On Investment (ROI). We have a wide range of dependent or target variables from which to choose, not just net return. There is more power in building a classifier rather than a more traditional regression model, so we want to define binary conditions such as whether or not today is going to be a trend day, rather than a numerical prediction of today’s return.

In this tutorial, we will train a model that predicts whether or not the next day will have a larger-than-average range. This is important for deciding which system to deploy on the prediction day. If our model gives us predictive power, then we can filter out those days where trading a given system is a losing strategy.

Step 1: From the examples directory, change your directory:

cd "Trading Model"

Before running MarketFlow, let’s briefly review the configuration files in the config directory:

market.yml:: The MarketFlow configuration file
model.yml:: The AlphaPy configuration file

In market.yml, we limit our model to six stocks in the target group test, going back 2000 trading days. You can define any group of stock symbols in the groups section, and then set the target_group attribute in the market section to the name of that group.

This is a 1-day forecast, but we also use those features that can be calculated at the market open, such as gap information in the leaders section. In the features section, we define many variables for moving averages, historical range, RSI, volatility, and volume.

market.yml

market:
    create_model    : True
    data_fractal    : 1d
    data_history    : 500
    forecast_period : 1
    fractal         : 1d
    lag_period      : 1
    leaders         : ['gap', 'gapbadown', 'gapbaup', 'gapdown', 'gapup']
    predict_history : 100
    schema          : yahoo
    subject         : stock
    target_group    : test

groups:
    all  : ['aaoi', 'aapl', 'acia', 'adbe', 'adi', 'adp', 'agn', 'aig', 'akam',
            'algn', 'alk', 'alxn', 'amat', 'amba', 'amd', 'amgn', 'amt', 'amzn',
            'antm', 'arch', 'asml', 'athn', 'atvi', 'auph', 'avgo', 'axp', 'ayx',
            'azo', 'ba', 'baba', 'bac', 'bby', 'bidu', 'biib', 'brcd', 'bvsn',
            'bwld', 'c', 'cacc', 'cara', 'casy', 'cat', 'cde', 'celg', 'cern',
            'chkp', 'chtr', 'clvs', 'cme', 'cmg', 'cof', 'cohr', 'comm', 'cost',
            'cpk', 'crm', 'crus', 'csco', 'ctsh', 'ctxs', 'csx', 'cvs', 'cybr',
            'data', 'ddd', 'deck', 'dgaz', 'dia', 'dis', 'dish', 'dnkn', 'dpz',
            'drys', 'dust', 'ea', 'ebay', 'edc', 'edz', 'eem', 'elli', 'eog',
            'esrx', 'etrm', 'ewh', 'ewt', 'expe', 'fang', 'fas', 'faz', 'fb',
            'fcx', 'fdx', 'ffiv', 'fit', 'five', 'fnsr', 'fslr', 'ftnt', 'gddy',
            'gdx', 'gdxj', 'ge', 'gild', 'gld', 'glw', 'gm', 'googl', 'gpro',
            'grub', 'gs', 'gwph', 'hal', 'has', 'hd', 'hdp', 'hlf', 'hog', 'hum',
            'ibb', 'ibm', 'ice', 'idxx', 'ilmn', 'ilmn', 'incy', 'intc', 'intu',
            'ip', 'isrg', 'iwm', 'ivv', 'iwf', 'iwm', 'jack', 'jcp', 'jdst', 'jnj',
            'jnpr', 'jnug', 'jpm', 'kite', 'klac', 'ko', 'kss', 'labd', 'labu',
            'len', 'lite', 'lmt', 'lnkd', 'lrcx', 'lulu', 'lvs', 'mbly', 'mcd',
            'mchp', 'mdy', 'meoh', 'mnst', 'mo', 'momo', 'mon', 'mrk', 'ms', 'msft',
            'mtb', 'mu', 'nflx', 'nfx', 'nke', 'ntap', 'ntes', 'ntnx', 'nugt',
            'nvda', 'nxpi', 'nxst', 'oii', 'oled', 'orcl', 'orly', 'p', 'panw',
            'pcln', 'pg', 'pm', 'pnra', 'prgo', 'pxd', 'pypl', 'qcom', 'qqq',
            'qrvo', 'rht', 'sam', 'sbux', 'sds', 'sgen', 'shld', 'shop', 'sig',
            'sina', 'siri', 'skx', 'slb', 'slv', 'smh', 'snap', 'sncr', 'soda',
            'splk', 'spy', 'stld', 'stmp', 'stx', 'svxy', 'swks', 'symc', 't',
            'tbt', 'teva', 'tgt', 'tho', 'tlt', 'tmo', 'tna', 'tqqq', 'trip',
            'tsla', 'ttwo', 'tvix', 'twlo', 'twtr', 'tza', 'uaa', 'ugaz', 'uhs',
            'ulta', 'ulti', 'unh', 'unp', 'upro', 'uri', 'ups', 'uri', 'uthr',
            'utx', 'uvxy', 'v', 'veev', 'viav', 'vlo', 'vmc', 'vrsn', 'vrtx', 'vrx',
            'vwo', 'vxx', 'vz', 'wday', 'wdc', 'wfc', 'wfm', 'wmt', 'wynn', 'x',
            'xbi', 'xhb', 'xiv', 'xle', 'xlf', 'xlk', 'xlnx', 'xom', 'xlp', 'xlu',
            'xlv', 'xme', 'xom', 'wix', 'yelp', 'z']
    etf  : ['dia', 'dust', 'edc', 'edz', 'eem', 'ewh', 'ewt', 'fas', 'faz',
            'gld', 'hyg', 'iwm', 'ivv', 'iwf', 'jnk', 'mdy', 'nugt', 'qqq',
            'sds', 'smh', 'spy', 'tbt', 'tlt', 'tna', 'tvix', 'tza', 'upro',
            'uvxy', 'vwo', 'vxx', 'xhb', 'xiv', 'xle', 'xlf', 'xlk', 'xlp',
            'xlu', 'xlv', 'xme']
    tech : ['aapl', 'adbe', 'amat', 'amgn', 'amzn', 'avgo', 'baba', 'bidu',
            'brcd', 'csco', 'ddd', 'emc', 'expe', 'fb', 'fit', 'fslr', 'goog',
            'intc', 'isrg', 'lnkd', 'msft', 'nflx', 'nvda', 'pcln', 'qcom',
            'qqq', 'tsla', 'twtr']
    test : ['aapl', 'amzn', 'goog', 'fb', 'nvda', 'tsla']

features: ['abovema_3', 'abovema_5', 'abovema_10', 'abovema_20', 'abovema_50',
           'adx', 'atr', 'bigdown', 'bigup', 'diminus', 'diplus', 'doji',
           'gap', 'gapbadown', 'gapbaup', 'gapdown', 'gapup',
           'hc', 'hh', 'ho', 'hl', 'lc', 'lh', 'll', 'lo', 'hookdown', 'hookup',
           'inside', 'outside', 'madelta_3', 'madelta_5', 'madelta_7', 'madelta_10',
           'madelta_12', 'madelta_15', 'madelta_18', 'madelta_20', 'madelta',
           'net', 'netdown', 'netup', 'nr_3', 'nr_4', 'nr_5', 'nr_7', 'nr_8',
           'nr_10', 'nr_18', 'roi', 'roi_2', 'roi_3', 'roi_4', 'roi_5', 'roi_10',
           'roi_20', 'rr_1_4', 'rr_1_7', 'rr_1_10', 'rr_2_5', 'rr_2_7', 'rr_2_10',
           'rr_3_8', 'rr_3_14', 'rr_4_10', 'rr_4_20', 'rr_5_10', 'rr_5_20',
           'rr_5_30', 'rr_6_14', 'rr_6_25', 'rr_7_14', 'rr_7_35', 'rr_8_22',
           'rrhigh', 'rrlow', 'rrover', 'rrunder', 'rsi_3', 'rsi_4', 'rsi_5',
           'rsi_6', 'rsi_8', 'rsi_10', 'rsi_14', 'sep_3_3', 'sep_5_5', 'sep_8_8',
           'sep_10_10', 'sep_14_14', 'sep_21_21', 'sep_30_30', 'sep_40_40',
           'sephigh', 'seplow', 'trend', 'vma', 'vmover', 'vmratio', 'vmunder',
           'volatility_3', 'volatility_5', 'volatility', 'volatility_20',
           'wr_2', 'wr_3', 'wr', 'wr_5', 'wr_6', 'wr_7', 'wr_10']

aliases:
    atr        : 'ma_truerange'
    aver       : 'ma_hlrange'
    cma        : 'ma_close'
    cmax       : 'highest_close'
    cmin       : 'lowest_close'
    hc         : 'higher_close'
    hh         : 'higher_high'
    hl         : 'higher_low'
    ho         : 'higher_open'
    hmax       : 'highest_high'
    hmin       : 'lowest_high'
    lc         : 'lower_close'
    lh         : 'lower_high'
    ll         : 'lower_low'
    lo         : 'lower_open'
    lmax       : 'highest_low'
    lmin       : 'lowest_low'
    net        : 'net_close'
    netdown    : 'down_net'
    netup      : 'up_net'
    omax       : 'highest_open'
    omin       : 'lowest_open'
    rmax       : 'highest_hlrange'
    rmin       : 'lowest_hlrange'
    rr         : 'maratio_hlrange'
    rixc       : 'rindex_close_high_low'
    rixo       : 'rindex_open_high_low'
    roi        : 'netreturn_close'
    rsi        : 'rsi_close'
    sepma      : 'ma_sep'
    vma        : 'ma_volume'
    vmratio    : 'maratio_volume'
    upmove     : 'net_high'

variables:
    abovema    : 'close > cma_50'
    belowma    : 'close < cma_50'
    bigup      : 'rrover & sephigh & netup'
    bigdown    : 'rrover & sephigh & netdown'
    doji       : 'sepdoji & rrunder'
    hookdown   : 'open > high[1] & close < close[1]'
    hookup     : 'open < low[1] & close > close[1]'
    inside     : 'low > low[1] & high < high[1]'
    madelta    : '(close - cma_50) / atr_10'
    nr         : 'hlrange == rmin_4'
    outside    : 'low < low[1] & high > high[1]'
    roihigh    : 'roi_5 >= 5'
    roilow     : 'roi_5 < -5'
    roiminus   : 'roi_5 < 0'
    roiplus    : 'roi_5 > 0'
    rrhigh     : 'rr_1_10 >= 1.2'
    rrlow      : 'rr_1_10 <= 0.8'
    rrover     : 'rr_1_10 >= 1.0'
    rrunder    : 'rr_1_10 < 1.0'
    sep        : 'rixc_1 - rixo_1'
    sepdoji    : 'abs(sep) <= 15'
    sephigh    : 'abs(sep_1_1) >= 70'
    seplow     : 'abs(sep_1_1) <= 30'
    trend      : 'rrover & sephigh'
    vmover     : 'vmratio >= 1'
    vmunder    : 'vmratio < 1'
    volatility : 'atr_10 / close'
    wr         : 'hlrange == rmax_4'

In each of the tutorials, we experiment with different options in model.yml to run AlphaPy. Here, we first apply univariate feature selection and then run a random forest classifier with Recursive Feature Elimination, including Cross-Validation (RFECV). When you choose RFECV, the process takes much longer, so if you want to see more logging, then increase the verbosity level in the pipeline section.

Since stock prices are time series data, we apply the runs_test function to twelve features in the treatments section. Treatments are powerful because you can write any function to extrapolate new features from existing ones. AlphaPy provides some of these functions in the alphapy.features module, but it can also import external functions as well.

Our target variable is rrover, the ratio of the 1-day range to the 10-day average high/low range. If that ratio is greater than or equal to 1.0, then the value of rrover is True. This is what we are trying to predict.

model.yml

project:
    directory         : .
    file_extension    : csv
    submission_file   :
    submit_probas     : False

data:
    drop              : ['date', 'tag', 'open', 'high', 'low', 'close', 'volume', 'adjclose',
                         'low[1]', 'high[1]', 'net', 'close[1]', 'rmin_3', 'rmin_4', 'rmin_5',
                         'rmin_7', 'rmin_8', 'rmin_10', 'rmin_18', 'pval', 'mval', 'vma',
                         'rmax_2', 'rmax_3', 'rmax_4', 'rmax_5', 'rmax_6', 'rmax_7', 'rmax_10']
    features          : '*'
    sampling          :
        option        : True
        method        : under_random
        ratio         : 0.5
    sentinel          : -1
    separator         : ','
    shuffle           : True
    split             : 0.4
    target            : rrover
    target_value      : True

model:
    algorithms        : ['RF']
    balance_classes   : True
    calibration       :
        option        : False
        type          : isotonic
    cv_folds          : 3
    estimators        : 501
    feature_selection :
        option        : True
        percentage    : 50
        uni_grid      : [5, 10, 15, 20, 25]
        score_func    : f_classif
    grid_search       :
        option        : False
        iterations    : 100
        random        : True
        subsample     : True
        sampling_pct  : 0.25
    pvalue_level      : 0.01
    rfe               :
        option        : True
        step          : 10
    scoring_function  : 'roc_auc'
    type              : classification

features:
    clustering        :
        option        : False
        increment     : 3
        maximum       : 30
        minimum       : 3
    counts            :
        option        : False
    encoding          :
        rounding      : 3
        type          : factorize
    factors           : []
    interactions      :
        option        : True
        poly_degree   : 2
        sampling_pct  : 5
    isomap            :
        option        : False
        components    : 2
        neighbors     : 5
    logtransform      :
        option        : False
    numpy             :
        option        : False
    pca               :
        option        : False
        increment     : 3
        maximum       : 15
        minimum       : 3
        whiten        : False
    scaling           :
        option        : True
        type          : standard
    scipy             :
        option        : False
    text              :
        ngrams        : 1
        vectorize     : False
    tsne              :
        option        : False
        components    : 2
        learning_rate : 1000.0
        perplexity    : 30.0
    variance          :
        option        : True
        threshold     : 0.1

treatments:
    doji              : ['alphapy.features', 'runs_test', ['all'], 18]
    hc                : ['alphapy.features', 'runs_test', ['all'], 18]
    hh                : ['alphapy.features', 'runs_test', ['all'], 18]
    hl                : ['alphapy.features', 'runs_test', ['all'], 18]
    ho                : ['alphapy.features', 'runs_test', ['all'], 18]
    rrhigh            : ['alphapy.features', 'runs_test', ['all'], 18]
    rrlow             : ['alphapy.features', 'runs_test', ['all'], 18]
    rrover            : ['alphapy.features', 'runs_test', ['all'], 18]
    rrunder           : ['alphapy.features', 'runs_test', ['all'], 18]
    sephigh           : ['alphapy.features', 'runs_test', ['all'], 18]
    seplow            : ['alphapy.features', 'runs_test', ['all'], 18]
    trend             : ['alphapy.features', 'runs_test', ['all'], 18]

pipeline:
    number_jobs       : -1
    seed              : 10231
    verbosity         : 0

plots:
    calibration       : True
    confusion_matrix  : True
    importances       : True
    learning_curve    : True
    roc_curve         : True

xgboost:
    stopping_rounds   : 20

Step 2: Now, let’s run MarketFlow:

mflow --pdate 2017-10-01

As mflow runs, you will see the progress of the workflow, and the logging output is saved in market_flow.log. When the workflow completes, your project structure will look like this, with a different datestamp:

Trading Model
├── market_flow.log
├── config
    ├── algos.yml
    ├── market.yml
    ├── model.yml
└── data
└── input
    ├── test_20170420.csv
    ├── test.csv
    ├── train_20170420.csv
    ├── train.csv
└── model
    ├── feature_map_20170420.pkl
    ├── model_20170420.pkl
└── output
    ├── predictions_20170420.csv
    ├── probabilities_20170420.csv
    ├── rankings_20170420.csv
└── plots
    ├── calibration_test.png
    ├── calibration_train.png
    ├── confusion_test_RF.png
    ├── confusion_train_RF.png
    ├── feature_importance_train_RF.png
    ├── learning_curve_train_RF.png
    ├── roc_curve_test.png
    ├── roc_curve_train.png

Let’s look at the results in the plots directory. Since our scoring function was roc_auc, we examine the ROC Curve first. The AUC is approximately 0.61, which is not very high but in the context of the stock market, we may still be able to derive some predictive power. Further, we are running the model on a relatively small sample of stocks, as denoted by the jittery line of the ROC Curve.

We can benefit from more samples, as the learning curve shows that the training and cross-validation lines have yet to converge.

The good news is that even with a relatively small number of testing points, the Reliability Curve slopes upward from left to right, with the dotted line denoting a perfect classifier.

To get better accuracy, we can raise our threshold to find the best candidates, since they are ranked by probability, but this also means limiting our pool of stocks. Let’s take a closer look at the rankings file.

Step 3: From the command line, enter:

jupyter notebook

Step 4: Click on the notebook named:

A Trading Model.ipynb

Step 5: Run the commands in the notebook, making sure that when you read in the rankings file, change the date to match the result from the ls command.

Conclusion We can predict large-range days with some confidence, but only at a higher probability threshold. This is important for choosing the correct system on any given day. We can achieve better results with more data, so we recommend expanding the stock universe, e.g., a group with at least 100 members going five years back.