The stock market is historically known as a near-impossible system to solve, and it continues to prove difficult as COVID-19 impacts the financial stability of the world, and more specifically the United States. A good broad measure of how the stock market is performing is the S&P 500, a stock market index that tracks the stock performance of 500 large companies listed on stock exchanges in the United States. Being able to predict fluctuations and movement of the S&P 500 on both a micro and macro level can generate massive profits for those who invest their money accordingly.
In this tutorial we will:
- Collect daily price data and technical indicators for SPY and the top 10 S&P 500 companies
- Explore and visualize daily returns and their distributions
- Build a labeled feature set categorizing daily S&P 500 movement as rally, stressed, or neutral
- Train, tune, and compare machine learning classifiers for predicting next-day movement
The data that we're going to need for this project is S&P 500 data (in this case we will be using the SPY ETF as a proxy) as well as data for the top 10 companies in the S&P 500. The top 10 companies as of December 9, 2020 are: AAPL, MSFT, AMZN, FB, GOOGL, GOOG, BRK.B, JNJ, JPM, and V. The first thing that we need to do is import some data. For this, we will be using the Alpha Vantage API. Documentation can be seen here: https://www.alphavantage.co/documentation/
import pandas as pd
import time
# This is a personal API key that is used exclusively through this project. Hidden for privacy reasons.
Key = "XXXXXXXXXXXXXX"
# This function returns the short and long SMA, RSI, and ADX technical indicators wrapped in a dataframe, along with the daily open, high, low, close, and volume.
def get_data(symbol="SPY", SMA_short_pd=12, SMA_long_pd=20, RSI_pd=10, ADX_pd=10):
    # Getting the short timeframe simple moving average
    output = pd.read_csv("https://www.alphavantage.co/query?function=SMA&symbol=" + symbol + "&interval=daily&time_period=" + str(SMA_short_pd) + "&series_type=open&datatype=csv&apikey=" + Key)
    output = output.rename(columns={'SMA': 'short_SMA'})
    # Getting the long timeframe simple moving average
    output['long_SMA'] = pd.read_csv("https://www.alphavantage.co/query?function=SMA&symbol=" + symbol + "&interval=daily&time_period=" + str(SMA_long_pd) + "&series_type=open&datatype=csv&apikey=" + Key)['SMA']
    # Getting the Relative Strength Index
    output['RSI'] = pd.read_csv("https://www.alphavantage.co/query?function=RSI&symbol=" + symbol + "&interval=daily&time_period=" + str(RSI_pd) + "&series_type=open&datatype=csv&apikey=" + Key)['RSI']
    # Getting the Average Directional Index
    output['ADX'] = pd.read_csv("https://www.alphavantage.co/query?function=ADX&symbol=" + symbol + "&interval=daily&time_period=" + str(ADX_pd) + "&datatype=csv&apikey=" + Key)['ADX']
    # Getting the daily price data and volume
    output[['open', 'high', 'low', 'close', 'volume']] = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=' + symbol + '&outputsize=full&apikey=' + Key + '&datatype=csv')[['open', 'high', 'low', 'close', 'volume']]
    return output.dropna()
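As a quick sanity check (a hedged example, assuming the API key above is valid and the Alpha Vantage rate limit has not been hit), we can pull a single symbol and inspect the first few rows:
# Pull one symbol to verify the function and the API key work as expected.
spy_check = get_data(symbol="SPY")
spy_check.head()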
# This function returns all of the data returned by get_data, for the top 10 S&P companies as well as SPY.
def get_top_data(SMA_short_pd=12, SMA_long_pd=20, RSI_pd=10, ADX_pd=10):
    symbol_list = ['SPY', 'AAPL', 'MSFT', 'AMZN', 'FB', 'GOOGL', 'GOOG', 'BRK.B', 'JNJ', 'JPM', 'V']
    output = {}
    for symbol in symbol_list:
        print(symbol)
        output[symbol] = get_data(symbol=symbol, SMA_short_pd=SMA_short_pd, SMA_long_pd=SMA_long_pd, RSI_pd=RSI_pd, ADX_pd=ADX_pd)
        # Sadly we have to sleep between symbols, otherwise the API starts rejecting our calls.
        # We can skip the sleep after the last symbol, 'V'.
        if symbol != 'V':
            time.sleep(60)
    return output
from os import path, mkdir
# One of the problems with the API that we're using is that it is pretty slow. Because of this, we're going to store our data in CSVs to prevent further calls to it.
if not (path.exists('stock_data') and path.isdir('stock_data')):
    mkdir('stock_data')
data_dict = get_top_data()
for symbol, data in data_dict.items():
    name = "stock_data/" + str(symbol) + "_data.csv"
    data.to_csv(path_or_buf=name)
import numpy as np
import glob
import pandas as pd
import matplotlib.pyplot as plt
# This table stores the closing prices for each of our tickers
close_price_table = pd.DataFrame()
for filepath in glob.iglob('stock_data/*.csv'):
    # Ignore the fundamental data
    if 'funda' in str(filepath):
        continue
    else:
        # Read and store the data for each ticker
        temp = pd.read_csv(filepath)
        # Extract the ticker symbol from the filename, e.g. 'stock_data/AAPL_data.csv' -> 'AAPL'
        start = 'stock_data/'
        end = '_data.csv'
        s = filepath
        company = s[s.find(start) + len(start):s.rfind(end)]
        close_price_table[company] = temp['close']
# Adding a date column to our table.
# Since all the API pulls happen at the same time, the data for each ticker has the same dates.
close_price_table['date'] = pd.read_csv('stock_data/AAPL_data.csv')['time']
# Reverse the rows so the table runs in chronological order
close_price_table = close_price_table.reindex(index=close_price_table.index[::-1]).reset_index()
close_price_table.drop('index', inplace=True, axis=1)
close_price_table.head()
On day $t$, the percent change is given by $\text{Percent Change}_{t} = \frac{P_{t} - P_{t-1}}{P_{t-1}}$, where $P_{i}$ is the closing price on day $i$.
pct_change = close_price_table.drop('date',axis=1).pct_change()
pct_change.plot(figsize=(12,8))
The cumulative sum at row $t$ is given by $C_{t} = \sum_{i=1}^{t} PC_{i}$, where $PC_{i}$ is the $i$'th day's percent change in daily returns.
cumulative = pct_change.cumsum().plot(figsize=(12,8))
We would like to predict future price fluctuations from an analysis of historical data. To understand how daily stock returns are distributed and behave, we would like to see a distribution plot of the price changes. This is done using a facet grid from seaborn, which gives us good intuition about the univariate distributions as they are plotted year by year.
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Extract the year from each date so we can facet the distributions by year
pct_change['year'] = pd.to_datetime(close_price_table['date']).dt.year
dfm = pct_change.melt('year')
g = sns.FacetGrid(dfm, col='year', col_wrap = 6)
g = (g.map(sns.distplot, 'value'))
plt.xlim(-0.5, 0.5)
plt.show()
We see from the above facet wrap of distributions that the daily return for each year centers around zero or just below it, and that the daily returns are approximately normally distributed. We can also see that the standard deviation of daily returns increases during financially turbulent years, such as 2008 during the housing crisis and 2020 during the COVID-19 pandemic.
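To put a rough number on that observation (a small additional check, not part of the original output), we can compute the standard deviation of SPY's daily returns by year; turbulent years such as 2008 and 2020 should stand out:
# Yearly standard deviation of SPY daily returns; larger values indicate more volatile years.
pct_change.groupby('year')['SPY'].std()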
Since we would like to see how fluctuations in the price data of these top 10 stocks impact the movement of the overall S&P 500, we will categorize each day's S&P 500 movement as rally, stressed, or neutral. If the fluctuations of the top 10 stocks' price data have a statistically significant relationship with S&P 500 movement, that could help us determine which stocks account for most of the movement. After ascertaining that there is a statistically significant correlation between the top 10 stocks and the S&P 500, we will build a feature set of technical indicators often used in the investment world: the short and long Simple Moving Averages, RSI (Relative Strength Index), and ADX (Average Directional Index) for each of our top 10 stocks, and use these to predict movement in the S&P 500. This feature set combines trend, momentum, and volume indicators, in keeping with a multiple-indicator strategy. Finally, we will compare the performance of several machine learning models on this prediction task, with cross-validation and hyperparameter tuning of the models.
You can read more about multi-indicator strategy here.
# This dataframe will store the timeseries of the indicators for each symbol.
feature_table = pd.DataFrame()
for filepath in glob.iglob('stock_data/*.csv'):
    # Ignore the fundamental data
    if 'funda' in str(filepath):
        continue
    else:
        # Read the features from each file into our feature_table dataframe
        temp = pd.read_csv(filepath)
        # Getting the company name from the filename
        start = 'stock_data/'
        end = '_data.csv'
        s = filepath
        company = s[s.find(start) + len(start):s.rfind(end)]
        feature_table["{}_short_SMA".format(company)] = temp['short_SMA']
        feature_table["{}_long_SMA".format(company)] = temp['long_SMA']
        feature_table["{}_RSI".format(company)] = temp['RSI']
        feature_table["{}_ADX".format(company)] = temp['ADX']
        feature_table["{}_volume".format(company)] = temp['volume']
# Adding a date column to our table.
# Since all the API pulls happen at the same time, the data for each ticker has the same dates.
feature_table['date'] = pd.read_csv('stock_data/AAPL_data.csv')['time']
feature_table = feature_table.reindex(index=feature_table.index[::-1]).reset_index()
feature_table.drop('index', inplace=True, axis=1)
# The raw CSV is stored newest-first, so reverse it to chronological order
# (matching the now-reversed feature_table) before computing SPY's daily returns.
feature_table['SPY'] = pd.read_csv('stock_data/SPY_data.csv')['close'][::-1].reset_index(drop=True).pct_change()
print("Shape of feature set: {}".format(feature_table.shape))
feature_table.head()
def labeler(x):
    # Here we use a daily positive change of .4% to specify rally.
    if x > 0.004:
        return "rally"
    # A change of -.4% specifies stressed.
    elif x < -0.004:
        return "stressed"
    # Otherwise, we will classify the market as neutral.
    else:
        return "neutral"

feature_table['label'] = feature_table['SPY'].apply(labeler)
# Drop SPY daily returns from feature set
feature_table.drop('SPY', axis=1, inplace=True)
feature_table['label'].value_counts()
We see that the label distributions for neutral, stressed, and rally are fairly well-balanced and reflect a relatively realistic distribution of market conditions.
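To see that balance as proportions rather than raw counts (a small extra check), we can normalize the counts:
# Express the label distribution as a fraction of all trading days.
feature_table['label'].value_counts(normalize=True)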
# Shift the label back by 1 day so that each row's features predict the next day's movement rather than same-day classification
feature_table['label'] = feature_table['label'].shift(-1)
# The shift leaves a NaN in the last row, and some indicators and tickers have different shapes, so we drop the NAs
feature_table = feature_table.dropna()
# Drop the label and date as they are not being used as features for our models.
X = feature_table.drop(['label','date'], axis=1)
y = feature_table['label']
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))
We have 55 indicator features (5 indicators for each of the 11 tickers, including SPY) to input into our classification models.
sns.countplot(feature_table['label'])
feature_list = ['short_SMA', 'long_SMA', 'RSI', 'ADX', 'volume']
for feat in feature_list:
    temp = X.filter(like=feat, axis=1)
    sns.heatmap(temp.corr(), annot=True, fmt='.1g')
    plt.title("Correlation heatmap of the {} indicator for each stock in the top 10 and SPY".format(feat))
    plt.show()
From the correlation heatmaps, we see that the technical indicators of the top 10 stocks, in general, have a strong positive correlation with the associated indicator of the SPY index. Most notably, the two moving averages and RSI show a very strong positive correlation compared to ADX and volume. However, since all of them demonstrate some explanatory power, we will keep all 5 indicator features for each stock as inputs to our machine learning classifiers.
We will be comparing the cross_val_score (mean of accuracies reported by 10-fold cross-validation) of 3 different classifiers: Random Forest, K-Nearest Neighbors, and Multilayer Perceptron in predicting 1-day future movement in the S&P500.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np
cross_val_score(estimator, X, y, cv=10) returns an array of 10 accuracy values (one per fold of a 10-fold cross-validation), which we then average into a single cross-validation score for each model.
classifier_names = ["Nearest Neighbors", "Random Forest", "Multilayer Perceptron"]
classifiers = [KNeighborsClassifier(), RandomForestClassifier(), MLPClassifier()]
print("--- Base Model Classifiers ---")
for name, clf in zip(classifier_names, classifiers):
    print("CV Score for {} Classifier: {}".format(name, np.mean(cross_val_score(clf, X, y, cv=10))))
We see that without any hyperparameter tuning, the Multilayer Perceptron classifier performs the best of the three, with K-Nearest Neighbors close behind and the Random Forest performing quite poorly. Even so, the Multilayer Perceptron and K-Nearest Neighbors classifiers only reached an accuracy just below random chance (1/3, or about 0.333) for predicting S&P 500 movement.
Hyperparameter tuning (or hyperparameter optimization) is the process of selecting optimal parameters for a machine learning estimator/model in order to gain additional predictive performance. A hyperparameter is a controlling factor in how a machine learning object is constructed, and it needs to be tuned well for the model to solve the problem optimally. The traditional way of performing hyperparameter optimization is grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. We implement a GridSearchCV (cross-validated grid search) object to optimally tune the hyperparameters of the K-Nearest Neighbors, Random Forest, and Multilayer Perceptron classifiers. For more information regarding hyperparameter tuning specifically in the realm of scikit-learn, read more here.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. For more information regarding the parameters and intuition behind the K-Nearest Neighbors Algorithm: K-Nearest Neighbors
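As a toy illustration of that majority-vote idea (a minimal sketch on made-up 2-D points, separate from the scikit-learn classifier we actually tune below):
from collections import Counter
# Toy training set: four 2-D points with made-up labels, purely for intuition.
train_pts = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
train_lbl = np.array(['neutral', 'rally', 'rally', 'stressed'])
def knn_vote(query, k=3):
    # Rank training points by Euclidean distance to the query...
    dists = np.linalg.norm(train_pts - query, axis=1)
    nearest = train_lbl[np.argsort(dists)[:k]]
    # ...and return the majority label among the k nearest neighbors.
    return Counter(nearest).most_common(1)[0][0]
print(knn_vote(np.array([1.0, 0.9])))  # -> 'rally'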
leaf_size = list(range(1,20))
n_neighbors = list(range(1,20))
p = [1, 2]
# Collect the hyperparameter lists into a grid dictionary
parameter_space = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
knn = KNeighborsClassifier()
# Perform 10-fold cross-validated grid search on KNN hyperparameters
knn_clf = GridSearchCV(knn, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_knn = knn_clf.fit(X,y).best_estimator_
print(best_knn)
Random Forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model's prediction. This makes it an "ensemble" of trees: a collection of trees working together to obtain better predictive performance than any single tree could. A large number of relatively uncorrelated trees operating as a committee will outperform any of the individual constituent trees, as the trees protect each other from their individual errors. For more information about the parameters and intuition behind the Random Forest algorithm: Random Forest
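To make the voting idea concrete, here is a minimal sketch of a hand-rolled ensemble: a few shallow decision trees, each fit on a bootstrap resample of our data, voting on the labels (purely illustrative; the tuned RandomForestClassifier below does this properly):
from sklearn.tree import DecisionTreeClassifier
# Fit a handful of shallow trees, each on a bootstrap resample of (X, y).
rng = np.random.RandomState(0)
trees = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X.iloc[idx], y.iloc[idx]))
# Each tree votes on every row; the ensemble prediction is the per-column majority vote.
votes = pd.DataFrame([tree.predict(X) for tree in trees])
print(votes.mode(axis=0).iloc[0].head())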
rfc = RandomForestClassifier()
parameter_space = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Perform 10-fold cross-validated grid search on random forest hyperparameters
rfc_clf = GridSearchCV(rfc, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_rfc = rfc_clf.fit(X, y).best_estimator_
print(best_rfc)
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f(\cdot): R^{m} \rightarrow R^{o}$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = x_{1}, x_{2}, x_{3}, \ldots$ and a target $y$, it can learn a non-linear function approximator for either classification or regression. MLP trains using backpropagation. More precisely, it trains using some form of gradient descent, with the gradients calculated via backpropagation. For classification, it minimizes the cross-entropy loss function, giving a vector of probability estimates $P(y|x)$ per sample $x$. For more information about the MLP Classifier and neural networks in general, see: Neural Networks and MLP Classifier
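For intuition on the cross-entropy loss the MLP minimizes, here is a small hand-computed example (illustrative probability values only): the loss for one sample is $-\log$ of the probability the model assigns to the true class, so confident correct predictions are penalized far less than uncertain ones.
# Cross-entropy for a single sample with three classes; the true class is index 0.
p_confident = np.array([0.8, 0.1, 0.1])
p_uncertain = np.array([0.34, 0.33, 0.33])
print(-np.log(p_confident[0]))  # ~0.22: low loss for a confident, correct prediction
print(-np.log(p_uncertain[0]))  # ~1.08: higher loss for an uncertain prediction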
mlpc = MLPClassifier()
parameter_space = {
    'max_iter': [1500, 2000, 2500],
    'hidden_layer_sizes': list(zip(np.arange(10, 60, 10), np.arange(10, 60, 10))),
    'activation': ['tanh', 'relu', 'logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': 10.0 ** -np.arange(1, 10, 2),
    'learning_rate': ['constant', 'adaptive']
}
# Perform 10-fold CV Grid Search over MLP parameter space to tune parameters
mlp_clf = GridSearchCV(mlpc, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_mlp = mlp_clf.fit(X, y).best_estimator_
print(best_mlp)
hyperparameter_tuned_names = ['Best K-Nearest Neighbors', 'Best Random Forest', 'Best Multilayer Perceptron']
hyperparameter_tuned_classifiers = [best_knn, best_rfc, best_mlp]
print('--- Hyperparameter Tuned Classifier Scores ---')
for name, clf in zip(hyperparameter_tuned_names, hyperparameter_tuned_classifiers):
    print("CV Score for {} Classifier: {}".format(name, np.mean(cross_val_score(clf, X, y, cv=10))))
We can see from the hyperparameter tuning results that the Multilayer Perceptron (neural network) is the best classifier of the three, with a cross-validation score of 0.39, slightly better than random guessing on the movement of the S&P 500.
A confusion matrix represents the combinations of predicted and actual values for a machine learning classification model. Graphically, it shows how often the model correctly and incorrectly predicts each label/category relative to the true value.
from sklearn.metrics import confusion_matrix
def plot_confusion_heatmap(y_true, y_pred):
    # confusion_matrix orders the labels alphabetically: neutral, rally, stressed
    cm = confusion_matrix(y_true, y_pred)
    labels = ['Neutral', 'Rally', 'Stressed']
    df_cm = pd.DataFrame(cm, index=labels, columns=labels)
    ax = sns.heatmap(df_cm, cmap="Blues", annot=True, fmt=".3g")
    ax.set(xlabel="Predicted Values", ylabel="Actual Values")
    return ax
y_true = y
y_pred = best_mlp.predict(X)
fig, axes = plt.subplots(1, 1, figsize=(8, 8))
plot_confusion_heatmap(y_true, y_pred)
plt.show()
From the confusion matrix, we see that the multilayer perceptron classifier weights its predictions heavily towards "neutral" for the S&P 500's 1-day future movement. This makes sense: "neutral" falls between "rally" and "stressed" numerically, so misclassifications of "rally" and "stressed" are far more likely to land in "neutral" than in the opposite extreme, since predicting "rally" for an actual "stressed" day (or vice versa) would require skipping over the entire "neutral" range. The classifier appears to have identified this, and when in doubt it makes the safe choice of "neutral". So even though the classes are balanced in our original data, we end up seeing an imbalanced set of predictions for the movement.
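One way to see this hedging behavior directly (an extra diagnostic, assuming the tuned MLP and the predictions from above) is to compare the model's maximum predicted class probability on rows it labels "neutral" against the rest:
# Maximum class probability per row; lower values mean a less confident prediction.
probs = best_mlp.predict_proba(X)
max_prob = probs.max(axis=1)
print("Mean confidence on 'neutral' predictions:", max_prob[y_pred == 'neutral'].mean())
print("Mean confidence on other predictions:", max_prob[y_pred != 'neutral'].mean())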
Learning curves demonstrate how a model's accuracy changes with the number of training samples it receives, and also display the scalability of the model by plotting its fit time as the training set grows.
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(1, 2, figsize=(20, 5))
    axes[0].set_title("Learning curve for {} Classifier".format(title))
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")
    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)
    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    axes[0].legend(loc="best")
    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")
    return plt
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
plot_learning_curve(MLPClassifier(), "Base MLP", X, y, axes=axes[:, 0], ylim=(0.05, 1.01),
cv=10, n_jobs=-1)
plot_learning_curve(best_mlp, "Best MLP", X, y, axes=axes[:, 1], ylim=(0.05, 1.01),
cv=10, n_jobs=-1)
plt.show()
We can see from the learning curve of the MLP classifier that as training examples increased, the base MLP classifier's accuracy remained relatively stagnant. The hyperparameter-tuned MLP classifier's accuracy, on the other hand, dipped around 500 training examples and converged with the cross-validation score at roughly 0.37. The scalability of the model did not improve with hyperparameter tuning; however, the fit time is still within a reasonable range.
In this tutorial we have:
- Collected daily price data and technical indicators (SMA, RSI, ADX) for SPY and the top 10 S&P 500 companies
- Explored and visualized daily returns and their yearly distributions
- Built a labeled feature set categorizing daily S&P 500 movement as rally, stressed, or neutral
- Compared K-Nearest Neighbors, Random Forest, and Multilayer Perceptron classifiers using 10-fold cross-validation
- Tuned hyperparameters with grid search and evaluated the best model with a confusion matrix and learning curves
As you can see, we have accomplished quite a lot. Our results show that we are correct more than a third of the time (39%, to be precise), which is only slightly better than guessing, but any improvement over chance is noteworthy. The stock market is a notoriously hard system to predict, and with large firms building, training, and inventing new algorithms every day, it gets continually harder to compete. As a result, we're classifying this as a win. We hope you learned something; feel free to reach out if you have any questions.