The stock market is historically known as a near-impossible system to solve, and it continues to prove difficult as COVID-19 impacts the financial stability of the world, and more specifically the United States. A good broad measure of how the stock market is performing is the S&P 500, a stock market index that tracks the stock performance of 500 large companies listed on stock exchanges in the United States. Being able to predict fluctuations and movement of the S&P 500 on both a micro and macro level can generate massive profits for those who invest their money accordingly.
In this tutorial we will:
- Collect daily price data and technical indicators for SPY and the top 10 S&P 500 companies
- Explore and visualize daily returns and their distributions
- Build a labeled feature set categorizing daily S&P 500 movement as rally, stressed, or neutral
- Train, tune, and compare machine learning classifiers for predicting next-day movement
The data that we're going to need for this project is S&P 500 data (in this case we will be using the SPY ETF as a proxy) as well as data for the top 10 companies in the S&P 500. The top 10 companies as of December 9, 2020 are: AAPL, MSFT, AMZN, FB, GOOGL, GOOG, BRK.B, JNJ, JPM, and V. The first thing that we need to do is import some data. For this, we will be using the Alpha Vantage API. Documentation can be seen here: https://www.alphavantage.co/documentation/
import pandas as pd
import time
# This is a personal API key that is used exclusively through this project. Hidden for privacy reasons.
Key = "XXXXXXXXXXXXXX"
# This function returns the short and long SMA, RSI, and ADX technical indicators wrapped in a dataframe, along with the daily open, high, low, close, and volume.
def get_data(symbol="SPY", SMA_short_pd=12, SMA_long_pd=20, RSI_pd=10, ADX_pd=10):
    # Getting the short timeframe simple moving average
    output = pd.read_csv("https://www.alphavantage.co/query?function=SMA&symbol=" + symbol + "&interval=daily&time_period=" + str(SMA_short_pd) + "&series_type=open&datatype=csv&apikey=" + Key)
    output = output.rename(columns={'SMA': 'short_SMA'})
    # Getting the long timeframe simple moving average
    output['long_SMA'] = pd.read_csv("https://www.alphavantage.co/query?function=SMA&symbol=" + symbol + "&interval=daily&time_period=" + str(SMA_long_pd) + "&series_type=open&datatype=csv&apikey=" + Key)['SMA']
    # Getting the Relative Strength Index
    output['RSI'] = pd.read_csv("https://www.alphavantage.co/query?function=RSI&symbol=" + symbol + "&interval=daily&time_period=" + str(RSI_pd) + "&series_type=open&datatype=csv&apikey=" + Key)['RSI']
    # Getting the Average Directional Index
    output['ADX'] = pd.read_csv("https://www.alphavantage.co/query?function=ADX&symbol=" + symbol + "&interval=daily&time_period=" + str(ADX_pd) + "&datatype=csv&apikey=" + Key)['ADX']
    # Getting the daily price data and volume
    output[['open', 'high', 'low', 'close', 'volume']] = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=' + symbol + '&outputsize=full&apikey=' + Key + '&datatype=csv')[['open', 'high', 'low', 'close', 'volume']]
    return output.dropna()
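As a quick sanity check (a hedged example, assuming the API key above is valid and the Alpha Vantage rate limit has not been hit), we can pull a single symbol and inspect the first few rows:
# Pull one symbol to verify the function and the API key work as expected.
spy_check = get_data(symbol="SPY")
spy_check.head()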
# This function returns all of the data returned by get_data, for the top 10 S&P companies as well as SPY.
def get_top_data(SMA_short_pd=12, SMA_long_pd=20, RSI_pd=10, ADX_pd=10):
    symbol_list = ['SPY', 'AAPL', 'MSFT', 'AMZN', 'FB', 'GOOGL', 'GOOG', 'BRK.B', 'JNJ', 'JPM', 'V']
    output = {}
    for symbol in symbol_list:
        print(symbol)
        output[symbol] = get_data(symbol=symbol, SMA_short_pd=SMA_short_pd, SMA_long_pd=SMA_long_pd, RSI_pd=RSI_pd, ADX_pd=ADX_pd)
        # Sadly we have to sleep between symbols, otherwise the API starts rejecting our calls.
        # We can skip the sleep after the last symbol, 'V'.
        if symbol != 'V':
            time.sleep(60)
    return output
from os import path, mkdir
# One of the problems with the API that we're using is that it is pretty slow. Because of this, we're going to store our data in CSVs to prevent further calls to it.
if not (path.exists('stock_data') and path.isdir('stock_data')):
    mkdir('stock_data')
data_dict = get_top_data()
for symbol, data in data_dict.items():
    name = "stock_data/" + str(symbol) + "_data.csv"
    data.to_csv(path_or_buf=name)
import numpy as np
import glob
import pandas as pd
import matplotlib.pyplot as plt
# This table stores the closing prices for each of our tickers
close_price_table = pd.DataFrame()
for filepath in glob.iglob('stock_data/*.csv'):
    # Ignore the fundamental data
    if 'funda' in str(filepath):
        continue
    else:
        # Read and store the data for each ticker
        temp = pd.read_csv(filepath)
        # Extract the ticker symbol from the filename, e.g. 'stock_data/AAPL_data.csv' -> 'AAPL'
        start = 'stock_data/'
        end = '_data.csv'
        s = filepath
        company = s[s.find(start) + len(start):s.rfind(end)]
        close_price_table[company] = temp['close']
# Adding a date column to our table.
# Since all the API pulls happen at the same time, the data for each ticker has the same dates.
close_price_table['date'] = pd.read_csv('stock_data/AAPL_data.csv')['time']
# Reverse the rows so the table runs in chronological order
close_price_table = close_price_table.reindex(index=close_price_table.index[::-1]).reset_index()
close_price_table.drop('index', inplace=True, axis=1)
close_price_table.head()
On day $t$, the percent change is given by $\text{Percent Change}_{t} = \frac{P_{t} - P_{t-1}}{P_{t-1}}$, where $P_{i}$ is the closing price on day $i$.
pct_change = close_price_table.drop('date',axis=1).pct_change()
pct_change.plot(figsize=(12,8))
The cumulative sum at row $t$ is given by $C_{t} = \sum_{i=1}^{t} PC_{i}$, where $PC_{i}$ is the $i$'th day's percent change in daily returns.
cumulative = pct_change.cumsum().plot(figsize=(12,8))
We would like to predict future price fluctuations from an analysis of historical data. To understand how daily stock returns are distributed and behave, we would like to see a distribution plot of the price changes. This is done using a facet grid from seaborn, which gives us good intuition about the univariate distributions as they are plotted year by year.
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Extract the year from each date so we can facet the distributions by year
pct_change['year'] = pd.to_datetime(close_price_table['date']).dt.year
dfm = pct_change.melt('year')
g = sns.FacetGrid(dfm, col='year', col_wrap = 6)
g = (g.map(sns.distplot, 'value'))
plt.xlim(-0.5, 0.5)
plt.show()
We see from the above facet wrap of distributions that the daily return for each year centers around zero or just below it, and that the daily returns are approximately normally distributed. We can also see that the standard deviation of daily returns increases during financially turbulent years, such as 2008 during the housing crisis and 2020 during the COVID-19 pandemic.
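To put a rough number on that observation (a small additional check, not part of the original output), we can compute the standard deviation of SPY's daily returns by year; turbulent years such as 2008 and 2020 should stand out:
# Yearly standard deviation of SPY daily returns; larger values indicate more volatile years.
pct_change.groupby('year')['SPY'].std()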
Since we would like to see how fluctuations in the price data of these top 10 stocks impact the movement of the overall S&P 500, we will categorize each day's S&P 500 movement as rally, stressed, or neutral. If the fluctuations of the top 10 stocks' price data have a statistically significant relationship with S&P 500 movement, that could help us determine which stocks account for most of the movement. After ascertaining that there is a statistically significant correlation between the top 10 stocks and the S&P 500, we will build a feature set of technical indicators often used in the investment world: the short and long Simple Moving Averages, RSI (Relative Strength Index), and ADX (Average Directional Index) for each of our top 10 stocks, and use these to predict movement in the S&P 500. This feature set combines trend, momentum, and volume indicators, in keeping with a multiple-indicator strategy. Finally, we will compare the performance of several machine learning models on this prediction task, with cross-validation and hyperparameter tuning of the models.
You can read more about multi-indicator strategy here.
# This dataframe will store the timeseries of the indicators for each symbol.
feature_table = pd.DataFrame()
for filepath in glob.iglob('stock_data/*.csv'):
    # Ignore the fundamental data
    if 'funda' in str(filepath):
        continue
    else:
        # Read the features from each file into our feature_table dataframe
        temp = pd.read_csv(filepath)
        # Getting the company name from the filename
        start = 'stock_data/'
        end = '_data.csv'
        s = filepath
        company = s[s.find(start) + len(start):s.rfind(end)]
        feature_table["{}_short_SMA".format(company)] = temp['short_SMA']
        feature_table["{}_long_SMA".format(company)] = temp['long_SMA']
        feature_table["{}_RSI".format(company)] = temp['RSI']
        feature_table["{}_ADX".format(company)] = temp['ADX']
        feature_table["{}_volume".format(company)] = temp['volume']
# Adding a date column to our table.
# Since all the API pulls happen at the same time, the data for each ticker has the same dates.
feature_table['date'] = pd.read_csv('stock_data/AAPL_data.csv')['time']
feature_table = feature_table.reindex(index=feature_table.index[::-1]).reset_index()
feature_table.drop('index', inplace=True, axis=1)
# The raw CSV is stored newest-first, so reverse it to chronological order
# (matching the now-reversed feature_table) before computing SPY's daily returns.
feature_table['SPY'] = pd.read_csv('stock_data/SPY_data.csv')['close'][::-1].reset_index(drop=True).pct_change()
print("Shape of feature set: {}".format(feature_table.shape))
feature_table.head()
def labeler(x):
    # Here we use a daily positive change of .4% to specify rally.
    if x > 0.004:
        return "rally"
    # A change of -.4% specifies stressed.
    elif x < -0.004:
        return "stressed"
    # Otherwise, we will classify the market as neutral.
    else:
        return "neutral"

feature_table['label'] = feature_table['SPY'].apply(labeler)
# Drop SPY daily returns from feature set
feature_table.drop('SPY', axis=1, inplace=True)
feature_table['label'].value_counts()
We see that the label distributions for neutral, stressed, and rally are fairly well-balanced and reflect a relatively realistic distribution of market conditions.
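To see that balance as proportions rather than raw counts (a small extra check), we can normalize the counts:
# Express the label distribution as a fraction of all trading days.
feature_table['label'].value_counts(normalize=True)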
# Shift the label back by 1 day so that each row's features predict the next day's movement rather than same-day classification
feature_table['label'] = feature_table['label'].shift(-1)
# The shift leaves a NaN in the last row, and some indicators and tickers have different shapes, so we drop the NAs
feature_table = feature_table.dropna()
# Drop the label and date as they are not being used as features for our models.
X = feature_table.drop(['label','date'], axis=1)
y = feature_table['label']
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))
We have 55 indicator features (5 indicators for each of the 11 tickers, including SPY) to input into our classification models.
sns.countplot(feature_table['label'])
feature_list = ['short_SMA', 'long_SMA', 'RSI', 'ADX', 'volume']
for feat in feature_list:
    temp = X.filter(like=feat, axis=1)
    sns.heatmap(temp.corr(), annot=True, fmt='.1g')
    plt.title("Correlation heatmap of the {} indicator for each stock in the top 10 and SPY".format(feat))
    plt.show()
From the correlation heatmaps, we see that the technical indicators of the top 10 stocks, in general, have a strong positive correlation with the associated indicator of the SPY index. Most notably, the two moving averages and RSI show a very strong positive correlation compared to ADX and volume. However, since all of them demonstrate some explanatory power, we will keep all 5 indicator features for each stock as inputs to our machine learning classifiers.
We will be comparing the cross_val_score (mean of accuracies reported by 10-fold cross-validation) of 3 different classifiers: Random Forest, K-Nearest Neighbors, and Multilayer Perceptron in predicting 1-day future movement in the S&P500.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np
cross_val_score(estimator, X, y, cv=10) returns an array of 10 accuracy values (one per fold of a 10-fold cross-validation), which we then average into a single cross-validation score for each model.
classifier_names = ["Nearest Neighbors", "Random Forest", "Multilayer Perceptron"]
classifiers = [KNeighborsClassifier(), RandomForestClassifier(), MLPClassifier()]
print("--- Base Model Classifiers ---")
for name, clf in zip(classifier_names, classifiers):
    print("CV Score for {} Classifier: {}".format(name, np.mean(cross_val_score(clf, X, y, cv=10))))
We see that without any hyperparameter tuning, the Multilayer Perceptron classifier performs the best of the three, with K-Nearest Neighbors close behind and the Random Forest performing quite poorly. Even so, the Multilayer Perceptron and K-Nearest Neighbors classifiers only reached an accuracy just below random chance (1/3, or about 0.333) for predicting S&P 500 movement.
Hyperparameter tuning (or hyperparameter optimization) is the process of selecting optimal parameters for a machine learning estimator/model in order to gain additional predictive performance. A hyperparameter is a controlling factor in how a machine learning object is constructed, and it needs to be tuned well for the model to solve the problem optimally. The traditional way of performing hyperparameter optimization is grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. We implement a GridSearchCV (cross-validated grid search) object to optimally tune the hyperparameters of the K-Nearest Neighbors, Random Forest, and Multilayer Perceptron classifiers. For more information regarding hyperparameter tuning specifically in the realm of scikit-learn, read more here.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. For more information regarding the parameters and intuition behind the K-Nearest Neighbors Algorithm: K-Nearest Neighbors
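As a toy illustration of that majority-vote idea (a minimal sketch on made-up 2-D points, separate from the scikit-learn classifier we actually tune below):
from collections import Counter
# Toy training set: four 2-D points with made-up labels, purely for intuition.
train_pts = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
train_lbl = np.array(['neutral', 'rally', 'rally', 'stressed'])
def knn_vote(query, k=3):
    # Rank training points by Euclidean distance to the query...
    dists = np.linalg.norm(train_pts - query, axis=1)
    nearest = train_lbl[np.argsort(dists)[:k]]
    # ...and return the majority label among the k nearest neighbors.
    return Counter(nearest).most_common(1)[0][0]
print(knn_vote(np.array([1.0, 0.9])))  # -> 'rally'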
leaf_size = list(range(1,20))
n_neighbors = list(range(1,20))
p = [1, 2]
# Collect the hyperparameter lists into a grid dictionary
parameter_space = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
knn = KNeighborsClassifier()
# Perform 10-fold cross-validated grid search on KNN hyperparameters
knn_clf = GridSearchCV(knn, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_knn = knn_clf.fit(X,y).best_estimator_
print(best_knn)
Random Forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model's prediction. This makes it an "ensemble" of trees: a collection of trees working together to obtain better predictive performance than any single tree could. A large number of relatively uncorrelated trees operating as a committee will outperform any of the individual constituent trees, as the trees protect each other from their individual errors. For more information about the parameters and intuition behind the Random Forest algorithm: Random Forest
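To make the voting idea concrete, here is a minimal sketch of a hand-rolled ensemble: a few shallow decision trees, each fit on a bootstrap resample of our data, voting on the labels (purely illustrative; the tuned RandomForestClassifier below does this properly):
from sklearn.tree import DecisionTreeClassifier
# Fit a handful of shallow trees, each on a bootstrap resample of (X, y).
rng = np.random.RandomState(0)
trees = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X.iloc[idx], y.iloc[idx]))
# Each tree votes on every row; the ensemble prediction is the per-column majority vote.
votes = pd.DataFrame([tree.predict(X) for tree in trees])
print(votes.mode(axis=0).iloc[0].head())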
rfc = RandomForestClassifier()
parameter_space = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Perform 10-fold cross-validated grid search on random forest hyperparameters
rfc_clf = GridSearchCV(rfc, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_rfc = rfc_clf.fit(X, y).best_estimator_
print(best_rfc)
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f(\cdot): R^{m} \rightarrow R^{o}$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = x_{1}, x_{2}, x_{3}, \ldots$ and a target $y$, it can learn a non-linear function approximator for either classification or regression. MLP trains using backpropagation. More precisely, it trains using some form of gradient descent, with the gradients calculated via backpropagation. For classification, it minimizes the cross-entropy loss function, giving a vector of probability estimates $P(y|x)$ per sample $x$. For more information about the MLP Classifier and neural networks in general, see: Neural Networks and MLP Classifier
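For intuition on the cross-entropy loss the MLP minimizes, here is a small hand-computed example (illustrative probability values only): the loss for one sample is $-\log$ of the probability the model assigns to the true class, so confident correct predictions are penalized far less than uncertain ones.
# Cross-entropy for a single sample with three classes; the true class is index 0.
p_confident = np.array([0.8, 0.1, 0.1])
p_uncertain = np.array([0.34, 0.33, 0.33])
print(-np.log(p_confident[0]))  # ~0.22: low loss for a confident, correct prediction
print(-np.log(p_uncertain[0]))  # ~1.08: higher loss for an uncertain prediction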
mlpc = MLPClassifier()
parameter_space = {
    'max_iter': [1500, 2000, 2500],
    'hidden_layer_sizes': list(zip(np.arange(10, 60, 10), np.arange(10, 60, 10))),
    'activation': ['tanh', 'relu', 'logistic'],
    'solver': ['sgd', 'adam'],
    'alpha': 10.0 ** -np.arange(1, 10, 2),
    'learning_rate': ['constant', 'adaptive']
}
# Perform 10-fold CV Grid Search over MLP parameter space to tune parameters
mlp_clf = GridSearchCV(mlpc, parameter_space, n_jobs=-1, cv=10, verbose=1)
# Fit the grid search model
best_mlp = mlp_clf.fit(X, y).best_estimator_
print(best_mlp)
hyperparameter_tuned_names = ['Best K-Nearest Neighbors', 'Best Random Forest', 'Best Multilayer Perceptron']
hyperparameter_tuned_classifiers = [best_knn, best_rfc, best_mlp]
print('--- Hyperparameter Tuned Classifier Scores ---')
for name, clf in zip(hyperparameter_tuned_names, hyperparameter_tuned_classifiers):
    print("CV Score for {} Classifier: {}".format(name, np.mean(cross_val_score(clf, X, y, cv=10))))
We can see from the hyperparameter tuning results that the Multilayer Perceptron (neural network) is the best classifier of the three, with a cross-validation score of 0.39, slightly better than random guessing on the movement of the S&P 500.
A confusion matrix represents the combinations of predicted and actual values for a machine learning classification model. Graphically, it shows how often the model correctly and incorrectly predicts each label/category relative to the true value.
from sklearn.metrics import confusion_matrix
def plot_confusion_heatmap(y_true, y_pred):
    # confusion_matrix orders the labels alphabetically: neutral, rally, stressed
    cm = confusion_matrix(y_true, y_pred)
    labels = ['Neutral', 'Rally', 'Stressed']
    df_cm = pd.DataFrame(cm, index=labels, columns=labels)
    ax = sns.heatmap(df_cm, cmap="Blues", annot=True, fmt=".3g")
    ax.set(xlabel="Predicted Values", ylabel="Actual Values")
    return ax
y_true = y
y_pred = best_mlp.predict(X)
fig, axes = plt.subplots(1, 1, figsize=(8, 8))
plot_confusion_heatmap(y_true, y_pred)
plt.show()
From the confusion matrix, we see that the multilayer perceptron classifier weights its predictions heavily towards "neutral" for the S&P 500's 1-day future movement. This makes sense: "neutral" falls between "rally" and "stressed" numerically, so misclassifications of "rally" and "stressed" are far more likely to land in "neutral" than in the opposite extreme, since predicting "rally" for an actual "stressed" day (or vice versa) would require skipping over the entire "neutral" range. The classifier appears to have identified this, and when in doubt it makes the safe choice of "neutral". So even though the classes are balanced in our original data, we end up seeing an imbalanced set of predictions for the movement.
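One way to see this hedging behavior directly (an extra diagnostic, assuming the tuned MLP and the predictions from above) is to compare the model's maximum predicted class probability on rows it labels "neutral" against the rest:
# Maximum class probability per row; lower values mean a less confident prediction.
probs = best_mlp.predict_proba(X)
max_prob = probs.max(axis=1)
print("Mean confidence on 'neutral' predictions:", max_prob[y_pred == 'neutral'].mean())
print("Mean confidence on other predictions:", max_prob[y_pred != 'neutral'].mean())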
Learning curves demonstrate how a model's accuracy changes with the number of training samples it receives, and also display the scalability of the model by plotting its fit time as the training set grows.
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(1, 2, figsize=(20, 5))
    axes[0].set_title("Learning curve for {} Classifier".format(title))
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")
    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)
    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    axes[0].legend(loc="best")
    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")
    return plt
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
plot_learning_curve(MLPClassifier(), "Base MLP", X, y, axes=axes[:, 0], ylim=(0.05, 1.01),
cv=10, n_jobs=-1)
plot_learning_curve(best_mlp, "Best MLP", X, y, axes=axes[:, 1], ylim=(0.05, 1.01),
cv=10, n_jobs=-1)
plt.show()
We can see from the learning curve of the MLP classifier that as training examples increased, the base MLP classifier's accuracy remained relatively stagnant. The hyperparameter-tuned MLP classifier's accuracy, on the other hand, dipped around 500 training examples and converged with the cross-validation score at roughly 0.37. The scalability of the model did not improve with hyperparameter tuning; however, the fit time is still within a reasonable range.
In this tutorial we have:
- Collected daily price data and technical indicators (SMA, RSI, ADX) for SPY and the top 10 S&P 500 companies
- Explored and visualized daily returns and their yearly distributions
- Built a labeled feature set categorizing daily S&P 500 movement as rally, stressed, or neutral
- Compared K-Nearest Neighbors, Random Forest, and Multilayer Perceptron classifiers using 10-fold cross-validation
- Tuned hyperparameters with grid search and evaluated the best model with a confusion matrix and learning curves
As you can see, we have accomplished quite a lot. Our results show that we are correct more than a third of the time (39%, to be precise), which is only slightly better than guessing, but any improvement over chance is noteworthy. The stock market is a notoriously hard system to predict, and with large firms building, training, and inventing new algorithms every day, it gets continually harder to compete. As a result, we're classifying this as a win. We hope you learned something; feel free to reach out if you have any questions.