Introduction

For my current NEAT project, I read the paper A Comparative Study of Supervised Machine Learning Algorithms for Stock Market Trend Prediction to learn how others approached stock market prediction using supervised machine learning algorithms. This paper compared the accuracy between Support Vector Machine, Random Forest, K-Nearest Neighbor, Naive Bayes, and SoftMax algorithms. After reading about these different algorithms, I took it upon myself to learn more about the scikit-learn python library. I am learning how to use scikit-learn because it will give me the ability to implement some the algorithms outlined in the paper. The algorithms I have covered so far include:
• Random Forest
• K-Nearest Neighbors
• Support Vector Machine

Random Forest

The first algorithm I sought to better understand was the Random Forest algorithm. This algorithm is made up of multiple decision trees. A decision tree learns simple decision rules from the data and produces a output. Each tree is used on a different sub sample of the dataset and averaging is used to improve accuracy. The decision trees all “vote” by producing outputs, and whatever the most popular output is becomes the algorithms output.

I first imported my data as a Pandas data frame and split it up into a training section and a testing section. I then created a Random Forest algorithm using the scikit-learn random forest classifier. After running the algorithm a few times, I noticed that the accuracy was unusually high. After some investigation, I discovered that the algorithms were predicting the stock price movement for the current day instead of one day in the future. After fixing the problem, the accuracy on the testing dataset stayed around 50%. Here is the code that I wrote:

K-Nearest Neighbors

The next algorithm I studied was the K-Nearest Neighbors algorithm. The nearest neighbor method looks for the most similar trading days. The predefined number of the most similar trading days become the nearest neighbors. Just like the decision trees in the Random Forest algorithm, the nearest neighbors “vote”. Instead of using decision trees to produce an output, the K-Nearest Neighbor algorithm uses the actual expected output for the most similar trading days. The most popular output becomes of the neighbors becomes the algorithms output. If 5 of the nearest neighbors had the closing stock price go up and 3 neighbors had the stock price go down, the prediction will be that the stock price goes up.

While studying this algorithm, I learned that K-Nearest Neighbor algorithms performs better with a lower number of features/inputs. To reduce the size of my dataset, I first normalized it with the standard scaler tool from scikit-learn. I then used PCA to shrink the number of inputs in my dataset down to two while preserving the variance in the dataset. Here is my K-Nearest Neighbors code:

After performing principal component analysis on my dataset, I graphed the principal components to better understand the data. After creating the graph, it became clear why the accuracy of my K-Nearest Neighbor algorithm stayed around 50%. Both expected outputs (Stock price going up or down) are clustered together. Support Vector Machine

A Support Vector Machine is a machine learning tool that uses a hyperplane to define data points. Instead of looking at the nearest neighbors like a KNN, I like to think that Support Vector Machines split up data points into different neighborhoods. For example, if we use the two components from my PCA analysis, a SVM will create a one dimensional hyperplane splitting up the data into two “neighborhoods”. The hyperplane will create decision boundaries that maximize the margins from both expected outputs. The graph below shows the data points and the decision boundaries created by my SVM. Any data point that falls into the blue background gets categorized as a 0 (closing stock price goes down). To create this graph, I used a Support Vector Machine with a RBF kernel. Kernel functions allow SVMs to create hyperplanes in high dimensional data without having to calculate the coordinates of the data in that space. Here is my Support Vector Machine Code: