Machine Learning and Finance

We have implemented a Support Vector Machine (SVM) to predict whether a stock will outperform the market based on its current key statistics and historical financial data. An SVM is a supervised machine learning algorithm that can be used for classification problems. Here we have a set of historical financial data, about 9000 records, where every record has been labelled as outperforming or underperforming the market. This historical data is used to train the SVM: each record is plotted as a point in n-dimensional space, where n is the number of features. Here n is 35, the number of key statistics. The SVM then performs the classification by finding the hyperplane that best separates the two classes, that is, the stocks that outperformed from the stocks that underperformed. Once we have trained the SVM with the historical data we can provide it with current, unclassified data and ask it to predict whether or not a particular stock will outperform the market based on its current key statistics.

To make that a little more intuitive: where the number of features (n) is 2, a line can be drawn between the classes to classify the data, and where n is 3, a plane can be used to separate the classes. When n is greater than 3, a hyperplane is used to separate the classes. Trying to visualise this hyperplane in higher dimensions is not possible for us.
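As a rough sketch of the idea, the same decision rule works for any n: a point is classified by the sign of w·x + b, where the weights w and the bias b describe the hyperplane. The function name, the weights and the points below are made up for illustration and are not taken from the trained model.

```python
def side_of_hyperplane(weights, bias, point):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point lies."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


# n = 2: the boundary x1 + x2 - 1 = 0 is a line.
line_above = side_of_hyperplane([1.0, 1.0], -1.0, [2.0, 2.0])
line_below = side_of_hyperplane([1.0, 1.0], -1.0, [0.0, 0.0])

# n = 3: the boundary x1 - x3 = 0 is a plane.
plane_side = side_of_hyperplane([1.0, 0.0, -1.0], 0.0, [3.0, 5.0, 1.0])

# n = 35: exactly the same formula, even though it can no longer be drawn.
weights_35 = [1.0] * 35
high_side = side_of_hyperplane(weights_35, -10.0, [0.5] * 35)
low_side = side_of_hyperplane(weights_35, -10.0, [0.1] * 35)

print(line_above, line_below, plane_side, high_side, low_side)
```

In this sketch +1 and -1 simply stand for the two classes, for example outperformed and underperformed.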

Historical Data

The table below is an extract of the historical data. The full CSV can be downloaded here: Download the CSV

#  Ticker  DE Ratio  Trailing P/E  Price/Sales  Price/Book  Profit Margin  Operating Margin  Return on Assets  Return on Equity  Revenue Per Share    Market Cap
0     eqt     31.14           NaN         6.54        1.82          23.46              6.53              0.48              1.00              10.41  1.237000e+10
1    adbe     25.62         49.64         9.55        7.53          19.96             25.49              7.63             16.20              11.75  5.404000e+10
2    flws     38.71         24.93         0.52        2.15           2.27              3.88              4.60              9.90              18.27  6.161600e+08
3     adm     40.20         16.63         0.41        1.45           2.53              2.39              2.30              8.90             105.00  3.143000e+10
4       s    196.43           NaN         1.03        1.79           4.49              6.30              1.61              7.51               8.27  6.483000e+10
5     aiz     25.87         10.51         0.67        1.25           7.23             11.43              1.95             13.06             130.53  5.240000e+09
6     hon     80.69         19.22         2.29        4.60          12.24             17.00              8.07             25.33              51.42  9.677000e+10
7     vtr    104.38         37.34         6.41        2.08          16.61             27.69              2.58              5.53              10.05  3.299000e+10
8     pfg     30.58         13.93         1.34        1.62          10.63             15.06              0.52             13.57              42.78  1.574000e+10
9     fti     48.02          7.23         0.38        0.76           5.38             10.30              4.10             11.36              87.80  2.030000e+09

The historical data used to train the SVM consists of records of S&P 500 companies dating back to about 1998. It's pretty big at about 9000 rows by 35 columns. The critical part is that we know whether the stock was outperforming or underperforming for each record. Before we get into an actual analysis and prediction based on current data, we can split the historical data into training and test sets to help fine-tune the parameters for the SVM. In the Tuning section below we split off 1000 records from the historical data to use as a test set, with the remainder used as a training set. We then give the SVM the training set, it builds its model, and we ask it to make predictions on the test set. You're probably thinking, but hang on, we already know whether a stock in the historical data outperformed or underperformed. You are correct, but that label is not provided to the SVM when it makes a prediction. Instead it is used to validate the accuracy of the predictions, and therefore the parameters we have used.
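The split-and-validate step might look something like the sketch below. It assumes scikit-learn's SVC, train_test_split and accuracy_score, and it uses randomly generated data as a stand-in for the real historical records. The kernel and C value are illustrative, not the tuned ones.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the historical data: 9000 records, 35 features,
# label 1 = outperformed, 0 = underperformed. This is NOT the real data set.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=9000)
X = rng.normal(size=(9000, 35)) + 0.5 * y[:, None]

# Hold out 1000 records as the test set; the remainder is the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0
)

# Illustrative parameter values only.
model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# The test labels are never passed to predict(); they are only compared
# with the predictions afterwards to measure accuracy.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"test records: {len(X_test)}, accuracy: {accuracy:.3f}")
```

Changing the kernel or C and re-running the last few lines is the manual equivalent of adjusting the parameters and checking how the predictions hold up on the held-out records.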


Tuning

All the processing is done on a virtual server in the cloud with a free account, so if there are a few people online it could be slow.

You can check out the SVM documentation at scikit-learn, or just jump in and set the parameters for the SVM below.

If you can get a return above 35% you are doing pretty well.

Decimal value between 0.01 and 10.0

Decimal value between 0.01 and 100.0, leave set to 1.0 for Linear Kernel


Now that you have figured out the parameters that give the highest return, leave them set in the Tuning section above. These will be used to train the SVM with the entire historical data set. Then a set of current data, pulled from Yahoo Finance on Mon Feb 6 03:20:25 2017 UTC, will be run against the model. To keep the list of stocks somewhere handy, the Share on Facebook button will post them to your timeline.
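A minimal sketch of this final step, again assuming scikit-learn's SVC, could look like the following. The data, the ticker names and the parameter values are all made up for illustration; the real run uses the full historical records and the Yahoo Finance snapshot.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic stand-in for the entire labelled historical data set
# (kept small here so the sketch runs quickly).
y_hist = rng.integers(0, 2, size=2000)
X_hist = rng.normal(size=(2000, 35)) + 0.5 * y_hist[:, None]

# Synthetic stand-in for current, unlabelled records; tickers are invented.
current_tickers = ["aaa", "bbb", "ccc", "ddd"]
X_current = rng.normal(size=(len(current_tickers), 35)) + 0.25

# Parameters would be whatever worked best during tuning; these are placeholders.
model = SVC(kernel="rbf", C=1.0, gamma=0.01)
model.fit(X_hist, y_hist)  # no hold-out this time: train on everything

# Keep only the tickers predicted to outperform (label 1).
predicted = model.predict(X_current)
outperformers = [t for t, label in zip(current_tickers, predicted) if label == 1]
print(outperformers)
```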

You can download the current data that was used here: Download the CSV