Using support vector machines with scikit-learn and Python. Examining training and cross validation errors.
What is the best way to learn something?
Try to explain it to others.
The process of “teaching” a concept forces us to deal with details that we would otherwise overlook, and to fill gaps in our knowledge and understanding that we did not even realize we had.
In this article, we will experiment with Support Vector Machines (SVMs), a very well-known supervised learning algorithm for classification and regression, in order to understand fundamental machine learning concepts.
The mathematics behind the algorithm is beyond the scope of this post. We will be using Python and scikit-learn (sklearn). You will find well-commented code for this post here.
Let us first create a dataset to play with. Sklearn has built-in functions that can create a wide range of sample datasets for classification and regression. Here, we will be using the moons dataset.
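As a minimal sketch of this step, the moons dataset can be generated with scikit-learn's `make_moons`. The `random_state` value is an assumption added for reproducibility, and the post's noise value of 0.1 is assumed to be passed directly as the `noise` argument, which scikit-learn documents as the standard deviation of the Gaussian noise.

```python
from sklearn.datasets import make_moons

# 1000 points forming two interleaving half-moons, one per class.
# Assumption: the noise level quoted in the post (0.1) is used as-is for
# `noise`; random_state=0 is an arbitrary choice to make the run repeatable.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

print(X.shape)  # two features per point
print(y[:10])   # class labels, 0 or 1
```

Here `X` holds the point coordinates and `y` the class labels.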
The following image represents our dataset:
These are 1000 points, split into two classes. The red points have label zero (0) and the green points have label one (1). We added very little noise while creating the above dataset (variance 0.1). As a result, the two classes are very well separated.
In the image above, we also see the result of running an SVM with a Gaussian (RBF) kernel over our dataset. We have used standard settings for our classifier: C = 1.0 and gamma (γ) = 1.0. Sometimes SVM algorithms are described in terms of the kernel width σ instead of γ. Writing the kernel as exp(-γ‖x - x'‖²) in one case and exp(-‖x - x'‖² / (2σ²)) in the other, the relationship between the two is γ = 1 / (2σ²).
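A classifier with these settings might be fitted as in the sketch below. It regenerates the dataset so that it runs on its own (with an assumed `random_state`), and it sets `gamma=1.0` explicitly, because recent scikit-learn versions use `gamma="scale"` when the argument is omitted.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Same illustrative dataset as in the post (random_state is an assumption).
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Gaussian (RBF) kernel with the settings quoted in the post.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)

# Mean accuracy on the data the model was trained on.
train_accuracy = clf.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
print(f"support vectors per class: {clf.n_support_}")
```

The accuracy printed here is measured on the training data itself, so it only confirms that the classifier can separate the points it has already seen.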
As noted above, our classifier has no trouble splitting the two sets, which gives it little educational value.
Nevertheless, we will add some value ourselves by examining the problem a little more.
First, let us figure out how many training samples we actually need to build a classifier for a dataset like the one above. How will we do this? We will repeat the classification process for larger and larger training sets, and plot the training error as well as the cross validation error.
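The procedure just described could be sketched as follows. The 70/30 split between training and cross validation data, the step of 50 samples, and the `random_state` values are all assumptions made for illustration, not the post's actual settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Hold out a fixed cross validation set (assumed: 30% of the data).
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train on the first m samples for growing m and record both errors.
train_sizes = list(range(50, len(X_train) + 1, 50))
train_errors, cv_errors = [], []
for m in train_sizes:
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
    clf.fit(X_train[:m], y_train[:m])
    # Error = fraction of misclassified samples = 1 - accuracy.
    train_errors.append(1.0 - clf.score(X_train[:m], y_train[:m]))
    cv_errors.append(1.0 - clf.score(X_cv, y_cv))

for m, tr, cv in zip(train_sizes, train_errors, cv_errors):
    print(f"{m:4d}  train={tr:.3f}  cv={cv:.3f}")
```

Plotting `train_errors` and `cv_errors` against `train_sizes` gives curves of the kind discussed in the rest of the post.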
Before we do that, let us briefly explain what the above errors mean. The training error and the cross validation error are just the algorithm's error, also called cost function or loss function, computed over the training set and the cross validation set respectively. We need the cross validation set because it does not make much sense to evaluate our model on the same set we trained it with: the algorithm will fit the training set well, and how well it fits it gives little feedback about how it will behave on new data. The training error is not useless, though. Combined with the cross validation error, it can give us insight into whether our model is overfitting or underfitting the data.
Even though this is not what we are examining yet, we will calculate both errors, since we are rookies and we will do everything by the book until we get comfortable.
Before closing the discussion of the various data sets used in machine learning, let me add a personal rule of thumb: apart from the training and cross validation sets, whenever you want to fine-tune a model parameter, you will need yet another set of unseen data, called the test set.
For example, let’s say that we tweak the parameter C above until, by examining the training error and cross validation error, we see that our model behaves well: it is neither overfitting nor underfitting the data, and it has a small cost (loss) value. Good job, but there is still the danger that the value we have come up with for C “overfits” the cross validation set. To guard against this, we need to evaluate our final model with yet another set of unseen data, the test set. If it does not behave well, we have to start from the beginning and possibly rethink the model we are using. If there are more parameters to experiment with, we should repeat the above process for them as well, until we get a good final evaluation. If this is not achievable, perhaps we need to choose a different machine learning algorithm or model, or even ponder whether our problem is learnable at all. That can happen as well.
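One way to sketch this workflow is shown below: C is chosen using the cross validation set only, and the test set is touched once at the very end. The 60/20/20 split, the candidate values for C, and the `random_state` values are hypothetical choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Assumed proportions: 60% training, 20% cross validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

# Tune C using the cross validation error only (candidates are arbitrary).
candidate_Cs = [0.01, 0.1, 1.0, 10.0, 100.0]
best_C, best_cv_error = None, float("inf")
for C in candidate_Cs:
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X_train, y_train)
    cv_error = 1.0 - clf.score(X_cv, y_cv)
    if cv_error < best_cv_error:
        best_C, best_cv_error = C, cv_error

# Final check on data that played no part in choosing C.
final_clf = SVC(kernel="rbf", C=best_C, gamma=1.0).fit(X_train, y_train)
test_error = 1.0 - final_clf.score(X_test, y_test)
print(f"best C={best_C}  cv error={best_cv_error:.3f}  test error={test_error:.3f}")
```

If the test error came out much worse than the cross validation error, that would suggest the chosen C was tuned too closely to the cross validation set.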
In order to calculate the error, we just compute the fraction of misclassified samples. We can use the score function of scikit-learn for this: it returns the mean accuracy, so the error is one minus the score.
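A small sketch of that calculation follows, computing the error once through `score` and once by counting misclassified samples directly. The dataset and `random_state` are again assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# score() returns the mean accuracy, so the error is its complement.
error_from_score = 1.0 - clf.score(X, y)

# The same quantity obtained by counting misclassified samples.
n_misclassified = int(np.sum(clf.predict(X) != y))
error_by_counting = n_misclassified / len(y)

print(error_from_score, error_by_counting)
```

Both values agree up to floating point rounding; the same two lines work for a cross validation or test set by passing that set instead of `X` and `y`.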
Here are the plots of the errors for the model above:
We see that the errors are both very low and converge very fast. The cross validation error shows that after about 100 samples, the algorithm has learned the problem quite well.
As we said above, our data have little noise and they are easily separable. So, let’s add a little more noise. This time, our data will have noise variance of 0.3.
Here is the data set and the classifier boundary regions:
This is a much harder set, because the two classes do not have distinct boundaries. Inevitably, the errors will be significantly higher than before. Here are the plots:
Again the error converges fast, because the datasets have a similar shape, but it converges to a higher value, because of their hazy boundaries. The error at best is around 0.1.
The above process gave us an idea about how many training samples we will need. It seems that after 450 samples the error converges to its lowest value.
High variance
Let’s add some real noise. This time we will give it variance 0.7. Here is the dataset:
The error plots are the following:
The error is higher, as expected. We also see another thing. When the number of training samples is low, lower than 200, the algorithm has a significantly higher cross validation error than training error. This means it fits the training data well but not the cross validation data. This is a phenomenon called high variance, or overfitting.
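The gap between the two errors can be measured directly, as in the rough sketch below. The helper `mean_error_gap` is hypothetical: it averages the gap over several random datasets so that a single lucky or unlucky draw matters less. The repeat count, the training sizes, and passing 0.7 directly as `noise` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mean_error_gap(n_train, n_repeats=20, noise=0.7):
    """Average (cross validation error - training error) over random datasets."""
    gaps = []
    for seed in range(n_repeats):
        X, y = make_moons(n_samples=1000, noise=noise, random_state=seed)
        # n_train samples for training, the rest for cross validation.
        X_train, X_cv, y_train, y_cv = train_test_split(
            X, y, train_size=n_train, random_state=seed, stratify=y
        )
        clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)
        train_error = 1.0 - clf.score(X_train, y_train)
        cv_error = 1.0 - clf.score(X_cv, y_cv)
        gaps.append(cv_error - train_error)
    return float(np.mean(gaps))

small_gap = mean_error_gap(n_train=50)
large_gap = mean_error_gap(n_train=700)
print(f"gap with  50 training samples: {small_gap:.3f}")
print(f"gap with 700 training samples: {large_gap:.3f}")
```

A clearly larger gap for the small training set would be consistent with the high variance behaviour described above.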
We will talk about high bias and high variance in future posts. It is one of the most critical machine learning concepts.
The code that I used for this post is here.
Thank you for reading.