In this post we'll use a support vector machine (SVM) to learn a classifier for the well-known iris dataset. Then we'll show how to do some model selection, varying a parameter in the SVM and comparing the models we get to try to get the "best" SVM we can.
First, we'll turn on graphs in the notebook, and import a few modules we'll be using.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Next we'll load in the data so we can start working on it.
Just to check, we'll look at the first few lines of the file to see what general format it is in.
f = open("../iris/iris.data")
# peek at the first five raw lines of the file
for k in xrange(5):
    print f.readline()
It looks like a comma-separated file without any column labels, so we'll read it in with pd.read_csv(). We can specify that there is no header information in the file (i.e. no column names in the first line of the file) with the header=None argument. Afterwards, we'll check that the data loaded correctly with the .head() function.
data = pd.read_csv("../iris/iris.data", header=None)
data.head()
It looks like it's loaded correctly.
To quickly describe the data, the features are in the first 4 columns, and the correct label is in the fifth column (numbered 4 above). What the four features represent is irrelevant for our discussion. We'd like to use this data to build a model that predicts the label given the features.
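As a quick sanity check, we can look at how the labels are distributed (the iris data contains three classes, evenly represented):
# count how many samples carry each label (column 4)
print data[4].value_counts()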
The data is already laid out nicely for us, but it still needs a bit of preparation.
Since we want to train a model to predict the label, we should separate the label from the rest of the features. We'll remove the label column (numbered 4 above) and put it in its own vector y, leaving the features in a matrix X.
In a later step, we're going to randomly split the data into three sets. We'll talk about this in a moment, but we should randomly reorder the samples now, before we split the data into X and y, so that their orders are consistent.
# randomly permuting the rows of data
data = data.iloc[np.random.permutation(len(data))].reset_index(drop=True)
# separating off the label column
X, y = data.iloc[:,0:4], data.iloc[:,4]
Another thing we can do before we split the data into different sets is to normalize the features. We will normalize the features by subtracting the mean and dividing by the range of data we see within each given feature. In our data, that corresponds to subtracting the mean of each column from that column, and dividing each column by the range of values within that column.
We want to normalize our features because the optimization function for SVMs involves the distances of samples to other objects. The usual distance function is sensitive to the scales of the features: if one feature had a much larger range of values than the others, the distance function would effectively be dominated by that "wider" feature. For example, $d\big((100, 0.01), (200, 0.05)\big) = \sqrt{100^2 + 0.04^2} \approx 100 = d(100, 200)$, so the narrow second coordinate barely matters. Consequently, the optimization would be inappropriately sensitive to the one wide-range feature. We could also account for this issue by using a different distance function, but normalizing the data accomplishes the same thing and is simpler to describe.
X = (X - X.mean()) / (X.max() - X.min())
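As a quick sanity check, each normalized column should now have mean zero and a range of exactly one, directly from the formula above:
# every feature should now have mean ~0 and max - min == 1
print X.mean()
print X.max() - X.min()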
For the purpose of building and testing our model, we will next separate the data into training, validation, and test data. We've chosen to split it with 60% of the data going to training, and 20% going to each of the other two subsets.
# compute the cutoff indices
n = len(X)
n_val, n_test = int(n*0.6), int(n*0.8)
# split the data
X_train, y_train = X.iloc[0:n_val, :], y.iloc[0:n_val]
X_val, y_val = X.iloc[n_val:n_test, :], y.iloc[n_val:n_test]
X_test, y_test = X.iloc[n_test:n, :], y.iloc[n_test:n]
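A quick check that the three pieces partition the data as intended; with the 150-sample iris set this gives a 90/30/30 split:
# the subset sizes should add up to the full n samples
print len(X_train), len(X_val), len(X_test)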
Now, we'll import the (linear) support vector classifier from scikit-learn. With its default settings, LinearSVC trains a multi-class SVM using a one-vs-rest (also called one-vs-all) approach: for each of the 3 labels in our data, one SVM is trained to separate that label from the other two, yielding three boundary hyperplanes. To make a prediction, a sample is scored against each boundary, and the label whose boundary gives the highest score (roughly, the signed distance to that hyperplane) wins. Note that this is all done for us within LinearSVC.
We'll use this module to fit a model to our training data, and check the accuracy against our test set.
from sklearn.svm import LinearSVC
model = LinearSVC().fit(X_train, y_train)
score_train = model.score(X_train, y_train)
score_test = model.score(X_test,y_test)
print "Score on training data:\t%0.3f" % score_train
print "Score on test data:\t%0.3f" % score_test
The data set is pretty small, especially after we've split it into three pieces, but the model does alright.
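Before moving on, we can peek at the one-vs-rest machinery described above. The decision_function method exposes each sample's score against each of the three boundaries, and predict() simply takes the class with the largest score. A small illustration on a few test samples (just a sketch to make the idea concrete):
# one row per sample, one column per class: each entry is the sample's
# score against that class's boundary hyperplane
scores = model.decision_function(X_test.iloc[0:3, :])
print scores
# predict() agrees with taking the argmax over the per-class scores
print model.classes_[np.argmax(scores, axis=1)]
print model.predict(X_test.iloc[0:3, :])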
Next, we'll talk about a general strategy for trying to choose a better model, and apply it in this setting. The small data set makes this a bit silly, so we just consider this for discussion's sake.
In general, we are interested in models that perform well when exposed to new data (i.e. data outside of X_train), not just models which accurately label the samples in X_train. We'll say a model is "good" if its predictions are accurate on new data. A natural thing to do, then, is to train a variety of models on X_train and select the one that predicts most accurately on some new data. This is why we've bothered to create the set X_val. We can train our models on X_train, compare them against one another on X_val to select a winner, and then measure the performance of the winner on X_test.
We need to keep moving to new data sets for comparison and testing to avoid biasing our results. If we compared models trained on X_train according to their performance on X_train, we might just end up selecting the most overfit model, without regard for its ability to generalize to new data. If we measured the performance of the winning model on X_val, the very data used to select it, we might overestimate its general performance: X_val might just be a set where that model coincidentally does well, as we might suspect since it "won" there.
The SVM has a parameter C that adjusts the tradeoff in the optimization between trying to exactly separate the classes (i.e. aiming for few misclassifications, possibly by choosing a boundary very close to several samples) and trying to find a boundary with lots of space around it (i.e. aiming for a large margin, possibly by allowing several samples to be misclassified).
Next, we'll vary the regularization parameter C to see whether we can find a better model. We'll carry out our selection by looping over several options for C, recording the training and validation scores along the way.
cs = [0.01,0.03,0.1,0.3,1,3,10,30,100]
print "| %-7s| %-14s| %-12s|" % ("C","X_train score","X_val score")
print "="*42
for c in cs:
    model = LinearSVC(C=c).fit(X_train, y_train)
    score_train = model.score(X_train, y_train)
    score_val = model.score(X_val, y_val)
    print "|%7.2f |%14.3f |%12.3f |" % (c, score_train, score_val)
We've found several options for C that seem to tie for best (another artifact of our small data set). We could have automated our selection of the best C by recording the C with the largest X_val score along the way. Here, we will just observe the results and choose C=10 for demonstration purposes.
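For completeness, that automated bookkeeping might look like the sketch below (the variable names are our own, and we still proceed with the manual choice of C=10 afterwards):
# track the C with the best validation score as we sweep
best_score, chosen_C = -1.0, None
for c in cs:
    score_val = LinearSVC(C=c).fit(X_train, y_train).score(X_val, y_val)
    if score_val > best_score:
        best_score, chosen_C = score_val, c
print "Best C found: %s (validation score %0.3f)" % (chosen_C, best_score)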
best_C = 10.0
best_model = LinearSVC(C=best_C).fit(X_train, y_train)
Now we'll see how the model with this C performs on the test set.
score = best_model.score(X_test, y_test)
print "Score on test set: %0.3f" % (score)
We haven't measured any improvement on the test set by adjusting C. However, we should note again that because of the small amount of data available, these scores are influenced quite a lot by the particular random split we made into training, validation, and test data. In general, we should expect choosing an optimal C to improve performance.
Now we'll talk a bit about other ways we could improve our model.
First, we'll plot a learning curve: a graph showing how the training and test scores of our model vary with the number of training samples. Such a graph helps us judge whether training on more data would be likely to improve our model, and whether the model seems to be over- or under-fitting the training data.
Scikit-learn has a built-in function that we will use for this. The learning_curve function performs k-fold cross-validation to make fuller use of the data, so we will pass it cv=5. We won't discuss cross-validation in depth; it just helps account for the possibility that one particular random split into training and test data is unlucky.
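To make the cross-validation idea a bit more concrete, here is roughly what a standalone 5-fold cross-validated score for a single model looks like, using cross_val_score from the same module family (a sketch for illustration; we don't use it below):
from sklearn.cross_validation import cross_val_score
# fit and score the model on 5 different train/test splits of the full data
cv_scores = cross_val_score(LinearSVC(C=best_C), X, y, cv=5)
print "CV scores:", cv_scores
print "Mean CV score: %0.3f" % cv_scores.mean()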
We'll import the function and apply it to LinearSVC with the best_C value we found above.
from sklearn.learning_curve import learning_curve
# train the model on a range of sample sizes from 10 to 110, in steps of 10 samples
train_sizes, train_scores, test_scores = learning_curve(LinearSVC(C=best_C), X, y, train_sizes=range(10,120,10), cv=5)
Now we'll plot the learning curve. This snippet can essentially be found in the Scikit-learn docs for learning_curve. Remember that learning_curve performs some cross-validation, and as a result will compute several training and test scores, one for each (cross-validation) choice of separation into training and test sets. The snippet below nicely plots the means and standard deviations of these scores for each sample size.
plt.figure()
plt.title("Learning Curve")
plt.xlabel("Number of training examples")
plt.ylabel("Score")
# compute the means and standard deviations for each sample size
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
# use the standard deviations to graph colored bands
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
# plot the means
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")
plt.grid()
plt.legend(loc="best");
The positions of and gap between the two curves suggest that, as the number of training examples increases, the scores converge to somewhere around 0.95, with the training score still slightly outperforming the cross-validation score. More training examples might take advantage of this trend and yield better test scores.
However, both scores would still be around 0.95, and ideally we'd like test scores closer to 1. Our remaining options involve increasing the model's capacity to fit the data: for example, we could add features or use a more flexible variant of the SVM. Exploring the data (which we have not done here) shows that while one of the classes is well separated from the other two, the remaining two overlap quite a bit. One idea would be to adapt our SVM so that the regularization parameter can be tuned independently for the different separation tasks, or to switch to a multi-class approach other than one-vs-rest; a couple of these directions are sketched below. Adding this kind of flexibility might help us reach higher test scores, but the increased complexity of our models would likely demand more training data as well.
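As a rough sketch of those last two ideas (shown for illustration only; we don't tune or evaluate these carefully here), LinearSVC exposes an alternative multi-class formulation via its multi_class parameter, and the kernelized SVC both uses a different multi-class scheme (one-vs-one) and can fit curved boundaries:
from sklearn.svm import SVC
# Crammer-Singer optimizes all three boundaries jointly instead of one-vs-rest
model_cs = LinearSVC(C=best_C, multi_class='crammer_singer').fit(X_train, y_train)
# an RBF kernel allows nonlinear boundaries between the overlapping classes
model_rbf = SVC(C=best_C, kernel='rbf').fit(X_train, y_train)
print "Crammer-Singer validation score: %0.3f" % model_cs.score(X_val, y_val)
print "RBF-kernel validation score: %0.3f" % model_rbf.score(X_val, y_val)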