Support Vector Machines
Support vector machines are a supervised learning method used to perform binary classification on data. They are motivated by the principle of optimal separation, the idea that a good classifier finds the largest gap possible between data points of different classes.
For example, an algorithm learning to separate the United States from Europe on a map could correctly learn a boundary 100 miles off the eastern shore of the United States, but a much better boundary would be the one running down the middle of the Atlantic Ocean. Intuitively, this is because the latter boundary maximizes the distance to both the United States and Europe.
Background
The original support vector machines (SVMs) were invented by Vladimir Vapnik and Alexey Chervonenkis in 1963. They were designed to address a longstanding problem with logistic regression, another machine learning technique used to classify data.
Logistic regression is a probabilistic binary linear classifier, meaning it calculates the probability that a data point belongs to one of two classes. Logistic regression fits its parameters by maximizing the likelihood of the known data points' class labels under the model, and so it may place the classification boundary arbitrarily close to a particular data point. This violates the commonsense notion that a good classifier should not place a boundary near a known data point, since data points that are close to each other should be of the same class.
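As a brief sketch of what 'probabilistic' means here (assuming scikit-learn is available; the two-cluster dataset is made up for illustration), logistic regression reports a probability for each class rather than a hard assignment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical linearly separable data: class 0 on the left, class 1 on the right.
X = np.array([[0.0, 0.2], [1.0, 0.5], [1.5, 0.1],
              [4.0, 0.3], [5.0, 0.8], [6.0, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# A probabilistic classifier: each prediction comes with P(class 0) and P(class 1).
print(model.predict_proba([[2.0, 0.4]]))  # probabilities for each class, not a hard label
print(model.predict([[2.0, 0.4]]))        # hard label obtained by thresholding at 0.5
```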
Maximum Margin Classifiers
Support vector machines, on the other hand, are non-probabilistic, so they assign a data point to a class with 100% certainty (though a bad SVM may still assign a data point to the wrong class). This means that two SVMs giving the same class assignment to a set of data points have the same classification accuracy. How then should we determine which of the two is better?
The answer lies in the size of the gap between data points of different classes. If two SVMs assign the same classes to a set of data points, we would like to choose the model whose closest data point is furthest away from its classification boundary. Ideally, the classification boundary will be a curve that goes right down the middle of the gap between the classes, because this would be the classification boundary with the largest distance to the closest data point.
In the case of two linearly separable classes in the plane, this boundary would be a line that perpendicularly bisects the segment connecting the two closest data points from different classes. Passing through the midpoint of that segment, at a right angle to it, maximizes the distance to each of the two data points. In more than two dimensions, this boundary is known as a hyperplane. This reasoning, which says that the best linear classifier is the one that maximizes the distance from the boundary to the data points of each class, gives what are known as maximum margin classifiers. The classification boundary of a maximum margin classifier is known as a maximum margin hyperplane. Support vector machines are one example of maximum margin classifiers.
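As a small sketch of this geometric picture (plain NumPy; the two points are hypothetical), the maximum margin boundary for a single pair of closest points is the perpendicular bisector of the segment joining them:

```python
import numpy as np

# Hypothetical closest pair: one point from each class.
x_neg = np.array([1.0, 1.0])   # closest point from class 0
x_pos = np.array([3.0, 2.0])   # closest point from class 1

# The maximum margin boundary w.x + b = 0 is perpendicular to the segment
# joining the two points and passes through its midpoint.
w = x_pos - x_neg
midpoint = (x_pos + x_neg) / 2
b = -w @ midpoint

# Both points end up equidistant from the boundary; that distance is the margin.
dist = lambda x: abs(w @ x + b) / np.linalg.norm(w)
print(dist(x_neg), dist(x_pos))  # equal distances
```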
Definition
The distance from the SVM's classification boundary to the nearest data point is known as the margin. The data points from each class that lie closest to the classification boundary are known as support vectors. If an SVM is given a data point closer to the classification boundary than the support vectors, the SVM declares that data point to be too close for accurate classification. This defines a 'no-man's land' for all points within the margin of the classification boundary. Since the support vectors are the data points closest to this 'no-man's land' without being in it, intuitively they are also the points most likely to be misclassified.
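As an illustration (a minimal sketch assuming scikit-learn; the blob dataset is synthetic), a fitted linear SVC exposes exactly these closest points as its support vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Synthetic, linearly separable data for illustration.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=0)

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # a large C approximates a hard margin

# The support vectors are the training points lying on the margin boundaries.
print(svm.support_vectors_)                 # coordinates of the support vectors
print(svm.support_)                         # their indices in the training set
print(1.0 / np.linalg.norm(svm.coef_[0]))   # size of the margin, 1/||w||
```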
Thus, SVMs can be defined as linear classifiers under the following two assumptions:
- The margin should be as large as possible.
- The support vectors are the most useful data points because they are the ones most likely to be incorrectly classified.
The second assumption leads to a desirable property of SVMs. After training, the SVM can throw away all other data points and perform classification using the support vectors alone. This means that once training is done, an SVM can predict a data point's class very efficiently, since it only needs a handful of support vectors instead of the entire dataset. The primary goal of training an SVM is therefore to find the support vectors, which both separate the data and determine the maximum margin between the classes.
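A brief sketch of this property, again assuming scikit-learn and a linear kernel: after training, the decision function can be reproduced from the support vectors, their dual coefficients, and the intercept alone, with the rest of the dataset discarded.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, random_state=1)  # synthetic data
svm = SVC(kernel="linear").fit(X, y)

# Everything needed for prediction: a handful of support vectors, not all 200 points.
sv, alpha, b = svm.support_vectors_, svm.dual_coef_, svm.intercept_

x_new = np.array([[0.0, 0.0]])
decision = (alpha @ (sv @ x_new.T)) + b      # sum_i alpha_i (sv_i . x_new) + b
print(np.sign(decision), svm.decision_function(x_new))  # same sign, same value
```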
Hard-Margin SVMs
Training an SVM is easiest to visualize in two dimensions when the classes are linearly separable. Suppose we are given a dataset \((\vec{x}_1, y_1), (\vec{x}_2, y_2), ..., (\vec{x}_m, y_m),\) where \(y_i=-1\) for inputs \(\vec{x}_i\) in class 0 and \(y_i=1\) for inputs \(\vec{x}_i\) in class 1. Recalling the vector equation for a line in two dimensions, the classification boundary is defined as \(\vec{w} \cdot \vec{x}+b=0\), where \(\vec{w}\) and \(\vec{x}\) are two-dimensional vectors. Furthermore, define the negative support vector to be the input vector \(\vec{x_n}\) from class 0 closest to the boundary and the positive support vector to be the input vector \(\vec{x_p}\) from class 1 closest to the boundary.
Again recalling the vector equation for a line, require the support vectors to satisfy \(\vec{w} \cdot \vec{x_n}+b=-1\) and \(\vec{w} \cdot \vec{x_p}+b=1\), so that the negative classification boundary is the line \(\vec{w} \cdot \vec{x}+b=-1\) and the positive classification boundary is the line \(\vec{w} \cdot \vec{x}+b=1\). Then, the distance between the negative and positive classification boundaries is \(\frac{2}{||\vec{w}||}\). Thus, the size of the margin \(M\), the distance from the classification boundary to the nearest data point, is \(\frac{1}{||\vec{w}||}\).
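To see where the \(\frac{2}{||\vec{w}||}\) comes from, subtract the two support vector conditions and project the result onto the unit normal of the boundary:
\[
\vec{w} \cdot (\vec{x_p} - \vec{x_n}) = (1 - b) - (-1 - b) = 2, \qquad \text{so} \qquad \frac{\vec{w}}{||\vec{w}||} \cdot (\vec{x_p} - \vec{x_n}) = \frac{2}{||\vec{w}||},
\]
which is exactly the separation between the two boundary lines measured perpendicular to them.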
Since we want to maximize \(M\), we have to minimize \(||\vec{w}||\). However, we also want to make sure that no points fall in the 'no-man's land', so we also need to introduce the constraints that \(\vec{w} \cdot \vec{x_i} + b \le -1\) for all \(\vec{x_i}\) in class 0 and \(\vec{w} \cdot \vec{x_i} + b \ge 1\) for all \(\vec{x_i}\) in class 1. This leads to the following optimization problem:
Minimize \(||\vec{w}||\) subject to \(y_i(\vec{w} \cdot \vec{x_i} + b) \ge 1\) for \(i=1, ..., m.\)
This optimization problem can be solved using quadratic programming techniques.
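As a minimal sketch of solving this quadratic program directly (assuming the cvxpy library is available; the four-point dataset below is made up for illustration), it is common to minimize \(||\vec{w}||^2\) instead of \(||\vec{w}||\), which has the same minimizer and is easier for solvers:

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable dataset with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize ||w||^2 subject to y_i (w . x_i + b) >= 1 for every training point.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))  # 1 / ||w||
```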