Day32 ML Review - Support Vector Machine (1)
Basic Concepts and Mathematical Formulations
Another powerful and widely used learning algorithm is the support vector machine (SVM), which can be considered an extension of the perceptron.
Basic Concepts
Using a support vector machine, we aim to maximize the margin, the distance between the separating hyperplane (decision boundary) and the training examples closest to this hyperplane, known as support vectors.
1. Hyperplane and Decision Boundary
- In a binary classification problem, SVM aims to find the optimal hyperplane that separates the data points of different classes.
- A hyperplane in an $n$-dimensional space is a flat affine subspace of dimension $n-1$. (For a 2D space, a hyperplane is a line; for a 3D space, it is a plane.)
2. Support Vectors
- Support vectors are the data points closest to the hyperplane; they influence its position and orientation and lie on the edges of the margin.
- The optimal hyperplane maximizes the margin, which is the distance between it and the nearest support vectors from both classes.
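To make these ideas concrete, here is a minimal sketch (assuming scikit-learn; the toy dataset from `make_blobs` and the very large `C`, used to approximate a hard margin, are illustrative choices) that fits a linear SVM and reads off the support vectors and the margin width:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters; a very large C approximates a hard-margin SVM.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
w0 = clf.intercept_[0]  # bias term
margin = 2 / np.linalg.norm(w)  # margin width, 2 / ||w||

print("Hyperplane: w =", w, " w0 =", w0)
print("Support vectors (points on the margin):\n", clf.support_vectors_)
print("Margin width:", margin)
```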
Mathematical Formulation - Maximum Margin Intuition
The rationale for having decision boundaries with large margins is that they are likely to have lower generalization errors, while models with small margins are more prone to overfitting.
Consider the positive and negative hyperplanes that are parallel to the decision boundary; they can be expressed as follows:
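Writing $w_0$ for the bias term and $\mathbf{x}_{\text{pos}}$, $\mathbf{x}_{\text{neg}}$ for points lying on the positive and negative hyperplane respectively (the standard notation for this derivation), the two hyperplanes are:

$$w_0 + \mathbf{w}^T\mathbf{x}_{\text{pos}} = 1 \quad (1)$$

$$w_0 + \mathbf{w}^T\mathbf{x}_{\text{neg}} = -1 \quad (2)$$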
If we subtract equation (2) from equation (1), we get:
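$$\mathbf{w}^T\left(\mathbf{x}_{\text{pos}} - \mathbf{x}_{\text{neg}}\right) = 2$$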
We can normalize this equation by the length of the vector $\mathbf{w}$, which is defined as follows:
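$$\lVert \mathbf{w} \rVert = \sqrt{\sum_{j=1}^{m} w_j^2},$$

where $m$ is the number of features.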
So we arrive at the following equation:
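$$\frac{\mathbf{w}^T\left(\mathbf{x}_{\text{pos}} - \mathbf{x}_{\text{neg}}\right)}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}$$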
The left side of the preceding equation is the distance between the positive and negative hyperplanes, known as the margin, which we aim to maximize.
Now, the objective function of the SVM becomes the maximization of this margin by maximizing $\frac{2}{\lVert \mathbf{w} \rVert}$ under the constraint that the examples are classified correctly, which can be written as:
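$$w_0 + \mathbf{w}^T\mathbf{x}^{(i)} \ge 1 \ \text{ if } \ y^{(i)} = 1$$

$$w_0 + \mathbf{w}^T\mathbf{x}^{(i)} \le -1 \ \text{ if } \ y^{(i)} = -1$$

for $i = 1, \dots, N$, or more compactly, $y^{(i)}\left(w_0 + \mathbf{w}^T\mathbf{x}^{(i)}\right) \ge 1$ for all $i$.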
$N$ is the number of examples in our dataset.
In practice, it is easier to minimize the reciprocal term, $\frac{1}{2}\lVert \mathbf{w} \rVert^2$.
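Minimizing $\frac{1}{2}\lVert \mathbf{w} \rVert^2$ has the same solution as maximizing $\frac{2}{\lVert \mathbf{w} \rVert}$, and the squared norm gives a convex, differentiable objective that can be solved with standard quadratic programming:

$$\min_{\mathbf{w},\, w_0} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y^{(i)}\left(w_0 + \mathbf{w}^T\mathbf{x}^{(i)}\right) \ge 1, \quad i = 1, \dots, N.$$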