In an ideal classification setting, we want a hyperplane to separate the positive and negative samples. The hyperplane equation is $\mathbf{w}^T\mathbf{x}+b=0$. We want this hyperplane to make the division as robust as possible; graphically, the hyperplane should sit right in the middle between the positive-class samples and the negative-class samples. This idea leads to the SVM algorithm.
Why is the classification constraint greater than or equal to 1 instead of 0?
For the hyperplane $\mathbf{w}^T\mathbf{x}+b=0$:
Samples classified as positive lie on one side of the plane and satisfy $\mathbf{w}^T\mathbf{x}_i+b > 0,\ y_i = +1$.
Samples classified as negative lie on the other side of the plane and satisfy $\mathbf{w}^T\mathbf{x}_i+b < 0,\ y_i = -1$.
We can always find a positive $t$, whether $t$ is 0.001, 1000, or any other value, such that
samples classified as positive satisfy $\mathbf{w}^T\mathbf{x}_i+b \geq t,\ y_i = +1$, and
samples classified as negative satisfy $\mathbf{w}^T\mathbf{x}_i+b \leq -t,\ y_i = -1$.
Dividing both sides by $t$, i.e. applying a scaling transformation,
samples classified as positive satisfy $\mathbf{w'}^T\mathbf{x}_i+b' \geq 1,\ y_i = +1$, and
samples classified as negative satisfy $\mathbf{w'}^T\mathbf{x}_i+b' \leq -1,\ y_i = -1$.
The hyperplanes $\mathbf{w}^T\mathbf{x}+b=0$ and $\mathbf{w'}^T\mathbf{x}+b'=0$ represent the same plane.
This explains a point that can be confusing: we clearly know that a sample classified as positive satisfies $\geq 0$ and a sample classified as negative satisfies $\leq 0$, yet many derivations write $\geq +1$ and $\leq -1$. The reason is the scaling described above.
After scaling, for notational convenience the formulas below continue to use the symbols $\mathbf{w}$ and $b$ instead of $\mathbf{w'}$ and $b'$.
To take this one step further, the two planes $x+y+z-3=0$ and $2x+2y+2z-6=0$ represent the same plane, but substituting the same point, say $(2,2,2)$, gives different results: one is 3, the other is 6. This is not a contradiction; the reason is that the scales of the two representations differ. If the value 3 has to be compared with 2 to decide that the point belongs to the positive class, then the value 6 has to be compared with 4 to reach the same decision.
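As a quick numerical check of this point (a small sketch using the made-up plane from the example above), multiplying $\mathbf{w}$ and $b$ by the same positive constant leaves the plane unchanged but rescales the value of $\mathbf{w}^T\mathbf{x}+b$:

```python
import numpy as np

# The plane x + y + z - 3 = 0 and its scaled version 2x + 2y + 2z - 6 = 0.
w1, b1 = np.array([1.0, 1.0, 1.0]), -3.0
w2, b2 = 2 * w1, 2 * b1

on_plane = np.array([1.0, 1.0, 1.0])   # lies on both planes
sample = np.array([2.0, 2.0, 2.0])     # the point (2, 2, 2) from the text

print(w1 @ on_plane + b1, w2 @ on_plane + b2)  # 0.0 0.0  -> same plane
print(w1 @ sample + b1, w2 @ sample + b2)      # 3.0 6.0  -> different scales
```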
Our scaling above makes all of these values comparable to 1. The sum of the distances from the plane to the two nearest points of opposite classes (the margin) is $\frac{2}{\Vert\mathbf{w}\Vert}$, and our objective is to make this value as large as possible. After some manipulation, we obtain the initial optimization problem of the support vector machine:
$\min\limits_{\mathbf{w},b} \frac{1}{2}{\Vert \mathbf{w} \Vert}^2$
$s.t.\ y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1,\ i=1,2,\dots,m$
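This primal problem is a quadratic program and can be handed to a general-purpose solver. The following is a minimal sketch (using `scipy.optimize.minimize` with SLSQP and a made-up toy data set, not a production SVM solver):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (hypothetical data, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

def objective(params):
    w = params[:n]
    return 0.5 * np.dot(w, w)              # (1/2) ||w||^2

def margin_constraints(params):
    w, b = params[:n], params[n]
    return y * (X @ w + b) - 1.0            # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(n + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:n], res.x[n]
print("w =", w, "b =", b)
```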
Solving the optimization problem with the dual method and SMO
To solve the problem above, we need some optimization theory. The first step is to use the Lagrange multiplier method to obtain its dual problem (note: $\boldsymbol{\alpha}$ without a subscript below denotes the vector of multipliers):
$L(\mathbf{w},b,\boldsymbol{\alpha}) = \frac{1}{2}{\Vert \mathbf{w} \Vert}^2 + \sum\limits_{i=1}^{m}\alpha_i (1-y_i(\mathbf{w}^T \mathbf{x}_i +b))$
Taking the partial derivatives of $L(\mathbf{w},b,\boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$ and setting them to zero, we get:
$\mathbf{w} = \sum\limits_{i=1}^{m} \alpha_i y_i \mathbf{x}_i$
$0 = \sum\limits_{i=1}^{m}\alpha_i y_i$
Substituting these two equations into the Lagrangian function gives the dual problem:
$\max\limits_{\boldsymbol{\alpha}} \sum\limits_{i=1}^{m} \alpha_i - \frac{1}{2}\sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
$s.t.\ \sum\limits_{i=1}^{m}\alpha_i y_i = 0$
$\alpha_i \geq 0,\ i=1,2,3,\dots,m$
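For completeness, here is the substitution step that produces this objective (a standard derivation, spelled out here rather than taken from the original text). Expanding the Lagrangian gives
$L(\mathbf{w},b,\boldsymbol{\alpha}) = \frac{1}{2}{\Vert \mathbf{w} \Vert}^2 + \sum\limits_{i=1}^{m}\alpha_i - \sum\limits_{i=1}^{m}\alpha_i y_i \mathbf{w}^T\mathbf{x}_i - b\sum\limits_{i=1}^{m}\alpha_i y_i$
The last term vanishes because $\sum\limits_{i=1}^{m}\alpha_i y_i = 0$, and substituting $\mathbf{w} = \sum\limits_{i=1}^{m}\alpha_i y_i \mathbf{x}_i$ gives
$\frac{1}{2}{\Vert \mathbf{w} \Vert}^2 = \frac{1}{2}\sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \quad \sum\limits_{i=1}^{m}\alpha_i y_i \mathbf{w}^T\mathbf{x}_i = \sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
so $L = \sum\limits_{i=1}^{m}\alpha_i - \frac{1}{2}\sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$, which is exactly the dual objective above.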
Once $\boldsymbol{\alpha}$ is obtained, $\mathbf{w}$ and $b$ can be recovered, giving the model:
$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$
$= \sum\limits_{i=1}^{m}\alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
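As a small sanity check of this expansion (a sketch using scikit-learn's `SVC` with a linear kernel and a made-up toy set; a large `C` approximates the hard-margin problem), the decision function can be reproduced directly from the dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data, only to illustrate f(x) = sum_i alpha_i y_i x_i^T x + b.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_.ravel()     # alpha_i * y_i for the support vectors
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([1.0, 0.5])
f_manual = alpha_y @ (sv @ x_new) + b           # sum_i alpha_i y_i x_i^T x + b
f_sklearn = clf.decision_function([x_new])[0]
print(f_manual, f_sklearn)                      # the two values agree
```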
The Sequential Minimal Optimization (SMO) algorithm is then used to solve for $\boldsymbol{\alpha}$.
The idea of SMO is to fix all parameters except a small subset and optimize over that subset. However, if we fix every variable except a single $\alpha_i$, the constraint $\sum\limits_{i=1}^{m}\alpha_i y_i = 0$ uniquely determines $\alpha_i$, leaving nothing to optimize. Therefore SMO selects two parameters $\alpha_i$ and $\alpha_j$ at a time and optimizes them jointly while the others stay fixed, repeating until all of the $\alpha$ values are found.
As for $b$, it can be solved from the constraint $y_s f(\mathbf{x}_s) = 1$ satisfied by every support vector (in practice, by averaging over the support vectors).
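The real SMO algorithm uses heuristics to pick the pair $(\alpha_i, \alpha_j)$ and to update $b$; the sketch below follows the commonly taught simplified-SMO recipe instead (random choice of the second multiplier, linear kernel, and a soft-margin parameter `C`, which the soft-margin section below motivates), so it is for illustration rather than production use:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    # Minimal SMO-style solver for the dual problem with a linear kernel.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    alpha, b = np.zeros(m), 0.0
    K = X @ X.T                                  # precomputed linear-kernel Gram matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = (alpha * y) @ K[:, i] + b - y[i]       # prediction error on sample i
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = int(rng.choice([k for k in range(m) if k != i]))
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # Box bounds keeping 0 <= alpha <= C and sum_i alpha_i y_i = 0.
                if y[i] != y[j]:
                    L, H = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
                # Bias update, preferring a multiplier that stays strictly inside (0, C).
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Usage on a tiny separable toy set (hypothetical data); w is recovered as sum_i alpha_i y_i x_i.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = simplified_smo(X, y, C=10.0)
w = (alpha * y) @ X
print("w =", w, "b =", b)
```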
Kernel functions:
Ideally there is a plane in the original space that separates the positive and negative classes, but in real situations this is often impossible. Instead, we can map the data into a higher-dimensional space and look for a separating plane there. For example, in the XOR problem there is no line in the two-dimensional plane that separates the classes, but after mapping the data into three dimensions we can find a plane that does.
So when we revisit our optimization problem, we want a mapping $\phi$ that sends $\mathbf{x}_i$ into a higher-dimensional space. The dual problem above then becomes:
$\max\limits_{\boldsymbol{\alpha}} \sum\limits_{i=1}^{m} \alpha_i - \frac{1}{2}\sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$
The problem with mapping to a high dimension, however, is that the amount of computation becomes too large. So we look for a function $\kappa$ that operates in the lower-dimensional space but whose result equals the inner product after mapping to the higher-dimensional space, i.e. $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. Fortunately, under certain conditions such functions can be found; they are called kernel functions. Each kernel function $\kappa$ also corresponds to a mapping $\phi$.
Common kernel functions:
Linear kernel: $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
Polynomial kernel: $\kappa(\mathbf{x}_i, \mathbf{x}_j) = {(\mathbf{x}_i^T \mathbf{x}_j)}^d$, where $d \geq 1$ is the degree of the polynomial
Gaussian kernel: $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\frac{{\Vert \mathbf{x}_i-\mathbf{x}_j\Vert}^2}{2 \sigma^2})$, where $\sigma > 0$ is the bandwidth (width) of the Gaussian kernel
Laplace kernel: $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\frac{\Vert \mathbf{x}_i-\mathbf{x}_j\Vert}{2 \sigma})$, where $\sigma > 0$
Sigmoid kernel: $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta\, \mathbf{x}_i^T \mathbf{x}_j + \theta)$, where $\tanh$ is the hyperbolic tangent function and $\beta > 0,\ \theta < 0$
The linear kernel represents no transformation at all: $\mathbf{x}_i^T \mathbf{x}_j$ is simply mapped to $\mathbf{x}_i^T \mathbf{x}_j$.
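As a compact reference (hand-written sketches of the formulas above, following the notation in the table, including its $2\sigma$ Laplace denominator), these kernels can be implemented directly:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=2):
    # d >= 1 is the degree of the polynomial
    return (x @ z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # sigma > 0 is the bandwidth (width) of the Gaussian kernel
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def laplace_kernel(x, z, sigma=1.0):
    # written with 2*sigma in the denominator, as in the table above
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma))

def sigmoid_kernel(x, z, beta=1.0, theta=-1.0):
    # requires beta > 0 and theta < 0
    return np.tanh(beta * (x @ z) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(x, z, sigma=2.0))
```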
Taking the degree-2 polynomial kernel as an example, we can derive its mapping function:
$\kappa(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$
$= (\mathbf{x}^T\mathbf{z})(\mathbf{x}^T\mathbf{z})$
$= \left(\sum\limits_{i=1}^{n} x_i z_i\right)\left(\sum\limits_{j=1}^{n} x_j z_j\right)$
$= \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n} x_i x_j z_i z_j$
$= \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n} (x_i x_j)(z_i z_j)$
$= \phi(\mathbf{x})^T \phi(\mathbf{z})$
where $n$ is the dimension of $\mathbf{x}$ and $\phi(\mathbf{x})$ is the $n^2$-dimensional vector whose components are all the products $x_i x_j$, $i,j = 1,\dots,n$.
For example, the vector $(x_1;\ x_2;\ x_3)$ is mapped to $(x_1x_1;\ x_1x_2;\ x_1x_3;\ x_2x_1;\ x_2x_2;\ x_2x_3;\ x_3x_1;\ x_3x_2;\ x_3x_3)$, so a 3-dimensional vector is mapped into 9 dimensions.
Fortunately, with the kernel function we only need to compute in 3 dimensions instead of first mapping to 9 dimensions and then computing, which saves a great deal of computation.
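A quick numerical check (a sketch with made-up vectors) confirms that computing the kernel in 3 dimensions gives the same value as the explicit 9-dimensional mapping:

```python
import numpy as np

def phi(x):
    # Explicit feature map of the degree-2 polynomial kernel:
    # all pairwise products x_i * x_j (3 dimensions -> 9 dimensions).
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2        # computed in the original 3-dimensional space
mapped_value = phi(x) @ phi(z)     # computed in the 9-dimensional feature space
print(kernel_value, mapped_value)  # both print 20.25
```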
Soft margin and hinge loss
The problem discussed above is the "hard margin" case, in which every sample must be classified correctly. In practice we relax this restriction: we no longer require every sample to satisfy $y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1,\ i=1,2,\dots,m$, but we still want as few violating samples as possible. So we penalize the samples that do not meet the condition and introduce the concept of a loss function.
Applying this idea to the objective function, we can show that the loss function the SVM actually uses is called the hinge loss. This is covered in the SVM and hinge loss section of my blog post on loss functions.
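As a small illustration (made-up margin values, not from the original post), the hinge loss penalizes a sample only when its margin $y_i(\mathbf{w}^T\mathbf{x}_i+b)$ falls below 1:

```python
import numpy as np

def hinge_loss(margins):
    # margins = y_i * (w^T x_i + b); zero loss once the margin reaches 1
    return np.maximum(0.0, 1.0 - margins)

margins = np.array([2.5, 1.0, 0.3, -0.7])
print(hinge_loss(margins))  # [0.  0.  0.7 1.7]
```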