1. Linearly Separable SVM
SVM is a supervised learning classification algorithm that was originally designed for binary classification problems.
For a linearly separable binary classification problem, we can find infinitely many hyperplanes that separate the two classes of samples. (A hyperplane in one dimension is a point, in two dimensions a line, in three dimensions a plane, and so on.)
Each of these hyperplanes successfully places the two classes of samples on opposite sides, but which one is best?
Intuitively, for a sample point close to the hyperplane, the probability that its label belongs to a given class is near 0.5, so the confidence in the predicted label is low; the farther a sample point is from the hyperplane, the higher the probability that its label is a particular class, and the higher the confidence.
The optimal hyperplane found by a linear SVM is therefore the one that stays as far as possible from the data points of every class: it is obtained by maximizing the margin. The definition of the margin is shown in the following figure:
In general, the middle of the margin is a point-free region. To avoid favoring either class, the minimum distance from the hyperplane to the positive samples equals the minimum distance from the hyperplane to the negative samples; in other words, the margin is determined by the minimum distance from the hyperplane to the two sample sets. The larger a hyperplane's margin, the lower the probability of misclassification and the more reliable the classification.

2. Hyperplane and Margin
The next task is to maximize the margin in order to find the optimal hyperplane. Before that, the hyperplane must be defined so that the distance from a point to the hyperplane can be computed.
The hyperplane can be defined as $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $\mathbf{w}$ is the normal vector of the hyperplane and the scalar $b$ is the intercept.
The vector $\mathbf{x}$ is the feature vector of a sample. The dot product $\mathbf{w} \cdot \mathbf{x}$ can be understood as the unnormalized projection of $\mathbf{x}$ onto $\mathbf{w}$.
With the hyperplane defined this way, every sample point above and to the right of the hyperplane satisfies $\mathbf{w} \cdot \mathbf{x} + b > 0$, and every sample point below and to the left of it satisfies $\mathbf{w} \cdot \mathbf{x} + b < 0$.
Of course, all points lying on the hyperplane itself satisfy $\mathbf{w} \cdot \mathbf{x} + b = 0$.
Further, we can scale the normal vector $\mathbf{w}$ and the intercept $b$ by the same factor so that
$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$ for positive samples and $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ for negative samples,
where $\mathbf{x}_i$ is the feature vector of the $i$-th sample. Letting the label of a positive sample be $y_i = 1$ and the label of a negative sample be $y_i = -1$, the two inequalities combine into a single constraint:
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1.$
This is illustrated in the following figure:
The "support vectors" are the points lying on the two boundary hyperplanes H1 and H2 on either side of the margin. The distance from any point on H1 or H2 (a support vector) to the separating hyperplane is $\frac{1}{\|\mathbf{w}\|}$, so the width of the margin is $\frac{2}{\|\mathbf{w}\|}$.
The derivation is as follows: the distance from a point $\mathbf{x}_0$ to the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ is $\frac{|\mathbf{w} \cdot \mathbf{x}_0 + b|}{\|\mathbf{w}\|}$; for a support vector, $|\mathbf{w} \cdot \mathbf{x}_0 + b| = 1$, so its distance is $\frac{1}{\|\mathbf{w}\|}$, and the margin, which spans both sides of the separating hyperplane, is $\frac{2}{\|\mathbf{w}\|}$.
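As a quick numeric check of the distance formula, here is a minimal sketch; the hyperplane and point below are hypothetical numbers chosen for illustration, not taken from the text:

```python
import numpy as np

# a hypothetical 2-D hyperplane w.x + b = 0, scaled so that |w.x + b| = 1 on H1 and H2
w = np.array([3.0, 4.0])
b = -2.0

x_on_H1 = np.array([0.0, 0.75])                      # satisfies w.x + b = +1
distance = abs(np.dot(w, x_on_H1) + b) / np.linalg.norm(w)

print(distance)                                       # 1/||w|| = 0.2
print(2 / np.linalg.norm(w))                          # margin width 2/||w|| = 0.4
```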
3. Maximizing the Margin
Above we found the expression for the margin, $\frac{2}{\|\mathbf{w}\|}$, as well as the constraints on $\mathbf{w}$ and $b$; we now need to maximize the margin subject to those constraints.
For ease of calculation, maximizing $\frac{2}{\|\mathbf{w}\|}$ is converted into minimizing $\frac{1}{2}\|\mathbf{w}\|^2$.
Finally, the task becomes the optimization problem
$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \; i = 1, \dots, n,$
i.e. finding the minimum under the constraint conditions.
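To make the primal problem concrete, here is a small sketch that solves it directly with a general-purpose optimizer on a hypothetical toy data set (scipy's SLSQP; a real SVM solver would use a dedicated QP or SMO routine):

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# primal variables packed as theta = [w_0, w_1, b]
def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)                  # (1/2) ||w||^2

# inequality constraints y_i (w . x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```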
We can solve this problem with the method of Lagrange multipliers, which gives the new function
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right],$
where $\alpha_i$ is the multiplier of the $i$-th constraint and $\alpha_i \ge 0$.
Taking the partial derivatives of this function with respect to $\mathbf{w}$ and $b$ and setting them to zero gives the following two equations:
From $\frac{\partial L}{\partial \mathbf{w}} = 0$ we get $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$.
From $\frac{\partial L}{\partial b} = 0$ we get $\sum_{i=1}^{n} \alpha_i y_i = 0$.
Substituting these two results back into $L$ gives
$L(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j).$
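Writing out this standard substitution step by step (using $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$):

$$
\begin{aligned}
L &= \frac{1}{2}\Big(\sum_i \alpha_i y_i \mathbf{x}_i\Big)\cdot\Big(\sum_j \alpha_j y_j \mathbf{x}_j\Big)
   - \sum_i \alpha_i y_i \mathbf{x}_i \cdot \Big(\sum_j \alpha_j y_j \mathbf{x}_j\Big)
   - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j).
\end{aligned}
$$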
Note that $L$ depends on the training samples only through the pairwise dot products $\mathbf{x}_i \cdot \mathbf{x}_j$ of their feature vectors.
Next, by Lagrangian duality, the dual of the original problem is the max-min problem
$\max_{\boldsymbol{\alpha}} \min_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\alpha}).$
Having already minimized over $\mathbf{w}$ and $b$, what remains is the maximization
$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j)$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
Flipping the sign converts this maximization into the equivalent dual minimization problem
$\min_{\boldsymbol{\alpha}} \ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j) - \sum_{i=1}^{n} \alpha_i,$
subject to the same constraints: $\sum_{i=1}^{n} \alpha_i y_i = 0$, $\alpha_i \ge 0$.
The optimal solution $\boldsymbol{\alpha}^*$ is obtained by minimizing this function, in practice with the SMO (Sequential Minimal Optimization) algorithm.
Once $\boldsymbol{\alpha}^*$ is known, the optimal normal vector follows from $\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i^* y_i \mathbf{x}_i$; substituting any support vector into $y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b) = 1$ and rearranging gives the optimal intercept $b^*$.
Relationship with the support vectors: when $\mathbf{x}_i$ is not a support vector, $\alpha_i^* = 0$; when $\mathbf{x}_i$ is a support vector, $\alpha_i^* \neq 0$.
Since only the support vectors matter in determining the separating hyperplane, the model trained by an SVM depends entirely on the support vectors: even if every non-support-vector sample is removed from the training set, training yields exactly the same model.
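This property can be checked empirically; below is a minimal sketch on hypothetical toy data (a very large C is used so the fit is essentially hard-margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# hypothetical linearly separable toy data: two well-separated clusters
X = np.r_[rng.randn(20, 2) - [3, 3], rng.randn(20, 2) + [3, 3]]
y = np.array([0] * 20 + [1] * 20)

clf_full = SVC(kernel="linear", C=1e6).fit(X, y)

# refit using only the support vectors of the first model
sv = clf_full.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

print(clf_full.coef_, clf_full.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)   # the same hyperplane, up to numerical error
```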
So the separating hyperplane (decision function) can be expressed as
$D(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i \, (\mathbf{x}_i \cdot \mathbf{x}) + b^*,$
where $\mathbf{x}$ is the feature vector of the sample to be tested and $y_i$ is the label of the $i$-th training sample (1 or -1).
Substituting a test sample into $D$: if $D < 0$, the predicted label is -1; if $D > 0$, the predicted label is 1. A minimal sketch of this computation is shown below.
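The sketch reproduces $D(\mathbf{x})$ from a fitted scikit-learn SVC on hypothetical toy data (SVC's `dual_coef_` already stores the products $\alpha_i^* y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# hypothetical toy data with labels -1 / +1
X = np.r_[rng.randn(20, 2) - [3, 3], rng.randn(20, 2) + [3, 3]]
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)

x_test = np.array([1.0, 2.0])
# D(x) = sum_i alpha_i* y_i (x_i . x) + b*, summing over the support vectors only
D = np.dot(clf.dual_coef_[0], clf.support_vectors_ @ x_test) + clf.intercept_[0]

print(D, clf.decision_function([x_test])[0])   # the two values should match
print("predicted label:", 1 if D > 0 else -1)
```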
4. From Hard Margin to Soft Margin

Sometimes, when there are noisy samples, the training set cannot be strictly linearly separated (even with the kernel functions introduced later), as shown in the figure:
If we insist that every sample point satisfy the original constraints while looking for the maximum margin between the positive and negative classes, the whole problem may have no solution. This approach is called hard margin classification, because it rigidly requires every sample point to satisfy the constraint, i.e. its distance to the separating hyperplane must exceed a certain value.
As the figure shows, hard margin classification is easily controlled by a few points. To remove this sensitivity, we can allow some points whose distance to the separating hyperplane does not meet the original requirement.
Our hard constraints were
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n.$
To introduce fault tolerance, we add a slack variable $\xi_i \ge 0$ to each sample point, and the soft constraints become
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n.$
When certain points are unusual (for example outliers), the new constraints mean that we give up classifying them exactly, which is a loss for the classifier. However, sacrificing these special points also brings a benefit: the hyperplane no longer has to shift toward them, so a larger margin can be obtained.
To trade off this loss against the benefit, i.e. to reduce the loss while enlarging the margin, the objective function must change as well; we now minimize
$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i,$
where $n$ is the number of samples in the training set. Adding the loss term to the objective requires a penalty factor $C$, which is a hyper-parameter of the model. This method is known as the first-order soft margin classifier.
When the slack term is of second order, i.e. we minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i^2$, the method is called the second-order soft margin classifier.
Because at the optimum $\xi_i = \max\big(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\big)$, the loss function can be rewritten as
$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \max\big(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\big).$
The hinge loss appears here: when a sample point lies on the correct side of the boundary and outside the margin, i.e. $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$, its hinge loss is 0. A minimal numeric sketch follows.
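A small numpy sketch of the per-sample hinge loss, using hypothetical decision values:

```python
import numpy as np

def hinge_loss(y, f):
    """Per-sample hinge loss max(0, 1 - y * f(x)), with labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([1, 1, -1, -1])
f = np.array([2.3, 0.4, -1.7, 0.2])   # hypothetical decision values w.x + b
print(hinge_loss(y, f))               # samples with y*f >= 1 contribute zero loss
```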
The larger the penalty factor $C$, the fewer training points end up misclassified. In the formula above, all sample points share one penalty factor; of course, different sample points can also be given different penalty factors. The effect of $C$ is illustrated by the sketch below.
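A minimal sketch of how $C$ affects the fit, on hypothetical overlapping data (the exact counts depend on the random data; the trend is what matters):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
# hypothetical noisy, overlapping two-class data
X = np.r_[rng.randn(50, 2) - [1, 1], rng.randn(50, 2) + [1, 1]]
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = np.sum(clf.predict(X) != y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, {errors} training errors")
```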
5. From Linear to Nonlinear

Everything above deals with the linearly separable case. In the linearly non-separable case, no hyperplane in the current feature space can separate the training samples. One approach is to apply a nonlinear mapping to the feature vectors, map them into a higher-dimensional space, and then find the optimal hyperplane in that high-dimensional space; however, the computational complexity of doing this explicitly is very high. The other approach is the kernel trick: a kernel function replaces the inner product of the mapped vectors, which solves the complexity problem.
A concrete example is shown in the following figure:
If the nonlinear mapping is denoted by $\phi$, the kernel function equals the inner product of the two mapped vectors: $K(\mathbf{a}, \mathbf{b}) = \phi(\mathbf{a}) \cdot \phi(\mathbf{b})$. The computational cost of the left-hand side is usually much lower than that of the right-hand side.
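As a quick check of this equivalence, here is a minimal numpy sketch; the explicit degree-2 feature map below is one common choice and is only for illustration:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D vector."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel (a.b + coef0)^degree, computed in the original space."""
    return (np.dot(a, b) + coef0) ** degree

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))   # inner product in the mapped 6-D space
print(poly_kernel(a, b))        # same value, but no explicit mapping needed
```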
An example comparing the two computational costs is shown in the following figure:
The common kernel functions are:
Polynomial kernel of degree $h$: $K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + 1)^h$.
Gaussian (RBF) kernel: $K(\mathbf{a}, \mathbf{b}) = e^{-\gamma \|\mathbf{a} - \mathbf{b}\|^2}$, where $\gamma$ (gamma) is a hyper-parameter.
In general, the kernel function is chosen using prior knowledge; one can also try different kernel functions and decide based on the experimental results.

6. From Binary to Multi-class Classification
So far we have only discussed binary classification with SVM. Multi-class problems can be solved by extending the SVM with the one-vs-rest strategy; such a solution generally involves many SVMs: as many classes as there are labels, that many SVMs. A minimal sketch is shown below.
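A minimal one-vs-rest sketch with scikit-learn on the iris data (three classes, hence three underlying binary SVMs):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# one binary SVM per class: 3 classes -> 3 underlying SVMs
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))    # 3
print(ovr.predict(X[:5]))
```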
7. The Difference Between SVM and LR

For linear models, the position of the SVM's separating hyperplane is determined not by all training samples but only by the support vectors, whereas the logistic regression model takes into account the effect of every training sample on the parameters. This difference can essentially be seen from their loss functions.
For nonlinear problems, support vector machines use the kernel mechanism, while LR usually relies on explicitly constructed polynomial features; a brief comparison is sketched below.
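A minimal sketch contrasting the two approaches on hypothetical nonlinear data (two concentric circles); the specific dataset and polynomial degree are only for illustration:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# hypothetical non-linearly-separable data: two concentric circles
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

svm_rbf = SVC(kernel="rbf").fit(X, y)                                  # kernel trick
lr_poly = make_pipeline(PolynomialFeatures(degree=2),
                        LogisticRegression(max_iter=1000)).fit(X, y)   # explicit polynomial features
print(svm_rbf.score(X, y), lr_poly.score(X, y))                        # both handle the nonlinearity
```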
The loss function of logistic regression is the logistic loss, $\log\big(1 + e^{-y f(\mathbf{x})}\big)$, while the loss function of the SVM is the hinge loss, $\max\big(0,\, 1 - y f(\mathbf{x})\big)$.

8. Code Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

np.random.seed(8)  # fix the random seed for reproducibility

# Linearly separable case:
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = [0] * 20 + [1] * 20

print(X[0])
print(X[20])
print(y)
```

Output:

```
[-2.90879528 -1.90871727]
[3.09120472 4.09128273]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
```python
# Build the SVM model
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
```

Output:

```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```
```python
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))

# Get the vector w: w_0*x_1 + w_1*x_2 + b = 0
w = clf.coef_[0]
f = w[0] * xx1 + w[1] * xx2 + clf.intercept_[0] + 1  # add 1 so the -1 contour is drawn: [-1, 0, 1] + 1 = [0, 1, 2]

plt.contour(xx1, xx2, f, [0, 1, 2], colors='r')      # draw the separating hyperplane and H1, H2
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], color='k')  # mark the support vectors
plt.show()
```
```python
# Nonlinearly separable case:
from sklearn import datasets

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(iris.target_names)
```

Output:

```
['setosa' 'versicolor' 'virginica']
```
```python
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3.)

from sklearn.preprocessing import StandardScaler  # standardization

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

from sklearn.model_selection import GridSearchCV

# Cross-validation to tune the hyper-parameters
param_grid = {'C': [1e1, 1e2, 1e3, 5e3, 1e4, 5e4],
              'gamma': [0.0001, 0.0008, 0.0005, 0.008, 0.005]}
clf = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid, cv=10)
clf = clf.fit(X_train_std, y_train)
print(clf.best_estimator_)
```
Output:

```
SVC(C=10.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
print(clf.score(X_test_std, y_test))   # accuracy on the test set: 1.0
```
```python
y_pred = clf.predict(X_test_std)

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred, labels=range(iris.target_names.shape[0])))
```
Output:

```
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        18
 versicolor       1.00      1.00      1.00        17
  virginica       1.00      1.00      1.00        15

avg / total       1.00      1.00      1.00        50

[[18  0  0]
 [ 0 17  0]
 [ 0  0 15]]
```

Recall = #(true positives) / (#(true positives) + #(false negatives)): the fraction of the actual positive samples that are predicted correctly.
Precision = #(true positives) / (#(true positives) + #(false positives)): the fraction of the samples predicted as positive that really are positive.
The F1-score is the harmonic mean of recall and precision: F1 = 2 / (1/recall + 1/precision); the closer recall and precision are to each other, the higher the F1-score. A model with a large gap between recall and precision usually has little practical value.
In the confusion matrix, each row corresponds to the true class and each column to the predicted class; the larger the values on the diagonal, the better the predictions.