Support Vector Machines (SVM, for short) are often regarded as among the best off-the-shelf supervised learning algorithms. In plain terms, an SVM is a binary classification model: its basic form is the linear classifier with the largest margin in feature space, its learning strategy is margin maximization, and the learning problem finally reduces to solving a convex quadratic programming problem.
(i) Understanding the fundamentals of SVM
1, The essence of SVM: classification
Given some data points belonging to two different classes, we want to find a linear classifier that divides the data into the two categories: the most basic, linearly separable case. If x is a data point and y its class label (y takes the value 1 or -1, representing the two classes), the learning goal of the linear classifier is to find a separating boundary in the n-dimensional data space so that the data are split into the two classes. The boundary equation can be expressed as w^T·x + b = 0 (the T in w^T denotes transpose, x is a data point, a vector with n attributes, w is likewise a vector of size n, and b is a constant).
On a two-dimensional plane, the boundary above is a straight line, such as a line separating black and white dots. In three-dimensional space the boundary is a plane, and in higher dimensions it takes other forms, so this boundary is generally called a hyperplane (hyper plane).
Figure 1
Here we assume the statistical samples are evenly distributed, so for the two classes (class 1 or -1) the threshold can be set to 0. In real training data the samples are often imbalanced, and an algorithm is needed to select the optimal threshold (for example, using an ROC curve).
So the SVM classifier learns a classification function f(x) = w^T·x + b: when f(x) = 0, x is a point on the hyperplane; points with f(x) > 0 correspond to data points with y = 1, and points with f(x) < 0 correspond to y = -1. In other words, when classifying a new data point x, substitute it into f(x): if f(x) < 0, assign x the class -1; if f(x) > 0, assign the class 1; if f(x) = 0, the point cannot be classified.
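As a minimal sketch of this decision rule (numpy and the toy values of w and b below are assumptions for illustration, not taken from the text):

import numpy as np

def predict(w, b, x):
    """Assign class +1 if f(x) = w.x + b > 0, -1 if f(x) < 0; 0 means the point lies on the hyperplane."""
    fx = np.dot(w, x) + b
    if fx > 0:
        return 1
    elif fx < 0:
        return -1
    return 0  # exactly on the hyperplane: cannot be classified

# toy example: the line x1 + x2 - 1 = 0 in the two-dimensional plane
w, b = np.array([1.0, 1.0]), -1.0
print(predict(w, b, np.array([2.0, 2.0])))   # -> 1
print(predict(w, b, np.array([0.0, 0.0])))   # -> -1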
Take the two-dimensional plane as an example to illustrate the principle of SVM. It is not hard to see that there are many hyperplanes (straight lines in the two-dimensional plane) that achieve the separation, as shown in Figure 2. How do we determine which one is the optimal hyperplane? Intuitively, the optimal hyperplane should be the line that best separates the two classes of data, and the criterion for "best" is that the margin to the nearest data points on both sides is maximal, i.e. "the nearest sample points are as far from the hyperplane as possible". Therefore we look for the hyperplane with the "maximum margin". The next question is: how do we find this "maximum margin"?
Figure 2
2, calculates "maximum interval" based on geometric interval2.1 function Interval
For any data point (x, y), |w^T·x + b| is proportional to the distance from the point x to the hyperplane w^T·x + b = 0, and whether the sign of w^T·x + b agrees with the sign of the class label y indicates whether the classification is correct. Therefore the sign of y(w^T·x + b) can be used to indicate the correctness of the classification (positive means correct), which leads to the concept of the functional margin (functional margin). The functional margin is defined as: γ̂ = y(w^T·x + b)
The minimum of the functional margins over all sample points (x_i, y_i) is the functional margin of the hyperplane with respect to the training data set: γ̂ = min γ̂_i (i = 1, ..., N)
In fact, the functional margin is just the point-to-hyperplane distance formula without the normalization by ||w||.
2.2 Geometric margin
Suppose that for a point x, the corresponding point obtained by projecting it vertically onto the hyperplane is x0, w is a vector perpendicular to the hyperplane, and d is the distance from the sample x to the hyperplane, as shown in the figure; we then have x = x0 + d·w/||w||.
Here ||w|| = sqrt(w^T·w) is the 2-norm of w.
And because x0 is a point on the hyperplane, it satisfies f(x0) = 0; substituting into the hyperplane equation w^T·x0 + b = 0, we can solve for: d = (w^T·x + b)/||w|| = f(x)/||w||
The geometric margin of a data point with respect to the hyperplane is then defined as the signed version of this distance: γ = y(w^T·x + b)/||w|| = γ̂/||w||
The minimum of the geometric margins over all sample points (x_i, y_i) is the geometric margin of the hyperplane with respect to the training data set: γ = min γ_i (i = 1, ..., N)
The geometric margin is the functional margin divided by ||w||, and can be understood as a normalized functional margin.
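A small numeric sketch of the two margins (the toy samples and the hyperplane below are assumed for illustration); it also shows that scaling w and b changes the functional margin but not the geometric one:

import numpy as np

def functional_margin(w, b, X, y):
    """y_i * (w.x_i + b) for every sample; the data-set margin is the minimum."""
    return y * (X.dot(w) + b)

def geometric_margin(w, b, X, y):
    """Functional margin normalized by ||w||_2."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = np.array([0.5, 0.5]), -2.0

print(functional_margin(w, b, X, y).min())     # data-set functional margin
print(geometric_margin(w, b, X, y).min())      # data-set geometric margin

# scaling w and b changes the functional margin but not the geometric one
print(functional_margin(10 * w, 10 * b, X, y).min())
print(geometric_margin(10 * w, 10 * b, X, y).min())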
2.3 Defining the maximum margin classifier (Maximum Margin Classifier)
As mentioned earlier, the larger the "margin" of the data points from the hyperplane, the higher the confidence of the classification. To make the classification as reliable as possible, we want the hyperplane that maximizes this "margin" value, and this margin is the maximum margin.
The functional margin is not suitable for measuring the maximum margin, because by scaling w and b the functional margin can be made arbitrarily large while the hyperplane stays the same. The geometric margin, in contrast, does not change when w and b are scaled; it changes only when the hyperplane itself changes, so the "margin" of the maximum margin classifier is measured with the geometric margin. The objective function of the maximum margin classifier (maximum margin classifier) can be defined as:
max γ, s.t. y_i(w^T·x_i + b) = γ̂_i ≥ γ̂ (i = 1, 2, ..., N), where γ = γ̂/||w||
According to the previous analysis, "the nearest sample points should be as far from the hyperplane as possible"; translated into a mathematical expression, this becomes the constraint above: every sample's functional margin must be at least γ̂.
According to the previous discussion, even when the hyperplane is fixed, γ̂ can vary with ||w||. To simplify the calculation, γ̂ may be fixed to 1 (essentially equivalent to dividing both sides by γ̂, i.e. taking w' = w/γ̂ and b' = b/γ̂), so the objective function of the maximum margin classifier evolves into: max 1/||w||, s.t. y_i(w^T·x_i + b) ≥ 1, i = 1, 2, ..., N
In the formula, s.t. stands for "subject to"; it introduces the constraints.
Since maximizing 1/||w|| is equivalent to minimizing (1/2)||w||² (this conversion is made only for convenience of solution: the coefficient 1/2 and the square make taking derivatives easier and have no substantive meaning), the objective function is equivalent to the following (w moves from the denominator to the numerator, so the original max problem becomes a min problem; the two problems are clearly equivalent): min (1/2)||w||², s.t. y_i(w^T·x_i + b) ≥ 1, i = 1, 2, ..., N
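To make the optimization problem concrete, here is a sketch that feeds it to a general-purpose solver (scipy and a three-point toy data set are assumptions here; the author's package solves the dual with SMO instead, as described later):

import numpy as np
from scipy.optimize import minimize

# toy linearly separable data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n_features = X.shape[1]

def objective(z):
    w = z[:n_features]
    return 0.5 * np.dot(w, w)          # (1/2) * ||w||^2

# one inequality constraint per sample: y_i (w.x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda z, xi=xi, yi=yi: yi * (np.dot(z[:n_features], xi) + z[-1]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(n_features + 1), constraints=constraints, method="SLSQP")
w, b = res.x[:n_features], res.x[-1]
print("w =", w, "b =", b)              # for this toy set: roughly w = [0.5, 0.5], b = -2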
3, Definition of support vectors (support vector)
SVM is called a support vector machine, yet after all this discussion the meaning of "support vector" has not been made clear. From Figure 3 one can see the two hyperplanes that support the middle gap; they are equidistant from the separating hyperplane in the middle (that distance being the maximum geometric margin max γ we obtained), and the points that "support" these two hyperplanes, lying exactly on them, are called support vectors.
Obviously, because the support vectors (x, y) lie exactly on the boundary, they satisfy y(w^T·x + b) = 1 (the functional margin having been fixed at 1), while all points that are not support vectors, i.e. the points lying behind the boundary, clearly satisfy y(w^T·x + b) > 1. In fact, once the optimal hyperplane is determined, these rear points have no effect on the hyperplane at all. One of the most immediate benefits of SVM is the saving in storage and computation: only the small number of support vectors needs to be stored and used to classify new data. For example, if you used 1 million points to find an optimal hyperplane among which there are 100 support vectors, then only the information of these 100 points needs to be remembered, and subsequent classification uses only these 100 points rather than all 1 million. Of course, apart from "memory-based learning" algorithms such as k-nearest neighbors, algorithms usually do not directly use all the points for subsequent inference anyway.
4, Fault tolerance: slack variables and outliers
The SVM construction described above does not take data noise into account (data points that deviate from their normal positions); such points are called outliers. In the original SVM model the hyperplane itself is determined by only a few support vectors, so the presence of outliers can have a great impact, and if an outlier happens to be among the support vectors its influence is very large.
In the figure, the blue point marked with a black circle is an outlier; it has deviated from the half-space it should lie in. If it were simply ignored, the original separating hyperplane would still do fine, but because of this outlier the separating hyperplane is forced to be squeezed crooked, as shown by the black dashed line in the figure (this is only a sketch; the exact position has not been strictly computed), and the margin shrinks correspondingly. Of course, the more serious case is that if the outlier moves a little farther to the right, we will no longer be able to construct any hyperplane that separates the data.
To deal with this situation, SVM allows data points to deviate from the hyperplane to a certain extent. For example, the length of the black solid line is the distance by which this outlier deviates; if it were moved back by that amount, it would fall exactly on the original separating hyperplane without forcing the hyperplane to deform. After introducing slack, the constraint becomes: y_i(w^T·x_i + b) ≥ 1 - ξ_i, i = 1, ..., N
ξ_i is called the slack variable (slack variable) and corresponds to the amount by which the data point x_i is allowed to deviate from the functional margin. Of course, if ξ_i is allowed to be arbitrarily large, then any hyperplane qualifies. So a term is added to the original objective function so that the sum of the ξ_i is also as small as possible: min (1/2)||w||² + C·Σ ξ_i, s.t. y_i(w^T·x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., N
C is a parameter that controls the weight between the two terms in the objective function ("find the hyperplane with the largest margin" and "ensure that the data points deviate as little as possible"). The ξ_i are among the variables to be optimized (one per sample), whereas C is a constant fixed in advance.
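A minimal sketch of evaluating this soft-margin objective (the toy data, the candidate hyperplane and the C values are assumptions for illustration); the slack of each point is the amount by which it falls short of functional margin 1:

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum(xi), where xi_i = max(0, 1 - y_i (w.x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X.dot(w) + b))
    return 0.5 * np.dot(w, w) + C * slack.sum(), slack

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [2.5, 2.5]])   # the last point plays the outlier
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), -2.0

for C in (0.1, 1.0, 100.0):
    value, slack = soft_margin_objective(w, b, X, y, C)
    print(f"C={C}: objective={value:.2f}, slacks={slack}")   # larger C penalizes the outlier more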
(ii) The solution of SVM
1, Overview of the solution process
After the discussion in the first section, we have made clear that the purpose of SVM is to find a vector w and a constant b that together form a hyperplane, i.e. to learn a classification function f(x): when classifying a new data point x, substitute it into f(x); if f(x) < 0 the class of x is -1, if f(x) > 0 the class of x is 1, and f(x) = 0 cannot be classified. So SVM has to solve for the vector w and the constant b. The solution process converts the problem via a Lagrangian function into a dual problem; the Lagrange multipliers of the dual can be computed with the SMO (Sequential Minimal Optimization) algorithm, and w and b are then calculated from them. The detailed mathematical derivation is not repeated here; building on previous work, only a more intuitive expression of the solution is given.
In the objective function that results, the α_i are the Lagrange multipliers, and u_i = f(x_i) is the estimate for the i-th training sample computed from the current combination of w and b.
It can also be seen that the α_i of data points other than the support vectors are 0, which means those points play no role in the final classification function.
w is recovered from the multipliers as w = Σ_i α_i·y_i·x_i, i.e. in matrix form w = T(Alpha ⊙ y) * X (⊙ denotes element-wise multiplication).
b is updated progressively during the SMO iterations.
So the classification function is: f(x) = Σ_i α_i·y_i·⟨x_i, x⟩ + b
In matrix form, f(x) = (x * T(X)) * (Alpha ⊙ y) + b, where the matrices involved are Alpha (m,1), y (m,1) and X (m,n); X holds the m training samples, each with n attributes, and x (1,n) is the new data item to be classified. Here * denotes matrix multiplication, ⊙ element-wise multiplication, T() the matrix transpose, and ⟨x_i, x⟩ the inner product.
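Sketched in numpy (the trained values of Alpha and b shown are illustrative, stated rather than computed here):

import numpy as np

def f(x_new, alpha, y, X, b):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b; only support vectors (alpha_i > 0) contribute."""
    return np.dot(alpha * y, X.dot(x_new)) + b

# illustrative trained values for the small toy set used earlier: two support vectors out of three points
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25])
b = -2.0

x_new = np.array([4.0, 4.0])
print(np.sign(f(x_new, alpha, y, X, b)))   # -> 1.0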
2, The SMO algorithm
For the detailed derivation of the SMO algorithm, refer to the reference 'SMO algorithm derivation section'.
After a series of derivations, the problem finally becomes: find the minimum of the following objective function. Ψ(α) = (1/2)·Σ_i Σ_j α_i·α_j·y_i·y_j·⟨x_i, x_j⟩ - Σ_i α_i, s.t. 0 ≤ α_i ≤ C, Σ_i α_i·y_i = 0
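A sketch of evaluating this dual objective on toy data (the data and the candidate alpha values are assumptions); it can be used to check that one feasible alpha is better than another:

import numpy as np

def dual_objective(alpha, X, y):
    """(1/2) * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>  -  sum_i alpha_i  (to be minimized)."""
    K = X.dot(X.T)                       # Gram matrix of inner products
    return 0.5 * (alpha * y) @ K @ (alpha * y) - alpha.sum()

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

print(dual_objective(np.array([0.25, 0.0, 0.25]), X, y))   # a better feasible point (lower value)
print(dual_objective(np.array([0.10, 0.0, 0.10]), X, y))   # a feasible but worse point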
To solve for these multipliers, two of them, α1 and α2, are selected at a time while the other multipliers are held fixed, so that the objective function becomes a function of α1 and α2 only. In this way, the original problem is solved by repeatedly picking two multipliers out of the whole set and solving the resulting sub-problem. The stopping condition of the iteration is that the α essentially stop changing or the total number of iterations reaches the maximum iteration count.
To solve this sub-problem, the first question is how to choose α1 and α2 each time. In fact, one of the multipliers is chosen as the one that violates the KKT conditions most seriously, and the other multiplier is chosen according to another constraint. All α have the initial value 0, so the first pair of multipliers at the beginning of the iteration is picked at random. The choice of α1 and α2 therefore uses a heuristic search combined with the following constraints.
For α1, i.e. the first multiplier, a sample violating any of the following three KKT conditions can be used: (i) y_i·u_i ≤ 1 but α_i < C: not satisfied, since α_i should be C
(ii) y_i·u_i ≥ 1 but α_i > 0: not satisfied, since α_i should be 0
(iii) y_i·u_i = 1 but α_i = 0 or α_i = C: not satisfied, since it should be 0 < α_i < C
For the second multiplier α2, the multiplier satisfying the condition max|E1 - E2| can be chosen, where E_k = u_k - y_k is the difference between the estimate u_k computed from the current combination and the actual class y_k of the k-th sample. The main idea of the heuristic selection method is that, each time a Lagrange multiplier is chosen, preference is given to samples whose coefficient α_i is not at the bounds (called non-bound samples in the paper), because coefficients α_i at the bounds (α_i = 0 or C) normally do not change. This heuristic search applies to the selection of the first Lagrange multiplier. Will choosing this way eventually converge? Fortunately, Osuna's theorem tells us that as long as one of the two selected α violates the KKT conditions, the objective function decreases after one iteration step. Samples violating the KKT conditions are not necessarily non-bound; bound samples may violate them as well. So the loop of the algorithm is:
(i) Given the initial values α_i = 0, loop over all samples, and iteratively update every sample encountered that violates the KKT conditions (both bound and non-bound samples).
(ii) From the second round on, iteratively update the preferred (non-bound) samples only; when a pass over these samples yields no α_i update, enter (iii).
(iii) Loop over all samples once more, then enter (ii) again.
(iv) Repeat (ii) and (iii) until the iteration terminates (the maximum number of iterations is reached or no α_i has been updated).
The final convergence condition is that the samples at the bounds (α_i = 0 or α_i = C) all satisfy the KKT conditions and the corresponding α_i change only within a very small range.
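The loop above can be sketched as follows (a simplified skeleton with assumed names; examine_example stands in for the inner pair update, and the author's svm.py may organize this differently):

import numpy as np

def violates_kkt(alpha_i, y_i, u_i, C, tol=1e-3):
    """KKT conditions for the soft-margin dual (the three cases above):
         alpha_i == 0      ->  y_i * u_i >= 1
         0 < alpha_i < C   ->  y_i * u_i == 1
         alpha_i == C      ->  y_i * u_i <= 1
       Returns True when sample i violates them beyond the tolerance."""
    r = y_i * u_i - 1.0
    return (r < -tol and alpha_i < C) or (r > tol and alpha_i > 0)

def smo_outer_loop(y, C, max_iter, examine_example):
    """Outer loop of steps (i)-(iv): alternate a full pass over all samples with
       passes over the non-bound samples (0 < alpha_i < C); stop when a full
       pass changes nothing or the iteration budget is exhausted.
       examine_example(i, alpha) stands in for the inner pair update: choose a
       second multiplier, update the pair, and return 1 if anything changed."""
    alpha = np.zeros(len(y))
    examine_all, num_changed, it = True, 0, 0
    while (num_changed > 0 or examine_all) and it < max_iter:
        num_changed = 0
        indices = range(len(y)) if examine_all else np.where((alpha > 0) & (alpha < C))[0]
        for i in indices:
            num_changed += examine_example(i, alpha)
        if examine_all:
            examine_all = False            # next pass: preferred (non-bound) samples only
        elif num_changed == 0:
            examine_all = True             # nothing changed: fall back to a full pass
        it += 1
    return alpha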
In addition, the update is subject to the second constraint: Σ_i α_i·y_i = 0.
Therefore, if the two selected multipliers are α1 and α2, with values α1_old and α2_old before the update and α1_new and α2_new after the update, then the values before and after the update must satisfy the following equation for the constraint Σ α_i·y_i = 0 to hold: α1_new·y1 + α2_new·y2 = α1_old·y1 + α2_old·y2 = ζ
where ζ is a constant.
It is not convenient to solve for the two multipliers at the same time, so we first solve for the second multiplier α2 (obtaining α2_new), and then express the solution of α1 (α1_new) in terms of it.
The first step: the value range of α2_new
In order to solve for α2_new, its range of values must be determined first, because the multipliers have to satisfy the condition 0 ≤ α_i ≤ C. Therefore upper and lower bounds have to be computed from the multipliers α1 and α2. Denoting the upper and lower bounds by H and L respectively, we have: L ≤ α2_new ≤ H
Next, combining the two constraints, the range of values of α2_new is obtained.
Depending on whether y1 and y2 have the same sign, the upper and lower bounds are: if y1 ≠ y2, L = max(0, α2_old - α1_old) and H = min(C, C + α2_old - α1_old); if y1 = y2, L = max(0, α2_old + α1_old - C) and H = min(C, α2_old + α1_old).
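Sketched in code (the function and variable names are assumptions; the formulas are the standard SMO bounds just stated):

def compute_bounds(alpha1, alpha2, y1, y2, C):
    """Bounds L <= alpha2_new <= H implied by 0 <= alpha <= C and y1*alpha1 + y2*alpha2 = const."""
    if y1 != y2:                    # alpha1 - alpha2 stays constant
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:                           # alpha1 + alpha2 stays constant
        L = max(0.0, alpha2 + alpha1 - C)
        H = min(C, alpha2 + alpha1)
    return L, H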
The second step: solve for α2_new
Using y1·α1 + y2·α2 = ζ to eliminate α1, a convex quadratic programming problem in the single variable α2 is obtained, and its analytic solution (boundary conditions not yet considered) is: α2_new,unclipped = α2_old + y2·(E1 - E2)/η, where η = K11 + K22 - 2·K12
(E_i = u_i - y_i denotes the difference between the predicted value and the real value; if η is not positive, this pair is abandoned and the loop moves on.)
Then, taking the constraint L ≤ α2_new ≤ H into account, the final solution is: α2_new = H if α2_new,unclipped > H; α2_new = α2_new,unclipped if L ≤ α2_new,unclipped ≤ H; α2_new = L if α2_new,unclipped < L.
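The corresponding update step, sketched with assumed names (K is the Gram/kernel matrix, i1 and i2 the indices of the selected pair):

import numpy as np

def update_alpha2(alpha2, y2, E1, E2, K, i1, i2, L, H):
    """Unconstrained minimizer along the constraint line, then clipped to [L, H].
       eta = K11 + K22 - 2*K12; if eta is not positive the pair is skipped here,
       as noted in the text."""
    eta = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
    if eta <= 0:
        return alpha2, False                      # abandon this pair
    alpha2_new = alpha2 + y2 * (E1 - E2) / eta    # analytic single-variable solution
    alpha2_new = np.clip(alpha2_new, L, H)        # apply the box constraint 0 <= alpha <= C
    return alpha2_new, True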
The third step: solve for α1_new
From the equality constraint it follows that (no clipping against the boundary conditions is needed here): α1_new = α1_old + y1·y2·(α2_old - α2_new)
The fourth step: update b
b is updated according to the following cases: if 0 < α1_new < C, take b = b1; otherwise, if 0 < α2_new < C, take b = b2; otherwise take b = (b1 + b2)/2, where b1 = b_old - E1 - y1(α1_new - α1_old)K11 - y2(α2_new - α2_old)K12 and b2 = b_old - E2 - y1(α1_new - α1_old)K12 - y2(α2_new - α2_old)K22. (Note: a question arises here: after clipping, α2_new necessarily lies within its bounds, so (a) can the third case still occur, e.g. only when both multipliers end up exactly at 0 or C? (b) are the first and second cases deliberately ordered so that b1 has higher priority?)
After each optimization update of the pair of multipliers α1, α2, b must be recomputed, together with the corresponding E_i values.
In addition, the same updates can be written with the indices i and j: one first obtains the unclipped solution for α_j, then applies the constraint 0 ≤ α_j ≤ C to clip it, and finally computes α_i from it. The difference is that α_i is not clipped while α_j is, and the update of b changes accordingly; this is consistent in substance with the approach described above.
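Sketched with assumed names (the b1/b2 formulas are the standard SMO threshold updates referred to above):

def update_alpha1_and_b(alpha, b, i1, i2, alpha2_new, y, E1, E2, K, C):
    """alpha1 follows from the equality constraint; b is chosen from b1/b2 as in the text."""
    a1_old, a2_old = alpha[i1], alpha[i2]
    a1_new = a1_old + y[i1] * y[i2] * (a2_old - alpha2_new)   # alpha1 is not clipped here

    b1 = b - E1 - y[i1] * (a1_new - a1_old) * K[i1, i1] - y[i2] * (alpha2_new - a2_old) * K[i1, i2]
    b2 = b - E2 - y[i1] * (a1_new - a1_old) * K[i1, i2] - y[i2] * (alpha2_new - a2_old) * K[i2, i2]

    if 0 < a1_new < C:
        b_new = b1            # alpha1 strictly inside the box: b1 is exact
    elif 0 < alpha2_new < C:
        b_new = b2            # otherwise fall back to b2
    else:
        b_new = (b1 + b2) / 2.0

    alpha[i1], alpha[i2] = a1_new, alpha2_new
    return alpha, b_new       # after this, the cached errors E_i must also be recomputed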
The last step: obtain the classification function
After the final iteration completes, all α_i, y and b have been updated, and the classification function is obtained: f(x) = Σ_i α_i·y_i·⟨x_i, x⟩ + b
(iii) Kernel functions (kernel function)
The preceding discussion covers the linear case. For linearly inseparable situations, some means is needed to make the data points linearly separable in another dimension, which is not necessarily a dimension that can be displayed visually. This means is the kernel function.
As the article 'Machine Learning Algorithm (2)-Support vector Machine (SVM) basis' puts it: there are no two identical objects in the world; any two objects can be distinguished by adding dimensions to them, and when dimensions are added without limit, even to infinitely many, any two objects can certainly be separated. For example, two books may be the same in the two dimensions (color, content); we can then add the dimension "author", and if that is still not enough, "type", "age of the book", "owner", "place of purchase", and so on, and eventually the two books can surely be told apart.
In the linearly inseparable case, the support vector machine first completes the computation in the low-dimensional space, then maps the input space to a high-dimensional feature space through the kernel function, and constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating non-linear data that cannot be separated in the original plane. For example, the pile of data on the left cannot be separated in two-dimensional space, but after mapping to three-dimensional space on the right it can be:
The kernel function simplifies the inner-product computation in the mapped space; conveniently, the vector computations needed in SVM always appear in the form of inner products. Compared with the formula we wrote just now, the classification function becomes:
f(x) = Σ_i α_i·y_i·K(x_i, x) + b, or in the matrix notation above, f(x) = K(x, X) * (Alpha ⊙ y) + b
where the α are computed from the following objective function:
This solves the computation problem: the calculation is never carried out directly in the high-dimensional space, yet the result is equivalent.
In practical applications, different kernel functions and parameters are chosen according to the problem and the data. The choice of kernel is crucial for support vector machines, and the selection process is one of continual trial and optimization. Commonly used kernel functions are:
1, The polynomial kernel. For a polynomial kernel of degree d, the dimension of the mapped space is C(m+d, d), where m is the dimension of the original space.
2, The Gaussian kernel, which maps the original space to an infinite-dimensional space, is one of the most widely used kernel functions. Since this function resembles the Gaussian distribution, it is called the Gaussian kernel function, also called the radial basis function (Radial Basis Function, abbreviated RBF). By adjusting its parameter, the Gaussian kernel is actually quite flexible. The figure shows an example of low-dimensional, linearly inseparable data mapped to a high-dimensional space by the Gaussian kernel function:
3, The linear kernel K(x1, x2) = x1 * T(x2) is simply the inner product in the original space. It exists mainly to unify the inner-product computation and serve as a "placeholder": the code implementation then does not need to distinguish the linear and non-linear cases, and for linear classification it suffices to plug in this kernel function.
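A sketch of the three kernels in numpy (function names and the degree/sigma parameters are assumptions for illustration):

import numpy as np

def linear_kernel(x1, x2):
    """Plain inner product: identical to computing <x1, x2> in the original space."""
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=2, c=1.0):
    """(<x1, x2> + c)^degree: corresponds to a finite higher-dimensional feature space."""
    return (np.dot(x1, x2) + c) ** degree

def gaussian_kernel(x1, x2, sigma=1.0):
    """exp(-||x1 - x2||^2 / (2 sigma^2)): the RBF kernel, an infinite-dimensional mapping."""
    diff = x1 - x2
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), gaussian_kernel(x1, x2))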
(iv) Python implementation of an SVM classifier based on the SMO algorithm
The SVM classifier Python learning package consists of three .py files: svm.py, object_json.py and testsvm.py. svm.py implements the SVM classifier, and testsvm.py contains two test cases. Because the training process takes a long time, object_json.py provides custom JSON encoding functions to permanently save the classifier object; afterwards only the saved classifier file needs to be loaded, unless the classifier has to be updated.
The SVM classifier in the package defines two objects, Svmtrain and Svmclassifer. The former produces an SVM classifier from the training data via the SMO algorithm; the latter is the SVM classifier itself, containing the support vectors produced by Svmtrain, get/set functions for the support vector set, functions for permanently saving the classifier, and so on.
All source files and test files of the SVM classifier package:
TBD
(v) Applications of SVM classification
1, Handwriting recognition
Digits.rar in the SVM classifier package is a handwriting recognition test case; it is interesting to train the SVM classifier yourself and test the recognition performance.
2, Text classification
See 'Text Categorization and SVM'.
3, Introduction to multi-class classification
The basic SVM classifier solves the binary (2-class) classification problem. For the N-class case there are many approaches; here 1 vs (N-1) and 1 vs 1 are introduced. For more on SVM multi-class classification, refer to 'SVM Multi-Class classification method'.
With the former approach we need to train N classifiers, where the i-th classifier determines whether a new data point belongs to class i or to its complement (the N-1 classes other than i). With the latter approach we need to train N*(N-1)/2 classifiers; classifier (i, j) determines whether a point belongs to class i or class j. When an unknown sample is to be classified, every classifier determines its class and "votes" for the corresponding category, and the class with the most votes in the end is the class of the unknown sample. This approach is not only used with SVM but is also widely used in many other classifiers; Professor Lin (the author of LIBSVM) has concluded that 1 vs 1 performs better than 1 vs (N-1).
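A sketch of the 1 vs 1 voting scheme (train_binary and predict_binary are placeholders standing in for a binary SVM such as the package's Svmtrain/Svmclassifer objects):

from itertools import combinations
import numpy as np

def train_one_vs_one(X, y, train_binary):
    """Train one binary classifier per class pair (ci, cj), using only the samples of those two classes."""
    classifiers = {}
    for ci, cj in combinations(np.unique(y), 2):
        mask = (y == ci) | (y == cj)
        y_binary = np.where(y[mask] == ci, 1, -1)
        classifiers[(ci, cj)] = train_binary(X[mask], y_binary)
    return classifiers

def predict_one_vs_one(x, classifiers, predict_binary):
    """Each pairwise classifier votes for one of its two classes; the class with the most votes wins."""
    votes = {}
    for (ci, cj), clf in classifiers.items():
        winner = ci if predict_binary(clf, x) > 0 else cj
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)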
More detailed SVM theory derivations and application introductions:
Support Vector Machine series
A popular introduction to support vector machines (understanding the three-tier realm of SVM)
Algorithms in Machine learning (2)-Support vector Machine (SVM) basics
SVM multi-Class classification method
Text Categorization and SVM
A detailed study of machine learning algorithms and python implementation--a SVM classifier based on SMO