Chapter 7 Support Vector Machines
The support vector machine (SVM) is a binary classification model. Its basic form is a linear classifier defined by the maximum margin in feature space; with the kernel trick it becomes, in effect, a nonlinear classifier. The learning strategy of the support vector machine is margin maximization, which can be formulated as a convex quadratic programming problem and is also equivalent to minimizing a regularized hinge loss function. The learning algorithm of the support vector machine is therefore an optimization algorithm for convex quadratic programming.
Support vector machine learning covers models of increasing generality: the linearly separable support vector machine (linearly separable case), the linear support vector machine, and the nonlinear support vector machine. The corresponding learning methods are hard margin maximization, soft margin maximization, and the kernel trick. By using a kernel function we can learn a nonlinear support vector machine, which is equivalent to implicitly learning a linear support vector machine in a high-dimensional feature space. Such a method is called the kernel trick.
7.1 Linearly separable support vector machine and hard margin maximization
Definition 7.1 (linearly separable support vector machine): Given a linearly separable training data set, the separating hyperplane obtained by maximizing the margin, or equivalently by solving the corresponding convex quadratic programming problem,
$$w^* \cdot x + b^* = 0$$
and the corresponding classification decision function
$$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$$
are called the linearly separable support vector machine.
Functional margin and geometric margin
Definition 7.2 (functional margin): For a given training data set T and hyperplane (w, b), define the functional margin of the hyperplane with respect to the sample point $(x_i, y_i)$ as
$$\hat{\gamma}_i = y_i (w \cdot x_i + b)$$
Define the functional margin of the hyperplane (w, b) with respect to the training data set T as the minimum of the functional margins of (w, b) over all sample points $(x_i, y_i)$ in T, i.e.
$$\hat{\gamma} = \min_{i=1,\dots,N} \hat{\gamma}_i$$
The functional margin can indicate both the correctness and the confidence of a classification prediction. However, if w and b are scaled proportionally, say to 2w and 2b, the hyperplane does not change while the functional margin doubles.
We can impose a constraint on the normal vector w of the separating hyperplane, such as the normalization $\|w\| = 1$, so that the margin is determined. The functional margin then becomes the geometric margin.
Definition 7.3 (geometric margin): For a given training data set T and hyperplane (w, b), define the geometric margin of the hyperplane with respect to the sample point $(x_i, y_i)$ as
$$\gamma_i = y_i\left(\frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|}\right)$$
Define the geometric margin of the hyperplane (w, b) with respect to the training data set T as the minimum of the geometric margins of (w, b) over all sample points $(x_i, y_i)$ in T, i.e.
$$\gamma = \min_{i=1,\dots,N} \gamma_i$$
The relationship between the functional margin and the geometric margin:
$$\gamma_i = \frac{\hat{\gamma}_i}{\|w\|}, \qquad \gamma = \frac{\hat{\gamma}}{\|w\|}$$
If the hyperplane parameters w and b are scaled proportionally (so the hyperplane itself does not change), the functional margin changes by the same factor, while the geometric margin is unchanged.
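As a small numerical sketch of these two definitions (the data points and the candidate hyperplane (w, b) below are chosen only for illustration):

```python
import numpy as np

# Toy data: rows are x_i, labels y_i in {+1, -1}; (w, b) is an arbitrary candidate hyperplane.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = np.array([0.5, 0.5]), -2.0

functional_margins = y * (X @ w + b)                        # hat{gamma}_i = y_i (w . x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)  # gamma_i = hat{gamma}_i / ||w||

print(functional_margins.min())  # functional margin of the hyperplane w.r.t. the data set
print(geometric_margins.min())   # geometric margin of the hyperplane w.r.t. the data set
```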
Margin maximization
The basic idea of support vector machine learning is to find the separating hyperplane that correctly classifies the training data set and has the largest geometric margin. For a linearly separable training data set there are infinitely many separating hyperplanes (any of which could be found by the perceptron), but the separating hyperplane with the largest geometric margin is unique. Margin maximization here is also called hard margin maximization.
The intuitive interpretation of margin maximization is that the hyperplane with the largest geometric margin classifies the training data with sufficient confidence: not only are the positive and negative instance points separated, but even the hardest instance points (those closest to the hyperplane) are separated with sufficient confidence.
The learning algorithm of the linearly separable support vector machine is the maximum margin method.
Theorem 7.1 (existence and uniqueness of the maximum-margin separating hyperplane): If the training data set T is linearly separable, the maximum-margin separating hyperplane that completely separates the sample points of the training data set exists and is unique.
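The hard margin maximization problem, $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) \ge 1$, is a convex quadratic program, so it can be handed to a generic solver. A minimal sketch, assuming the cvxpy package and the same kind of toy data as above:

```python
import cvxpy as cp
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # parameters of the maximum-margin separating hyperplane
```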
Support vectors and margin boundaries
In the linearly separable case, the instances of the training data set that are closest to the separating hyperplane are called support vectors. The support vectors are the points at which the constraint holds with equality, i.e.
$$y_i (w \cdot x_i + b) - 1 = 0$$
For positive example points with $y_i = +1$, the support vectors lie on the hyperplane
$$H_1: \ w \cdot x + b = 1$$
For negative example points with $y_i = -1$, the support vectors lie on the hyperplane
$$H_2: \ w \cdot x + b = -1$$
The points on H1 and H2 are the support vectors.
The distance between H1 and H2 is called the margin. The margin depends on the normal vector w of the separating hyperplane and equals $2/\|w\|$. H1 and H2 are called the margin boundaries.
Only the support vectors play a role in determining the separating hyperplane; the other instance points do not. Moving a support vector changes the solution, but moving or even removing the other instance points leaves the solution unchanged. Because the support vectors play the decisive role in determining the separating hyperplane, this classification model is called the support vector machine. The number of support vectors is usually very small, so the support vector machine is determined by very few "important" training samples.
The dual algorithm of learning (dual algorithm)
Construct the Lagrange function by introducing Lagrange multipliers $\alpha_i \ge 0$:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i$$
By Lagrange duality, the dual problem of the primal problem is the max-min problem of the Lagrange function, $\max_{\alpha}\min_{w,b} L(w, b, \alpha)$.
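For reference, eliminating w and b from this max-min problem gives the standard dual convex quadratic program in the multipliers $\alpha = (\alpha_1, \dots, \alpha_N)^T$:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N}\alpha_i \quad \text{s.t. } \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ \alpha_i \ge 0,\ i = 1, 2, \dots, N.$$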
Theorem 7.2: Suppose $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)^T$ is a solution of the dual optimization problem. Then there exists a subscript j such that $\alpha_j^* > 0$, and the solution of the primal optimization problem can be obtained as
$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)$$
Algorithm 7.2 (learning algorithm of the linearly separable support vector machine)
Definition 7.4 (support vector): An instance $x_i$ of a sample point $(x_i, y_i)$ of the training data set whose corresponding $\alpha_i^* > 0$ is called a support vector. Support vectors must lie on the margin boundary.
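As a brief practical illustration (not the book's algorithm), the support vectors and dual coefficients of a trained linear SVM can be inspected with scikit-learn's SVC; a large C approximates the hard-margin, linearly separable case:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

print(clf.support_vectors_)        # sample points x_i with alpha_i* > 0
print(clf.dual_coef_)              # y_i * alpha_i* for the support vectors
print(clf.coef_, clf.intercept_)   # w* and b* recovered from the dual solution
```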
7.2 Linear support vector machine and soft margin maximization
For linearly non-separable training data:
Linear non-separability means that some sample points cannot satisfy the constraint that the functional margin is greater than or equal to 1. To solve this problem, a slack variable $\xi_i \ge 0$ is introduced for each sample point, and the constraint becomes
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i$$
The learning problem of the linear support vector machine for linearly non-separable data then becomes the following convex quadratic programming problem (the primal problem):
$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ i = 1, 2, \dots, N$$
Definition 7.5 (linear support vector machine): For a given linearly non-separable training data set, solve the convex quadratic programming problem above, i.e. the soft margin maximization problem. The resulting separating hyperplane
$$w^* \cdot x + b^* = 0$$
and the corresponding classification decision function
$$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$$
are called the linear support vector machine.
The dual algorithm of learning (dual algorithm): construct the Lagrange function and, by solving the dual problem, we obtain the dual convex quadratic program
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N}\alpha_i \quad \text{s.t. } \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ i = 1, 2, \dots, N$$
Theorem 7.3: Suppose $\alpha^* = (\alpha_1^*, \dots, \alpha_N^*)^T$ is a solution of the dual optimization problem. If there exists a component $\alpha_j^*$ with $0 < \alpha_j^* < C$, the solution of the primal optimization problem can be obtained as
$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^{N} y_i \alpha_i^* (x_i \cdot x_j)$$
Algorithm 7.3 (linear support vector machine learning algorithm)
Support vectors
In the linearly non-separable case, an instance $x_i$ of a sample point $(x_i, y_i)$ whose corresponding component $\alpha_i^* > 0$ in the solution of the dual problem is called a (soft margin) support vector.
A soft-margin support vector $x_i$ lies either on the margin boundary, or between the margin boundary and the separating hyperplane, or on the misclassified side of the separating hyperplane:
If $\alpha_i^* < C$, then $\xi_i = 0$ and the support vector $x_i$ falls exactly on the margin boundary;
If $\alpha_i^* = C$ and $0 < \xi_i < 1$, the classification is correct and $x_i$ lies between the margin boundary and the separating hyperplane;
If $\alpha_i^* = C$ and $\xi_i = 1$, then $x_i$ lies on the separating hyperplane;
If $\alpha_i^* = C$ and $\xi_i > 1$, then $x_i$ lies on the misclassified side of the separating hyperplane.
Hinge Loss function
There is another interpretation of linear support vector machine learning: it minimizes the following objective function
$$\sum_{i=1}^{N}\left[1 - y_i (w \cdot x_i + b)\right]_{+} + \lambda \|w\|^2$$
The first term of the objective function is the empirical loss or empirical risk; the function
$$L(y(w \cdot x + b)) = \left[1 - y (w \cdot x + b)\right]_{+}$$
is called the hinge loss function.
Theorem 7.4: The primal optimization problem of the linear support vector machine is equivalent to the optimization problem
$$\min_{w, b}\ \sum_{i=1}^{N}\left[1 - y_i (w \cdot x_i + b)\right]_{+} + \lambda \|w\|^2$$
The hinge loss is 0 only when a point is not just classified correctly but classified with sufficiently high confidence (functional margin at least 1).
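A minimal numpy sketch of this equivalent objective; the value of the regularization coefficient and the toy data are chosen only for illustration:

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Sum of hinge losses [1 - y_i (w . x_i + b)]_+ plus lam * ||w||^2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero only when the functional margin >= 1
    return hinge.sum() + lam * np.dot(w, w)

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(hinge_objective(np.array([0.5, 0.5]), -2.0, X, y, lam=0.1))
```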
7.3 Nonlinear support vector machines and kernel functions
The kernel trick
Nonlinear classification problem: if the positive and negative examples can be correctly separated by a hypersurface in $R^n$, the problem is called a nonlinearly separable problem.
The solution approach: apply a nonlinear transformation that turns the nonlinear problem into a linear problem.
The basic idea of applying the kernel trick to support vector machines is to use a nonlinear transformation to map the input space (the Euclidean space $R^n$ or a discrete set) to a feature space (a Hilbert space H), so that the hypersurface model in the input space $R^n$ corresponds to a hyperplane model (a support vector machine) in the feature space H.
Definition 7.6 (kernel function): Let X be the input space and H the feature space. If there exists a mapping
$$\phi(x): X \to H$$
such that for all x, z belonging to X the function K(x, z) satisfies
$$K(x, z) = \phi(x) \cdot \phi(z)$$
then K(x, z) is called a kernel function and $\phi(x)$ the mapping function.
The idea of the kernel trick is to use only the kernel function K(x, z) in learning and prediction, without defining the mapping function explicitly. For a given kernel K(x, z), the feature space H and the mapping function are not unique: one can take different feature spaces, and even within the same feature space one can take different mappings.
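A small sketch of this idea for the kernel $K(x, z) = (x \cdot z)^2$ on $R^2$: its value equals the ordinary inner product after one possible explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (other feature maps realize the same kernel):

```python
import numpy as np

def phi(x):
    # One explicit feature map for K(x, z) = (x . z)^2 on R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = np.dot(x, z) ** 2       # kernel evaluated without any feature map
k_mapped = np.dot(phi(x), phi(z))  # inner product in the feature space

print(k_direct, k_mapped)  # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```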
The application of the kernel trick in support vector machines
The inner product $(x_i \cdot x_j)$ in the objective function of the dual problem can be replaced by the kernel function $K(x_i, x_j)$:
$$W(\alpha) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i$$
The inner products in the classification decision function can also be replaced by the kernel function, so it becomes
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x) + b^*\right)$$
This is equivalent to transforming the original input space into a new feature space via the mapping function, transforming the inner product $(x_i \cdot x_j)$ in the input space into an inner product in the feature space, and then learning a linear support vector machine from the training samples in the new feature space. When the mapping function is nonlinear, the support vector machine learned with the kernel function is a nonlinear classification model.
Given a kernel function, a support vector machine for a nonlinear classification problem can be solved by the methods for linear classification problems. The learning is done implicitly in the feature space, without explicitly defining the feature space and the mapping function. This technique is called the kernel trick.
Necessary and sufficient conditions for a positive definite kernel (positive definite kernel function)
Theorem 7.5 (necessary and sufficient condition for a positive definite kernel): Let $K: X \times X \to R$ be a symmetric function. Then K(x, z) is a positive definite kernel if and only if, for any $x_i$ belonging to X, $i = 1, 2, \dots, m$, the Gram matrix corresponding to K(x, z),
$$K = [K(x_i, x_j)]_{m \times m},$$
is positive semidefinite.
Definition 7.7 (equivalent definition of a positive definite kernel):
Let X be a subset of $R^n$ and K(x, z) a symmetric function defined on $X \times X$. If, for any $x_i$ belonging to X, $i = 1, 2, \dots, m$, the Gram matrix corresponding to K(x, z),
$$K = [K(x_i, x_j)]_{m \times m},$$
is positive semidefinite, then K(x, z) is a positive definite kernel.
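A small numerical sketch of this criterion: build the Gram matrix of a candidate kernel on a handful of points and check that its eigenvalues are nonnegative. This verifies semidefiniteness only on those particular points; it is a sanity check, not a proof that the kernel is positive definite:

```python
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix K_ij = kernel(x_i, x_j) for a finite set of points."""
    return np.array([[kernel(a, b) for b in points] for a in points])

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)   # a Gaussian kernel as the candidate
pts = [np.array(p) for p in ([0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5])]

eigvals = np.linalg.eigvalsh(gram_matrix(rbf, pts))
print(eigvals)                    # all >= 0 (up to rounding) for a positive definite kernel
print(np.all(eigvals >= -1e-10))  # semidefiniteness check on these points
```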
Common kernel functions
(1) Polynomial kernel function
$$K(x, z) = (x \cdot z + 1)^p$$
The corresponding support vector machine is a polynomial classifier of degree p. In this case the classification decision function becomes
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x + 1)^p + b^*\right)$$
(2) Gaussian kernel function
$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$$
The corresponding support vector machine is a Gaussian radial basis function classifier. In this case the classification decision function becomes
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i^* y_i \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) + b^*\right)$$
(3) String kernel function
Kernel functions can be defined not only on Euclidean spaces but also on sets of discrete data. For example, a string kernel is a kernel function defined on a set of strings.
The string kernel function on two strings s and t is defined through the mapping $\phi_n$ as the inner product in the feature space:
$$k_n(s, t) = \sum_{u \in \Sigma^n} [\phi_n(s)]_u [\phi_n(t)]_u$$
The string kernel function $k_n(s, t)$ gives the cosine similarity of the feature vectors formed by all substrings of length n of the strings s and t. Intuitively, the more substrings two strings share, the more similar they are and the larger the value of the string kernel. String kernel functions can be computed efficiently by dynamic programming.
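As a rough illustration only, the sketch below implements the simplest contiguous-substring ("spectrum") variant of a string kernel; the gap-weighted subsequence kernel described above additionally weights gapped occurrences by a decay factor and is normally computed by a dedicated dynamic program:

```python
from collections import Counter

def spectrum_kernel(s, t, n):
    """Count-based inner product over all contiguous length-n substrings of s and t."""
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

# 2: the shared 3-grams "tat" and "ati" each contribute 1 * 1.
print(spectrum_kernel("statistics", "computation", 3))
```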
Nonlinear support vector classification machine
Definition 7.8 (nonlinear support vector machine): From a nonlinear classification training set, through the kernel function and soft margin maximization, or equivalently convex quadratic programming, learn the classification decision function
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^*\right)$$
which is called the nonlinear support vector machine, where K(x, z) is a positive definite kernel function.
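A minimal sketch of such a classifier on data that no linear separator can handle (an XOR-style layout), assuming scikit-learn's SVC with the Gaussian (RBF) kernel as the positive definite kernel:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the input space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.predict(X))   # recovers the labels [ 1  1 -1 -1]
print(clf.n_support_)   # number of support vectors per class
```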
7.4 Sequential minimal optimization (SMO) algorithm
The SMO algorithm solves the following convex quadratic programming dual problem:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i \quad \text{s.t. } \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C,\ i = 1, 2, \dots, N$$
The SMO algorithm is a heuristic algorithm. The basic idea is: if all variables of the current solution satisfy the KKT conditions (Karush-Kuhn-Tucker conditions) of the optimization problem, then the solution of the optimization problem has been found, because the KKT conditions are necessary and sufficient for this problem. Otherwise, choose two variables, fix the others, and build a quadratic programming subproblem in those two variables. The solution of this subproblem should be closer to the solution of the original quadratic programming problem, since it makes the objective value of the original problem smaller. Importantly, the subproblem can be solved analytically, which greatly speeds up the whole algorithm. The subproblem has two variables: one is the variable that violates the KKT conditions most severely, and the other is determined automatically by the equality constraint. In this way, the SMO algorithm decomposes the original problem into subproblems and solves them one by one, thereby solving the original problem.
The whole SMO algorithm consists of two parts: the analytic method for solving the two-variable quadratic programming subproblem, and the heuristic method for choosing the variables.
Analytic solution of the two-variable quadratic programming subproblem
Without loss of generality, suppose the two chosen variables are $\alpha_1, \alpha_2$ and the other variables $\alpha_i$ (i = 3, 4, ..., N) are fixed. The SMO subproblem can then be written as an optimization problem in $\alpha_1$ and $\alpha_2$ alone.
Suppose the initial feasible solution is $\alpha_1^{old}, \alpha_2^{old}$ and the optimal solution of the subproblem is $\alpha_1^{new}, \alpha_2^{new}$, which must satisfy
$$L \le \alpha_2^{new} \le H$$
where L and H are the bounds of the diagonal line segment on which $\alpha_2^{new}$ lies.
If $y_1 \ne y_2$: $L = \max(0, \alpha_2^{old} - \alpha_1^{old})$, $H = \min(C, C + \alpha_2^{old} - \alpha_1^{old})$.
If $y_1 = y_2$: $L = \max(0, \alpha_2^{old} + \alpha_1^{old} - C)$, $H = \min(C, \alpha_2^{old} + \alpha_1^{old})$.
From the equality constraint, $\alpha_1$ can be expressed in terms of $\alpha_2$, so the subproblem reduces to a single-variable optimization problem in $\alpha_2$.
Introduce the notation
$$g(x) = \sum_{j=1}^{N}\alpha_j y_j K(x_j, x) + b, \qquad E_i = g(x_i) - y_i, \quad i = 1, 2$$
where $E_i$ is the difference between the prediction $g(x_i)$ and the true label $y_i$.
Substituting into the optimization problem and setting the derivative with respect to $\alpha_2$ to zero yields the following result.
Theorem 7.6: The unclipped solution of the subproblem along the constraint direction is
$$\alpha_2^{new,unc} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta}, \qquad \eta = K_{11} + K_{22} - 2K_{12}$$
The clipped solution is
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unc} > H \\ \alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H \\ L, & \alpha_2^{new,unc} < L \end{cases}$$
and then $\alpha_1^{new} = \alpha_1^{old} + y_1 y_2 (\alpha_2^{old} - \alpha_2^{new})$.
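A minimal sketch of this analytic two-variable update, assuming $E_1$, $E_2$, the kernel values, and the old values of the two multipliers are already available (all names here are illustrative, not from any particular library):

```python
def smo_two_variable_update(a1_old, a2_old, y1, y2, E1, E2, K11, K22, K12, C):
    """One analytic SMO step: update (alpha1, alpha2) with all other variables fixed."""
    # Bounds of the feasible segment for alpha2.
    if y1 != y2:
        L, H = max(0.0, a2_old - a1_old), min(C, C + a2_old - a1_old)
    else:
        L, H = max(0.0, a1_old + a2_old - C), min(C, a1_old + a2_old)

    eta = K11 + K22 - 2.0 * K12                      # assumed > 0 here
    a2_unclipped = a2_old + y2 * (E1 - E2) / eta
    a2_new = min(H, max(L, a2_unclipped))            # clip to [L, H]
    a1_new = a1_old + y1 * y2 * (a2_old - a2_new)    # keep sum(alpha_i * y_i) unchanged
    return a1_new, a2_new
```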
How to choose the two variables
Selection of the first variable
SMO calls the process of selecting the first variable the outer loop. The outer loop chooses the sample point in the training set that violates the KKT conditions most severely, and takes its corresponding variable as the first variable. Specifically, it checks whether the training sample points $(x_i, y_i)$ satisfy the KKT conditions, i.e.
$$\alpha_i = 0 \Leftrightarrow y_i g(x_i) \ge 1, \qquad 0 < \alpha_i < C \Leftrightarrow y_i g(x_i) = 1, \qquad \alpha_i = C \Leftrightarrow y_i g(x_i) \le 1$$
The check is carried out within a tolerance $\varepsilon$. During the check, the outer loop first traverses all sample points with $0 < \alpha_i < C$, i.e. the support vector points on the margin boundary, to verify whether they satisfy the KKT conditions. If all of these points satisfy the KKT conditions, it then traverses the whole training set to check whether the remaining points satisfy them.
Selection of the second variable
SMO calls the process of selecting the second variable the inner loop. Suppose the first variable $\alpha_1$ has been found in the outer loop; the inner loop then looks for the second variable $\alpha_2$. The selection criterion is that $\alpha_2$ should change sufficiently.
By theorem 7.6, the change in $\alpha_2$ depends on $|E_1 - E_2|$. To speed up computation, a simple approach is to choose the $\alpha_2$ whose corresponding $|E_1 - E_2|$ is largest. Since $\alpha_1$ has been fixed, $E_1$ is also determined: if $E_1$ is positive, choose the smallest $E_i$ as $E_2$; if $E_1$ is negative, choose the largest $E_i$ as $E_2$. In special cases, if the $\alpha_2$ chosen this way does not make the objective function decrease sufficiently, the following heuristic is used to keep selecting $\alpha_2$: traverse the support vector points on the margin boundary, trying each corresponding variable as $\alpha_2$ in turn, until the objective function decreases sufficiently. If no suitable $\alpha_2$ is found, traverse the whole training data set; if a suitable $\alpha_2$ is still not found, discard the first variable $\alpha_1$ and return to the outer loop to look for another $\alpha_1$.
Computing the threshold b and the differences $E_i$
After each subproblem is solved, the threshold b is recomputed from the KKT conditions of the selected variables.
If $0 < \alpha_1^{new} < C$, then
$$b_1^{new} = -E_1 - y_1 K_{11}(\alpha_1^{new} - \alpha_1^{old}) - y_2 K_{21}(\alpha_2^{new} - \alpha_2^{old}) + b^{old}$$
Similarly, if $0 < \alpha_2^{new} < C$, then
$$b_2^{new} = -E_2 - y_1 K_{12}(\alpha_1^{new} - \alpha_1^{old}) - y_2 K_{22}(\alpha_2^{new} - \alpha_2^{old}) + b^{old}$$
If $\alpha_1^{new}$ and $\alpha_2^{new}$ both satisfy $0 < \alpha_i^{new} < C$ at the same time, then $b_1^{new} = b_2^{new}$. If $\alpha_1^{new}$ and $\alpha_2^{new}$ are both 0 or C, then $b_1^{new}$, $b_2^{new}$, and every value between them are thresholds satisfying the KKT conditions; in this case their midpoint is taken as $b^{new}$.
After each step, the $E_i$ values are updated using $b^{new}$ and the $\alpha_j$ corresponding to all support vectors:
$$E_i^{new} = \sum_{j \in S} y_j \alpha_j K(x_j, x_i) + b^{new} - y_i$$
where S is the set of all support vectors $x_j$.
Statistical Learning Methods, Hang Li, Chapter 7: Support Vector Machines