Machine Learning Techniques -- Lectures 1-2: Linear Support Vector Machine


The machine learning techniques posts in this column are my personal study notes on the Coursera course Machine Learning Techniques (2015). All of the content comes from Professor Hsuan-tien Lin's lectures for that course. (https://class.coursera.org/ntumltwo-001/lecture)

Lecture 1 ------- Linear Support Vector Machine

Starting from the basic tools introduced in Machine Learning Foundations (mainly centered on feature transforms), this course extends them into more complex and practical models. First, asking how to exploit the existing features better while controlling complexity gives rise to the SVM (support vector machine) model. Second, asking how to construct or combine predictive features so that the whole model performs better gives rise to the AdaBoost (boosting) model. Finally, asking how to learn hidden features is the idea that stimulated the early neural networks and later developed into deep learning models.

One, the maximum-margin separating hyperplane

For linearly separable training data such as that shown in the figure, the perceptron can return any of the three separating lines (left/middle/right), and PLA cannot tell which of the three is better. Intuitively the rightmost one is best, because it tolerates the most noise. From the data's point of view, measurement error or collection conditions mean that a test point which should coincide with a training point may still deviate from it; the circular region around each point in the figure represents this noise, and a larger region means more noise can be tolerated. From the line's point of view, the distance from the hyperplane to the nearest positive and negative samples measures the robustness of the hyperplane, and the rightmost line achieves the maximum margin.
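For reference (my own addition, not on the slide), the distance from a point x to the hyperplane w^T x + b = 0 that underlies this margin picture is

\[
\operatorname{dist}(x, b, w) = \frac{\lvert w^{T} x + b \rvert}{\lVert w \rVert},
\]

which for a correctly classified point equals \(\frac{1}{\lVert w \rVert}\, y\,(w^{T} x + b)\).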

Two, formalizing the problem

To obtain the maximum-margin hyperplane described in the previous section, the task can be formalized as the following optimization problem, where margin(b, w) is the minimum distance from the sample points to the hyperplane w^T x + b = 0, and the objective is to maximize that minimum distance.
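The slide showing this problem is not reproduced here; reconstructed in the course's notation it reads roughly

\[
\max_{b, w}\ \text{margin}(b, w)
\quad \text{s.t.}\ \ y_n (w^{T} x_n + b) > 0 \ \text{for all } n,
\qquad
\text{margin}(b, w) = \min_{n = 1, \dots, N} \frac{1}{\lVert w \rVert}\, y_n (w^{T} x_n + b).
\]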

To simplify this further, note that multiplying w and b by the same positive factor does not change the hyperplane, so we can always find a scaled pair w* and b* such that min_n y_n (w*^T x_n + b*) = 1, and the objective reduces to maximizing 1/||w||.
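With that scaling fixed (again my reconstruction of the missing slide), the problem becomes

\[
\max_{b, w}\ \frac{1}{\lVert w \rVert}
\quad \text{s.t.}\ \ \min_{n} y_n (w^{T} x_n + b) = 1.
\]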

Further, to simplify the minimization inside the constraint, we can show that it may be relaxed to an inequality. Suppose the optimal solution of the relaxed problem had min_n y_n (w^T x_n + b) strictly greater than 1 rather than equal to 1; then we could shrink w and b by the same factor and obtain a strictly better objective, a contradiction. So at the optimum the constraint is tight, the equality constraint is equivalent to the inequality constraint, and the problem finally simplifies to the standard optimization problem below.
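Written out (my reconstruction of the final slide), that standard problem, the linear hard-margin SVM primal, is

\[
\min_{b, w}\ \frac{1}{2} w^{T} w
\quad \text{s.t.}\ \ y_n (w^{T} x_n + b) \ge 1,\ \ n = 1, \dots, N.
\]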

Three, support vector machine

SVM is the method that finds the maximum-margin separating hyperplane by solving the optimization problem above. How do we solve that problem? Fortunately, its form matches quadratic programming (an advanced version of linear programming), so as long as we express the optimization problem in the standard quadratic-programming form we can solve it with any QP solver. The general form of a quadratic program is as follows.
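Reconstructed in the course's notation (the figure itself is missing), the general QP form is

\[
u^{*} = \arg\min_{u}\ \frac{1}{2} u^{T} Q u + p^{T} u
\quad \text{s.t.}\ \ a_m^{T} u \ge c_m,\ \ m = 1, \dots, M.
\]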

So as long as we express our optimization problem in the general QP form above, we can read off the mapping for the parameters Q, p, A, c, as shown in the figure. The solution can then be computed with any tool that implements a QP solver.
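As a concrete illustration (my own sketch, not code from the course), here is roughly how the primal hard-margin QP can be handed to a generic solver; it assumes numpy and the cvxopt package, and stacks the variables as u = [b, w]:

    import numpy as np
    from cvxopt import matrix, solvers

    def linear_hard_margin_svm(X, y):
        """X: (N, d) inputs, y: (N,) labels in {+1, -1}. Returns (b, w)."""
        N, d = X.shape
        Q = np.zeros((d + 1, d + 1))
        Q[1:, 1:] = np.eye(d)          # objective is (1/2) w^T w; b carries no cost
        Q[0, 0] = 1e-9                 # tiny regularizer so the QP stays well-conditioned
        p = np.zeros(d + 1)
        # constraint y_n * ([1, x_n] . u) >= 1, flipped into cvxopt's "G u <= h" form
        G = -y[:, None] * np.hstack([np.ones((N, 1)), X])
        h = -np.ones(N)
        solvers.options['show_progress'] = False
        sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h))
        u = np.array(sol['x']).ravel()
        return u[0], u[1:]             # b, w

    # toy usage on two separable clusters
    X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    b, w = linear_hard_margin_svm(X, y)
    print("w =", w, "b =", b)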

Strictly speaking, the optimization problem here is called the linear hard-margin SVM algorithm. Hard-margin means the training data must be linearly separable; linear means the hyperplane we look for lives in the original space of x. For nonlinear problems, we can first apply a quadratic transform (or some other feature transform) to x and then solve in the transformed space.

Four, the theoretical basis of SVM

As mentioned before, the intuitive explanation is that SVM tolerates noise better. Now consider the angle of the VC bound, i.e. how many dichotomies the set of candidate hyperplanes can generate. For three training points, PLA can shatter them and produce all possible dichotomies (all 8 of them); but for SVM the margin is constrained, so it may not be able to produce every dichotomy (perhaps only 4 of them), because a minimum margin width is required.

A linear hard-margin SVM may thus fail to shatter some sets of 3 inputs, which means fewer dichotomies, a smaller VC dimension, and therefore a better generalization guarantee (E_in and E_out closer together). At the same time, combined with feature transforms, linear hard-margin SVM can still perform finer classification.

To sum up the solution of linear hard-margin SVM: when the training data are linearly separable, the original problem is transformed, by rescaling the hyperplane and other equivalent conversions, into a QP problem. Linearly separable data are hard to come by in practice, so the direct application value of linear hard-margin SVM is relatively limited. Moreover, when a feature transform is used, mapping the original data into another space can be computationally expensive if that space is large.

Lecture 2 ------- Dual Support Vector Machine

One, Lagrange multipliers

In the last lecture we covered the linear SVM model and described how to use QP to solve the maximum-margin hyperplane problem. In this lecture we convert the same problem into another form, so that it becomes easier to extend to a wide variety of applications.

Linear constraints are annoying and inconvenient for optimization. Is there a way to fold the linear constraints into the optimization objective itself, so that we can optimize freely without having to track them? The basic tool used here was explained in the Machine Learning Foundations course: the constraints are added into the objective function through Lagrange multipliers.

In the same way, we introduce Lagrange multipliers to transform the constrained optimization problem on the left into the form on the right, which seemingly has no constraint conditions at all.

Now comes the moment to witness a miracle: optimizing the unconstrained Lagrangian objective below yields exactly the same solution as the original SVM. In fact, a careful case analysis makes this easy to prove.

    • For an infeasible pair b and w, i.e. one that violates at least one constraint, some term 1 - y_n (w^T z_n + b) is strictly greater than 0; multiplying it by an α_n that is also greater than 0, the inner max over α drives the value to positive infinity.
    • For a feasible pair b and w, every constraint is satisfied, so every 1 - y_n (w^T z_n + b) is at most 0, and the inner max is attained by setting all of the corresponding α_n to 0.
    • Finally, the outer min discards the solutions that violate the constraints (their value is positive infinity) and, among all feasible solutions, selects the maximum-margin one as the final answer; the constraints are now encoded inside the max.

Finally, as the exercise points out, the Lagrangian is simply the quantity we want to minimize plus all of the constraints, each weighted by its multiplier.
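Written out (my reconstruction, in the course's notation with transformed inputs z_n), the Lagrangian and the claimed equivalence are

\[
\mathcal{L}(b, w, \alpha) = \frac{1}{2} w^{T} w + \sum_{n=1}^{N} \alpha_n \bigl( 1 - y_n (w^{T} z_n + b) \bigr),
\qquad
\text{SVM} \;\equiv\; \min_{b, w} \Bigl( \max_{\text{all } \alpha_n \ge 0} \mathcal{L}(b, w, \alpha) \Bigr).
\]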

Two, Lagrange dual form of SVM

The min(max(...)) form obtained from the conversion in the previous section is still not easy to solve. However, a series of transformations relates it to its dual form max(min(...)): the min of the max is always greater than or equal to the max of the min.
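In symbols, the relationship between the two forms is

\[
\min_{b, w} \max_{\text{all } \alpha_n \ge 0} \mathcal{L}(b, w, \alpha)
\;\ge\;
\max_{\text{all } \alpha_n \ge 0} \min_{b, w} \mathcal{L}(b, w, \alpha),
\]

where the problem on the right-hand side is the Lagrange dual.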

The form on the right is usually called the Lagrange dual problem. The greater-than-or-equal relationship tells us that solving the Lagrange dual yields a lower bound on the optimum of the original SVM problem: not necessarily the exact solution of the original problem, but at least an indication of how good that optimum can be.

In optimization, such a greater-than-or-equal relationship is called weak duality. In fact, for the quadratic program here, strong duality (equality) holds when the following conditions are met:

    • the original problem is convex
    • the original problem is feasible (here, the data are linearly separable)
    • the constraints are linear

Fortunately, the original SVM problem satisfies all three conditions, so strong duality holds. Let us therefore look in detail at how to solve the Lagrange dual of the original SVM.

The inner min is an unconstrained optimization problem, so at its optimum the derivative with respect to each variable must equal 0. First, differentiating with respect to b gives the result shown in the figure.

Substituting this result back into the problem eliminates the b term, giving the problem below. Next, differentiating with respect to w gives the following result.

In the same way, substituting that result into the objective function and simplifying yields the form below. Since the resulting expression no longer contains w or b, the min over them can be removed.

At this point the objective function depends only on α, and its form satisfies QP, so the optimal α can easily be obtained, and from it w.
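The slides with these intermediate results are not reproduced here; reconstructed in the course's notation, the two stationarity conditions and the resulting problem in α alone are

\[
\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0,
\qquad
\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{N} \alpha_n y_n z_n,
\]

\[
\max_{\alpha_n \ge 0,\ \sum_n \alpha_n y_n = 0}\;
-\frac{1}{2} \Bigl\lVert \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\rVert^{2} + \sum_{n=1}^{N} \alpha_n .
\]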

Now only one question remains: how do we compute b? It was eliminated during the earlier simplification, and the QP above has nothing to do with b. The KKT conditions help us solve this. If both the original problem and the dual problem have optimal solutions and their optimal values coincide, then the KKT conditions are satisfied; they are also sufficient conditions for optimality. First, primal feasibility: the solution must satisfy the constraints of the original problem. Second, dual feasibility: all α_n are at least 0. Third, dual inner optimality: the derivatives of the Lagrangian with respect to b and w are 0, i.e. the conditions derived above. Fourth, for every n, either the original constraint is tight or α_n is 0; that is, α_n (1 - y_n (w^T z_n + b)) = 0. This last point is the most important one and is usually called complementary slackness.

Therefore, by the complementary slackness part of the KKT conditions, for each n at most one of α_n and the constraint slack of the original problem can be nonzero: if one of them is not 0, the other must be 0. So if we find an α_n that is not 0, the corresponding equation involving w and b must hold with equality, and we can use that property to compute b; in practice, to reduce numerical error, b can be computed from every such point and the results averaged. It is also worth noting that the points with α_n > 0, i.e. the points that can be used to compute b, satisfy the definition of a support vector in the original problem, so these points are the support vectors. As for the KKT exercise that follows, a little analysis shows the correct answer is 4; the noteworthy part is how the second option is ruled out.
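To make the recipe for recovering w and b concrete, here is a small sketch (my own, not the course's code); it assumes the dual solution alpha, the (possibly feature-transformed) inputs Z, and the labels y are available as numpy arrays:

    import numpy as np

    def recover_w_b(alpha, Z, y, tol=1e-6):
        """Recover (w, b) and the support-vector indices from the dual solution."""
        w = (alpha * y) @ Z                  # w = sum_n alpha_n * y_n * z_n
        sv = alpha > tol                     # support vectors: alpha_n > 0
        # complementary slackness gives y_s (w^T z_s + b) = 1, and y_s is +1 or -1,
        # so b = y_s - w^T z_s; averaging over all support vectors reduces error
        b = float(np.mean(y[sv] - Z[sv] @ w))
        return w, b, np.where(sv)[0]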

Three, solving the dual form of SVM

From the dual problem derived in the previous section, we obtain the standard form of the SVM dual problem, as shown in the figure. There are N variables α_n in total; each α_n has one constraint of its own, and all of the α_n together satisfy one more constraint, so there are N variables and N+1 constraints. We went to all this trouble to transform the original SVM QP into this dual QP.
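Written out (my reconstruction of the figure, in the course's notation), the standard dual problem is

\[
\min_{\alpha}\ \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^{T} z_m - \sum_{n=1}^{N} \alpha_n
\quad \text{s.t.}\ \ \sum_{n=1}^{N} y_n \alpha_n = 0,\ \ \alpha_n \ge 0 \ \text{for } n = 1, \dots, N.
\]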

So how do we solve this QP problem? As mentioned before, we just hand the correct parameters Q, p, A, c to a QP solver, as shown here. It is worth noting that the standard QP form uses greater-than-or-equal conditions, while the condition on the α_n is an equality, so the equality can be split into two conditions, one greater-than-or-equal and one less-than-or-equal. In fact, many QP packages can represent equalities directly, or let you specify upper and lower bounds on a condition, so this complicated splitting of the equality may not be necessary.
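For concreteness, here is a rough sketch (mine, not the course's) of feeding this dual QP to cvxopt, which happens to accept the equality constraint directly through its A and b arguments:

    import numpy as np
    from cvxopt import matrix, solvers

    def dual_hard_margin_svm(Z, y):
        """Z: (N, d) (possibly feature-transformed) inputs, y: (N,) labels in {+1, -1}."""
        N = Z.shape[0]
        Yz = y[:, None] * Z
        Q = Yz @ Yz.T                        # Q[n, m] = y_n y_m z_n^T z_m
        Q += 1e-8 * np.eye(N)                # tiny ridge for numerical stability
        p = -np.ones(N)
        G, h = -np.eye(N), np.zeros(N)       # -alpha_n <= 0, i.e. alpha_n >= 0
        A, b = y.reshape(1, N).astype(float), np.zeros(1)   # equality y^T alpha = 0
        solvers.options['show_progress'] = False
        sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h), matrix(A), matrix(b))
        return np.array(sol['x']).ravel()    # the optimal alpha_n

The resulting α can then be passed to a helper such as recover_w_b above to obtain w, b, and the support vectors.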

It looks easy, but it is not. Q is an N x N matrix and it is not sparse; with N = 30,000 training examples, Q already needs more than 3 GB of storage. The dual therefore does not look as easy to solve as the original SVM problem, where Q has a very simple form with values only on the diagonal. So it is usually not practical to use a general-purpose QP program; one often needs a QP solver designed specifically for SVM, which on the one hand avoids storing the whole Q matrix, computing an element only when it is needed, and on the other hand exploits the special structure of SVM to speed up solving the QP problem.

Four, the philosophy behind the SVM dual problem

In general, after solving the SVM dual problem we call the points with α_n > 0 support vectors; these points must lie on the maximum-margin boundary. Points that lie on the maximum-margin boundary but have α_n = 0 are only called support vector candidates. Why are support vectors so important? Because the support vectors alone are sufficient to compute w and b. SVM can therefore be regarded as a mechanism that learns the maximum-margin separating hyperplane by finding the support vectors through the dual problem.

We saw above that in SVM the weight vector w is a linear combination of the y_n z_n, with coefficients α_n obtained from the dual problem. We have seen this before: the PLA update of w uses a similar mechanism. In the same vein, the w finally obtained by logistic regression or linear regression is also a linear combination of the original data; we say that w can be represented by the data. The special thing about SVM is that only the support vectors among the training data are needed to represent w, whereas PLA uses the points on which it made mistakes.

Summing up, these two lectures cover two ways of solving the SVM: the primal problem and the dual problem. The number of variables in the primal SVM is tied to the dimension d̃ of the space the data are mapped into; if d̃ is very large or infinite, the primal is hard to solve, and it finds b and w through the freedom to rescale the hyperplane. The dual SVM has N variables, one per training example, and solves for b and w by finding the support vectors and their corresponding α.

Is that enough? Recall why we extended to the dual problem in the first place: we wanted to solve the SVM with a computational cost unrelated to d̃, because d̃ could be infinite. The dual SVM appears to depend only on N, but in fact d̃ is hidden in the computation of the Q matrix: each element of Q is the inner product of two z vectors, and the length of a z vector is d̃. So the next question is how to avoid this step in computing the Q matrix; stay tuned for the next lecture.

More learning materials about machine learning techniques will continue to be updated; please follow this blog and my Sina Weibo (Sheridan).

If you reproduce this original article, please cite this link: http://imsheridan.com/mlt_1st_lecture.html
