Machine Learning Algorithms and Python Practice (II): Support Vector Machines (SVM), Beginner



http://blog.csdn.net/zouxy09

This "Machine Learning Algorithms and Python Practice" series is based mainly on the book *Machine Learning in Action*. I wanted to learn Python and, at the same time, get to know some machine learning algorithms, so I decided to implement a few of the most commonly used ones in Python. Since that is exactly the approach the book takes, I follow its progression as I learn.

In this part we review the support vector machine systematically and implement it in Python. Because there is a lot of material, it is split into three posts: the first introduces SVM at a beginner level, the second goes deeper, straightening out the whole chain of SVM knowledge, and the third covers the Python implementation. There are many excellent blog posts on SVM; see the references and recommended reading listed in this article. This post aims to tie the whole SVM knowledge chain together, so it does not go into detailed derivations. Those derivations are explained very well online and in many books, which you can consult for more depth.

**Contents**

I. Introduction

II. Linearly separable SVM and hard-margin maximization

III. The dual optimization problem

3.1 The dual problem

3.2 The dual problem for SVM optimization

IV. Slack variables and soft-margin maximization

V. Kernel functions

VI. Multi-class classification with SVM

6.1 The "one-versus-rest" method

6.2 The "one-versus-one" method

VII. Analysis of the KKT conditions

VIII. The SMO algorithm for implementing SVM

8.1 The coordinate descent algorithm

8.2 The principle of the SMO algorithm

8.3 Python implementation of the SMO algorithm

IX. References and recommended reading

**I. Introduction**

The support vector machine (SVM) has a big reputation: in machine learning and pattern recognition, everyone knows it. In the 1980s and 1990s it went head-to-head with neural networks and attracted a large crowd of enthusiastic followers. Decades have passed, yet its style is as strong as ever; it still holds a great deal of territory in pattern recognition, and its throne has been secure for decades. Of course, it has also bred many descendants: countless modified versions and a whole extended family of related methods. But its core wisdom is still admired, and will be for generations.

Well, I have been advertising for long enough; I don't know whether that was overselling it. Let's get down to earth and see what the legendary SVM actually is. We know that the purpose of classification is to learn a classification function or classification model (a classifier) that maps data items to one of several given categories, so that it can predict samples of unknown category. The SVM used for classification is a binary classification model. In other words, given a sample set containing both positive and negative sample points, the goal of the SVM is to find a hyperplane that splits the samples, with the positive and negative examples on opposite sides of it; and not just any split, but one that makes the margin between the positive and negative examples as large as possible. The goal of learning is thus to find a separating hyperplane **w**^T **x** + b = 0 in feature space, determined by the normal vector **w** and the intercept b. The hyperplane divides the feature space into two parts: the side the normal vector points to is the positive class, and the other side is the negative class.

Here is a small example of classifying just two types of samples in two-dimensional space. Suppose we are given the two classes of points, Class1 and Class2 (the positive and negative sample sets), shown in the left figure. Our task is to find a line that separates them. You will tell me that this is simple: one swipe of the pen and a whole bundle of colorful lines appears, and then you proudly say, look at the right figure below, these are all answers, and if you want more, I can draw you countless others. Yes, that's right, you can draw countless lines. But which one is best? You will ask me, how do we measure "good"? Suppose Class1 and Class2 are two villages locked in a standoff over the division of the land between them, and they ask you to draw the fairest boundary; "good" here means fair to both Class1 and Class2. You point at the black line and say, "It's that one!" Anyone can see that drawing the line midway between the two villages is obviously fair: neither side gets more, neither side gets less. The example may not be perfect, but the idea is the same. For classification, we need to determine a separating line; when a new sample arrives, if it falls on the left of the line it is classified as Class1, and if it falls on the right it is classified as Class2. Which line is best? We still feel it is the middle one, because then the classification of a new sample is most credible; "good" here means credible. Also, in two-dimensional space the classification boundary is a line; in three dimensions it is a plane; and in higher dimensions it has the imposing name "hyperplane". Because the name is so imposing, the classification boundary in any number of dimensions is generally just called a hyperplane.

All right. For a person, it is easy to find this line or hyperplane (of course, that is because you can see how the samples are distributed; if the samples have more than three dimensions, we cannot plot them as in the figure above, and at that point the human eye is of no use: "If I could see, life might be completely different; maybe what I want, what I like, what I love would all be different..."). But how does a computer find this line? How do we tell it our way of finding the line, so that it can find it the same way? Well, we build a model! We "impose" our intuition on the computer as a mathematical model, let it solve the model, and the solution it finds is our line; then the goal is achieved. So the modeling journey begins.

**II. Linearly separable SVM and hard-margin maximization**

In fact, the classification idea above is exactly the idea of SVM. It can be stated as follows: SVM tries to find a hyperplane that splits the samples, with the positive and negative examples on opposite sides of the hyperplane, and not in a perfunctory way, but so that the margin between the positive and negative examples is as large as possible. This makes the classification result more credible and gives good predictive ability on unknown new samples (what machine learning elegantly calls generalization ability).

Our goal is to find a hyperplane such that the points closest to it are as far from it as possible. That is, we do not require every point to be far from the hyperplane; what we care about is that the hyperplane maximizes its distance to the points nearest to it.

Let's describe this with a mathematical formulation. Suppose we have N training samples {(**x**1, y1), (**x**2, y2), ..., (**x**N, yN)}, where each **x**i is a D-dimensional vector and yi ∈ {+1, -1} is the label of the sample, representing the two different classes. We want to use these samples to train a linear classifier (hyperplane): f(**x**) = sgn(**w**^T **x** + b), which outputs +1 when **w**^T **x** + b is greater than 0 and -1 when it is less than 0; sgn() denotes the sign function. Then g(**x**) = **w**^T **x** + b = 0 is the separating hyperplane we are looking for, as shown in the figure. What do we want this hyperplane to do? We want it to separate the two classes with the maximum margin; that is, its distances to the nearest samples of the two classes should be equal and as large as possible. To make this concrete, we introduce two planes parallel to and equidistant from the hyperplane: H1: **w**^T **x** + b = +1 and H2: **w**^T **x** + b = -1.
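As a minimal sketch of this classifier (the weight vector `w` and intercept `b` below are made-up values for illustration; in practice they are learned from the training samples, which is the subject of the rest of this post):

```python
def sgn(v):
    """Sign function: +1 for positive input, -1 otherwise."""
    return 1 if v > 0 else -1

def predict(w, b, x):
    """Classify a D-dimensional point x with the hyperplane w^T x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sgn(score)

w = [1.0, -1.0]   # hypothetical normal vector
b = 0.0           # hypothetical intercept

print(predict(w, b, [2.0, 1.0]))   # point on the positive side -> 1
print(predict(w, b, [1.0, 3.0]))   # point on the negative side -> -1
```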

Well, then we need two conditions: (1) there are no sample points between the two planes; (2) the distance between the two planes is as large as possible. (For any H1 and H2, we can rescale the coefficients **w** and b so that the right-hand sides of the H1 and H2 equations are +1 and -1 respectively.) Look at condition (2) first. We want to maximize this distance, so there must be some samples lying on these two planes; they are called support vectors (their importance will be discussed later). So what is the distance? We learned in junior high how to compute the distance between two parallel lines: for ax + by = c1 and ax + by = c2, the distance is |c2 - c1| / sqrt(a^2 + b^2), where sqrt() denotes the square root. Writing H1 and H2 in this form, H1: w1 x1 + w2 x2 = 1 - b and H2: w1 x1 + w2 x2 = -1 - b, the distance between H1 and H2 is |(1 - b) - (-1 - b)| / sqrt(w1^2 + w2^2) = 2/||**w**||, i.e., twice the reciprocal of the norm of **w**. In other words, we want to maximize margin = 2/||**w**||, and to maximize this distance we should minimize ||**w**||; it looks that simple. At the same time, we must also satisfy condition (1), i.e., no data points fall between H1 and H2:
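The margin formula can be checked numerically against the junior-high parallel-line distance formula (the weight vector below is a made-up example chosen so that ||**w**|| = 5):

```python
import math

def line_distance(a, b, c1, c2):
    """Distance between parallel lines a*x + b*y = c1 and a*x + b*y = c2."""
    return abs(c2 - c1) / math.sqrt(a * a + b * b)

w = [3.0, 4.0]                                 # hypothetical weights, ||w|| = 5
margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))

# H1: 3*x1 + 4*x2 = 1 - b and H2: 3*x1 + 4*x2 = -1 - b; b cancels in c2 - c1.
print(line_distance(w[0], w[1], 1.0, -1.0))    # 2/5 = 0.4
print(margin)                                  # 2/||w|| = 0.4, same value
```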

That is, any positive sample (yi = +1) must lie on or beyond H1, which means **w**^T **x**i + b >= +1; and any negative sample (yi = -1) must lie on or beyond H2, which means **w**^T **x**i + b <= -1. These two constraints can be combined into a single inequality: yi(**w**^T **x**i + b) >= 1.

So our problem becomes:

min (1/2)||**w**||^2, s.t. yi(**w**^T **x**i + b) >= 1, i = 1, ..., N

This is a convex quadratic programming problem. What does "convex" mean? A convex set is a set of points such that, for any two points in the set, the straight segment connecting them still lies inside the set; in that sense the word "convex" is quite vivid. For a convex problem (in the mathematical formulation, the constraints are affine functions, i.e., of the linear form ax + b), a local optimum is also the global optimum, which is not true for non-convex problems. "Quadratic" means that the objective function is a quadratic function of the variables.

Well, since it is a convex quadratic programming problem, the optimal solution can be obtained with standard QP (quadratic programming) solvers, so in principle our problem is solved at this point. Although it really is a standard QP problem, it also has special structure: by applying the Lagrange duality transformation to obtain an optimization problem in the dual variables, we can find a more efficient way to solve it, often much more efficient than directly using a general-purpose QP solver. So besides the conventional QP approach, we can apply Lagrange duality and obtain the optimal solution by solving the dual problem. This is the dual algorithm of the support vector machine in the linear case. Its advantages are that the dual problem is often easier to solve, and that it naturally introduces the kernel function, which then generalizes SVM to nonlinear classification problems. So what is the dual problem?

**III. The dual optimization problem**

**3.1 The dual problem**

In constrained optimization, the original problem is often converted into a dual problem via Lagrange duality, and the solution of the original problem is obtained by solving the dual. Reference [3] explains the principle and derivation very well; please refer to it. Here we only show how the dual problem is manipulated. Suppose our optimization problem is:

min f(**x**)

s.t. hi(**x**) = 0, i = 1, 2, ..., n

This is an optimization problem with equality constraints. Introducing Lagrange multipliers, we obtain the Lagrangian function:

L(**x**, **α**) = f(**x**) + α1 h1(**x**) + α2 h2(**x**) + ... + αn hn(**x**)

We then find the extremum of the Lagrangian with respect to **x**: take the derivative with respect to **x**, set it to 0, and solve to get **x** as a function of **α**; substituting back into the Lagrangian gives:

max W(**α**) = L(**x**(**α**), **α**)

At this point, the equality-constrained optimization problem has become an optimization problem in the single variable **α** (a vector, if there are multiple constraints), which is easy to solve: again set the derivative equal to 0 and solve for **α**. Note the terminology: the original problem is called the primal problem, and the converted form is called the dual problem. Note also that the primal problem is a minimization, while after conversion the dual problem becomes a maximization. For inequality constraints, essentially the same manipulation works. Simply put, by attaching a Lagrange multiplier to each constraint, we fold the constraints into the objective function, which makes the optimization problem easier to handle. (There is actually much more of interest here; see other posts for details.)
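A tiny worked instance of this recipe (my own toy example, not from the book): minimize f(x) = x^2 subject to h(x) = x - 1 = 0. The Lagrangian is L(x, α) = x^2 + α(x - 1); setting dL/dx = 2x + α = 0 gives x(α) = -α/2, so the dual function is W(α) = -α^2/4 - α. Maximizing W (dW/dα = -α/2 - 1 = 0) gives α* = -2 and hence x* = 1, the obvious constrained minimizer:

```python
def dual(a):
    """Dual function W(a): the Lagrangian with x eliminated via dL/dx = 0."""
    x = -a / 2.0                  # stationary point of L(x, a) in x
    return x * x + a * (x - 1.0)  # = -a^2/4 - a

a_star = -2.0                     # maximizer of W, from dW/da = -a/2 - 1 = 0
x_star = -a_star / 2.0            # recover the primal solution

print(x_star)                     # 1.0: the constrained minimizer of x^2
print(dual(a_star))               # 1.0: dual optimum equals primal optimum f(x*)
```

That the dual optimum equals the primal optimum here illustrates strong duality, which also holds for the SVM problem below.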

**3.2 The dual problem for SVM optimization**

For SVM, as mentioned earlier, its primal problem has the following form:

min (1/2)||**w**||^2, s.t. yi(**w**^T **x**i + b) >= 1, i = 1, ..., N

Introducing Lagrange multipliers in the same way, we obtain the Lagrangian function:

L(**w**, b, **α**) = (1/2)||**w**||^2 - Σi αi [yi(**w**^T **x**i + b) - 1]

We then find the extrema of L(**w**, b, **α**) with respect to **w** and b; that is, the gradients of L(**w**, b, **α**) with respect to **w** and b must be 0: ∂L/∂**w** = 0 and ∂L/∂b = 0, and we also need αi >= 0. Solving the zero-derivative equations gives:

**w** = Σi αi yi **x**i and Σi αi yi = 0

Substituting these back into the Lagrangian function, it becomes:

max W(**α**) = Σi αi - (1/2) Σi Σj αi αj yi yj <**x**i, **x**j>, s.t. αi >= 0, Σi αi yi = 0

This is the dual problem (if we know **α**, we know **w**; conversely, if we know **w**, we can also find **α**). It has become a maximization with respect to **α**, i.e., an optimization over the dual variable **α** (no more **w** and b, only **α**). Once the optimal **α*** is obtained, the formulas above give **w*** and b*, yielding the separating hyperplane and the classification decision function; that is, the SVM has been trained. A new sample **x** can then be classified as follows:

f(**x**) = sgn(Σi αi* yi <**x**i, **x**> + b*)

Here, in fact, many of the αi are 0; that is to say, **w** is a linear combination of only a few samples. This "sparse" representation can be seen as a data-compressed version of KNN. In other words, to classify a new sample we first apply the linear function defined by **w** and b, then check whether the result is greater or less than 0 to decide positive or negative. Now that we have the αi, we do not even need to compute **w** explicitly; we only take inner products of the new sample with the training samples. You might object: isn't it too expensive to compute against all the training samples? In fact, from the KKT conditions, only the support vectors have nonzero αi; all the other samples have αi = 0. Therefore, we only need inner products of the new sample with the support vectors. This notation also lays good groundwork for the kernel function (kernel) to be discussed later. As shown below:
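The decision function f(**x**) = sgn(Σi αi yi <**x**i, **x**> + b) can be sketched directly from this description. The alphas, support vectors, labels, and b below are made-up numbers; in a real SVM they come out of training (e.g. the SMO algorithm in part three of this series):

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def svm_predict(alphas, labels, svs, b, x):
    """Classify x using only the support vectors (the samples with alpha > 0)."""
    score = sum(a * y * dot(sv, x) for a, y, sv in zip(alphas, labels, svs))
    return 1 if score + b > 0 else -1

alphas = [0.5, 0.5]                 # hypothetical nonzero multipliers
labels = [+1, -1]
svs = [[1.0, 1.0], [-1.0, -1.0]]    # hypothetical support vectors
b = 0.0

print(svm_predict(alphas, labels, svs, b, [2.0, 0.5]))    # -> 1
print(svm_predict(alphas, labels, svs, b, [-2.0, -0.5]))  # -> -1
```

Note that only the support vectors enter the sum, which is exactly why discarding the zero-alpha samples loses nothing.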

**IV. Slack variables and soft-margin maximization**

The situation we discussed so far rests on the assumption that the samples are distributed nicely and are linearly separable, in which case a near-perfect hyperplane separating the two classes can be found. But what if we meet either of the following situations? In the left figure, one negative-class sample is not very sociable and has wandered over to the positive side; if we use the method above to determine the classifier, we get the red boundary on the left, which does not look good at all. Then there is the case in the right figure: a positive point and a negative point have each run over to the other side's doorstep, and now no straight line can separate them. What then? Should we really give in to these few unruly outliers? Is it worth ruining our originally perfect boundary just because of their imperfection? But they do have to be taken into account, so how do we compromise?

We call such data points, which deviate from their normal position, outliers. They may be noise in the collected training samples, or a sample that some data-labeling uncle, dozing off, marked with the wrong sign: a positive sample labeled as negative. Usually, if we simply ignored such a point, the original separating hyperplane would still work fine; but because of the outlier, the separating hyperplane is forced to be squeezed crooked, and the margin correspondingly shrinks. More seriously, of course, if an outlier like the one in the right figure is present, we cannot construct a hyperplane that linearly separates the data at all.

To handle this situation, we allow data points to deviate from the hyperplane to some extent. That is, we allow some points to lie between H1 and H2, meaning their functional margin to the classification surface is less than 1. As shown:

Specifically, the original constraints become:

yi(**w**^T **x**i + b) >= 1 - ξi, ξi >= 0, i = 1, ..., N

We then add a penalty term to the objective function, and the new model (known as the soft margin) becomes:

min (1/2)||**w**||^2 + C Σi ξi, s.t. yi(**w**^T **x**i + b) >= 1 - ξi, ξi >= 0

After introducing the nonnegative parameters ξi (called slack variables), the functional margin of some sample points is allowed to be less than 1, i.e., inside the maximum margin, or even negative, i.e., on the wrong side of the hyperplane. Having relaxed the constraints, we must readjust the objective function to penalize these outliers: the second term in the objective means that the more numerous and the farther out the outliers, the larger the objective value, whereas we want the objective to be as small as possible. Here C is the weight on the outliers: the larger C is, the more the outliers affect the objective function, i.e., the less we want to see them, and the smaller the margin will become. We see that the objective function controls the number and extent of the outliers, so that most sample points still obey the constraints.
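The slack needed by each point is ξi = max(0, 1 - yi(**w**^T **x**i + b)), and the soft-margin objective is (1/2)||**w**||^2 + C Σi ξi. A small sketch, evaluating these for a fixed, hypothetical hyperplane and made-up data points (not a trained model):

```python
def slack(w, b, x, y):
    """Slack xi needed for (x, y) to satisfy y*(w^T x + b) >= 1 - xi."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def objective(w, b, data, C):
    """Soft-margin objective 0.5*||w||^2 + C * sum of slacks."""
    reg = 0.5 * sum(wi * wi for wi in w)
    penalty = sum(slack(w, b, x, y) for x, y in data)
    return reg + C * penalty

w, b = [1.0, 0.0], 0.0
data = [([2.0, 0.0], +1),    # well beyond H1: xi = 0
        ([0.5, 0.0], +1),    # inside the margin: xi = 0.5
        ([1.0, 0.0], -1)]    # on the wrong side: xi = 2.0

print(objective(w, b, data, C=1.0))   # 0.5 + 1.0*(0 + 0.5 + 2.0) = 3.0
```

Raising C makes the same slacks cost more, which is exactly how C trades margin width against tolerance for outliers.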

After the same derivation as before, our dual optimization problem becomes:

max W(**α**) = Σi αi - (1/2) Σi Σj αi αj yi yj <**x**i, **x**j>, s.t. 0 <= αi <= C, Σi αi yi = 0

Notice that the parameters ξi have disappeared: the only difference from the previous model is the extra constraint αi <= C. Note also that the formula for computing b changes as a result; the change is described together with the SMO algorithm.
