This column (Machine Learning) covers single-variable linear regression, multivariable linear regression, an Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, support vector machines (SVM), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. All of the content comes from Stanford's public course Machine Learning, taught by Andrew Ng (https://class.coursera.org/ml/class/index).

Lecture 8. Machine Learning--Support Vector Machines (SVM)

===============================

(i) The Cost Function of SVM

(ii) SVM--Large Margin Classifier

(iii) Mathematical Analysis of Why SVM Forms a Large Margin Classifier (optional)

(iv) SVM Kernels 1--Gaussian Kernel

(v) Using the Gaussian Kernel in SVM

(vi) Using and Choosing SVM

This chapter gives an introductory explanation of the support vector machine (SVM), presenting it within the framework of the machine learning models we have already seen. Many people, myself included, used to think SVM was a mysterious concept; after reading this article you will see that it is really just a different objective function and a different model. The essence of machine learning has not changed.

This article took me a long time to complete. To make the material easier to follow, the later sections include programs and reference material so that everyone can experiment. I hope it helps.

=====================================

(i) The Cost Function of SVM

In the previous chapters we covered the cost functions of linear regression, logistic regression, and neural networks. Here we introduce SVM starting from the cost function of logistic regression.

First, recall the logistic regression model: hθ(x) = g(θᵀx) = 1/(1 + e^(-θᵀx)).

As before, suppose we have only two classes, y=0 and y=1. Then from the graph of h(x) above we can see that:

when y=1, we hope h(x) ≈ 1, i.e. z = θᵀx >> 0;

when y=0, we hope h(x) ≈ 0, i.e. z = θᵀx << 0.

The per-example logistic regression cost is as follows: cost(hθ(x), y) = -y·log(hθ(x)) - (1-y)·log(1-hθ(x)).

We have discussed this cost function before, so we will not repeat it here. Now look at the two graphs below, where the gray curves are the logistic regression cost for y=1 and y=0:

for y=1, as z increases, h(x) approaches 1 and the cost gradually decreases;

for y=0, as z decreases, h(x) approaches 0 and the cost gradually decreases.

This is what the gray curves in the figure show.

OK, now let's look at the definition of the cost function in SVM. Look at the rose-colored curves in the figure: these are the cost curves we want. They are very close to the logistic regression cost, but each consists of two straight-line segments. We explain this cost function in detail below.

Recall the full cost function of logistic regression:

Now we give the definition of the SVM objective function (cost function): min over θ of C·Σᵢ[ y⁽ⁱ⁾·cost1(θᵀx⁽ⁱ⁾) + (1-y⁽ⁱ⁾)·cost0(θᵀx⁽ⁱ⁾) ] + (1/2)·Σⱼ θⱼ².

In this formula, cost1 and cost0 are the per-example costs for y=1 and y=0 respectively, and the final regularization term is similar to the one in logistic regression. It may look as if a coefficient is missing compared to logistic regression, but in fact the two regularization terms play the same role: the SVM form can be obtained from the logistic regression form by a linear rescaling (dropping the 1/m factor and replacing λ with C ≈ 1/λ), which does not change the optimal θ.
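To make the two per-example costs concrete, here is a minimal Python sketch. The names cost1/cost0 follow the text; the hinge-style definitions max(0, 1-z) and max(0, 1+z) are the standard straight-line-segment approximations from the lecture, plotted against the logistic cost for comparison:

```python
import math

def logistic_cost_y1(z):
    # Logistic regression cost when y = 1: -log(h(x)), h(x) = sigmoid(z)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def cost1(z):
    # SVM surrogate for y = 1: exactly 0 once z >= 1, linear penalty otherwise
    return max(0.0, 1.0 - z)

def cost0(z):
    # SVM surrogate for y = 0: exactly 0 once z <= -1, linear penalty otherwise
    return max(0.0, 1.0 + z)

# cost1 tracks the shape of the logistic curve but reaches exactly 0 at z = 1
for z in [-2.0, 0.0, 2.0]:
    print(z, round(logistic_cost_y1(z), 3), cost1(z), cost0(z))
```

Note that unlike the logistic cost, which is positive for every finite z, the SVM costs are exactly zero past the ±1 thresholds; this flat region is what later gives rise to the margin.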

=====================================

(ii) SVM--Large Margin Classifier

This section presents a simple conclusion: SVM is a large margin classifier. What is a margin? A detailed explanation follows, and the theoretical proof is given in the next section.

Before introducing the margin, let's review the SVM cost curves from the previous section, shown in the figure below for y=1 and y=0. First, a premise: take the constant C to be a large value (e.g. 100000).

Since C is very large, minimization drives the part in [ ] (denote the part in [ ] as W) to be as small as possible, ideally 0. Now analyze what making W = 0 requires:

※ Requirement 1:

for y=1, only the first term of W remains, so making W=0 requires cost1(θᵀx) = 0; from the right-hand figure, this requires θᵀx >= 1;

for y=0, only the second term of W remains, so making W=0 requires cost0(θᵀx) = 0; from the right-hand figure, this requires θᵀx <= -1.

From the note above, the value of C strikes a balance between classification errors and the size of the margin. So what is the effect of taking a large value of C? That is the conclusion we started with: SVM is a large margin classifier. So what is a margin? In chapter three we discussed the decision boundary, the h(x) boundary that separates the data points well. As the figure below shows, the green, pink, blue, and black lines could all serve as decision boundaries, but which one is best? The green, pink, and blue boundaries pass very close to the data, so adding just a few new data points could cause them to misclassify; the black boundary, by contrast, keeps a comparatively large distance from both classes, and that is the kind of boundary we want. The margin is the distance between the two blue lines obtained by translating the boundary until it touches each class, as indicated in the figure.

By comparison:

if C is small (or moderate), the decision boundary is the black line; if C is very large, it becomes the pink one.

You can simply remember this conclusion, but it can also be derived mathematically; in the next section we analyze, from a mathematical point of view, why an SVM with a large value of C forms a large margin classifier.

Next, a mathematical definition of the geometric margin:

As shown in the figure above, the distance γ from any point x to the separating plane is γ = y·(wᵀx + b)/||w||, where y ∈ {+1, -1} is the class label, x0 is the projection of x onto the plane (the closest point on the plane to x), and the plane's equation is wᵀx + b = 0; substituting x0 into the equation yields the formula above. For a dataset X, the margin is the smallest geometric margin over all points, i.e. the distance from the hyperplane to the nearest point; the goal of SVM is to find the hyperplane with the maximum margin.
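As a quick numerical check of this definition, the sketch below computes y·(wᵀx + b)/||w|| for a hypothetical plane and two labeled points (the plane and the points are made-up values, not taken from the figure):

```python
import math

def geometric_margin(w, b, x, y):
    # Signed distance of point x (label y in {+1, -1}) to the plane w.x + b = 0;
    # positive when x lies on the side its label predicts.
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

# Hypothetical plane x1 + x2 - 1 = 0
w, b = [1.0, 1.0], -1.0
print(geometric_margin(w, b, [2.0, 2.0], +1))  # positive point above the plane
print(geometric_margin(w, b, [0.0, 0.0], -1))  # negative point below the plane
```

Both points are correctly classified, so both margins come out positive; the dataset margin would be the smaller of the two.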

Practice:

=====================================

(iii) Mathematical Analysis of Why SVM Forms a Large Margin Classifier (optional)

This section proves the conclusion of the previous section: why SVM is a large margin classifier and forms a good decision boundary. Readers interested only in applications can skip this section.

First, consider the inner product of two vectors. Suppose u and v are both two-dimensional vectors; we know their inner product is uᵀv = u1·v1 + u2·v2. In coordinates, as shown on the left of the figure below:

First project v onto the vector u, and let the signed length of that projection be p (positive when it points along u, negative when it points the opposite way; p is a scalar). Then the inner product of the two vectors is uᵀv = ||u||·||v||·cosθ = ||u||·p = u1·v1 + u2·v2.
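A small numerical sketch of this identity, with made-up vectors u and v:

```python
import math

u = [4.0, 0.0]
v = [3.0, 3.0]

inner = u[0] * v[0] + u[1] * v[1]          # u^T v = u1*v1 + u2*v2
norm_u = math.sqrt(u[0] ** 2 + u[1] ** 2)  # ||u||
p = inner / norm_u                         # signed projection of v onto u

# The two expressions for the inner product agree: u^T v = ||u|| * p
print(inner, norm_u * p)
```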

With this in hand, look again at the cost function of SVM:

Because C is very large, effectively only the regularization term remains. To simplify while still illustrating the point, set θ0 = 0, leaving only θ1 and θ2.

Then the cost function is J(θ) = (1/2)·||θ||².

And by the derivation above, θᵀx = p·||θ||, where p is the projection of x onto θ. Then:

※ Requirement 2:

for y=1, only the first term of W remains, so making W=0 requires cost1(θᵀx) = 0; from the right-hand figure, this requires p·||θ|| >= 1;

for y=0, only the second term of W remains, so making W=0 requires cost0(θᵀx) = 0; from the right-hand figure, this requires p·||θ|| <= -1; as shown in the figure below:

Now let's focus on why SVM's decision boundary has a large margin (this part is a little more involved, so read carefully):

For a given dataset, positive samples are represented by x and negative samples by o; the green line is the decision boundary, the blue line is the direction of the vector θ, and the rose-colored segments are the projections of the data onto θ.

We know that the decision boundary is at a 90° angle to the vector θ: with θ0 = 0 the boundary is the set of points where θᵀx = 0, i.e. exactly the points orthogonal to θ (you can verify this yourself).

Look at this figure first: for such a decision boundary (without a large margin), θ at 90° to it is as shown, and the projections of the x and o data onto θ, drawn in the figure, are very small. To satisfy [Requirement 2],

p·||θ|| >= 1 for positive samples,

p·||θ|| <= -1 for negative samples,

we would need ||θ|| to be very large, which conflicts with what the cost function wants (to minimize (1/2)·||θ||²). Therefore SVM will not produce the decision boundary shown in this figure.

Now look at the figure below:

It shows the "better" decision boundary we identified in the previous section, with a relatively large margin on both sides. The projections of the data onto θ are now much larger, so ||θ|| can be relatively small, which is exactly what the SVM cost function favors. Therefore the decision boundary obtained by minimizing the SVM cost function necessarily has a large margin. That, in short, is the idea.
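The argument can be sketched numerically. The helper below, for a hypothetical one-point-per-class dataset (the points and candidate directions are made up, not taken from the figure), computes the smallest ||θ|| that satisfies Requirement 2 when θ is constrained to a given direction; the direction aligned with the data, which yields large projections, gets away with a much smaller ||θ||:

```python
import math

def min_theta_norm(direction, positives, negatives):
    # Smallest ||theta|| (theta along `direction`) satisfying
    # p * ||theta|| >= 1 for positives and p * ||theta|| <= -1 for negatives,
    # where p is the signed projection of each sample onto theta.
    d_norm = math.sqrt(sum(di * di for di in direction))
    unit = [di / d_norm for di in direction]
    # Negated projections for negatives turn both constraints into proj * ||theta|| >= 1
    projections = [sum(ui * xi for ui, xi in zip(unit, x)) for x in positives]
    projections += [-sum(ui * xi for ui, xi in zip(unit, x)) for x in negatives]
    worst = min(projections)          # the binding (smallest) projection
    if worst <= 0:
        return float("inf")           # this direction cannot separate the data
    return 1.0 / worst

positives = [[1.0, 3.0]]
negatives = [[-1.0, -3.0]]

# theta perpendicular to the large-margin boundary: big projections, small ||theta||
print(min_theta_norm([1.0, 3.0], positives, negatives))
# theta perpendicular to a skewed boundary: small projections force a large ||theta||
print(min_theta_norm([1.0, 0.0], positives, negatives))
```

Minimizing (1/2)·||θ||² therefore automatically prefers the first direction, i.e. the large-margin boundary.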

Practice:

Analysis: from the figure we can see that the optimal decision boundary is y = x1, and the minimum projection of any data point onto θ is 2. In other words, to satisfy

p·||θ|| >= 1 for positive samples,

p·||θ|| <= -1 for negative samples,

it is enough that

2·||θ|| >= 1 for positive samples,

(-2)·||θ|| <= -1 for negative samples; both require ||θ|| >= 1/2, and since the cost function minimizes ||θ||, we conclude ||θ|| = 1/2.

=====================================

(iv) SVM Kernels 1--Gaussian Kernel

For a nonlinear decision boundary, we previously used polynomial fitting to predict:

Let f1, f2, ..., fn be the extracted features. Define the prediction hθ(x) as a sigmoid of a polynomial: hθ(x) = g(θ0·f0 + θ1·f1 + ... + θn·fn), where each fi is a combination of power terms of x (pictured below). When θ0·f0 + θ1·f1 + ... + θn·fn >= 0, hθ(x) = 1; else hθ(x) = 0.

So, besides defining the fi as power-term combinations of x, is there any other way to represent f? This section introduces the concept of the kernel. We use a kernel function to define f.

For the nonlinear fit in the figure above, we compute each kernel value f as the similarity between the input vector x and a landmark l: f = similarity(x, l) = exp(-||x - l||² / (2σ²)).

Notice that this similarity formula looks very much like a normal (Gaussian) distribution, right? Yes: this is the Gaussian kernel function. As can be seen from the figure below,

the more similar x and l are, the closer f is to 1;

the farther x is from l, the closer f is to 0.

In the figure below, the horizontal axes are the two components of x and the height is f (the new feature). The peak is where x = l, at which point f = 1.

As x moves away from l, f gradually falls toward 0.
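A minimal sketch of the Gaussian kernel as described above (the landmark coordinates and σ = 1 are arbitrary choices for illustration):

```python
import math

def gaussian_kernel(x, l, sigma=1.0):
    # Similarity f = exp(-||x - l||^2 / (2 * sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

l = [3.0, 5.0]
print(gaussian_kernel([3.0, 5.0], l))   # x == l        -> f = 1 (the peak)
print(gaussian_kernel([10.0, 10.0], l)) # x far from l  -> f close to 0
```

The parameter σ controls how quickly f falls off with distance: a larger σ gives a wider, flatter bump around the landmark.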

Now let's look at how an SVM with a kernel classifies:

After introducing the kernel function, the algebraic difference is only in f: previously f was x1, x1², ..., i.e. power-term combinations of the xi; now f is a similarity to a landmark.

Geometrically, the kernel makes it more intuitive how a point gets classified (figure below). For example, suppose we want to split all the data points on the plane into two classes: points inside the red circle should be predicted y=1, and points outside y=0. Training on the dataset gives a set of theta values (θ0, θ1, θ2, θ3) = (-0.5, 1, 1, 0) and three landmark points (l1, l2, l3) (exactly how these are trained we will not dwell on here; more later). For each test point, we first compute its similarity to each of (l1, l2, l3), i.e. the kernel values (f1, f2, f3), then evaluate the polynomial θ0·f0 + θ1·f1 + θ2·f2 + θ3·f3. When this is >= 0, the prediction is an interior point (positive sample, y=1); otherwise the prediction is a negative sample, y=0.
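Putting the pieces together, here is a sketch of this prediction rule using the θ values quoted above; the landmark coordinates are made up for illustration, since the figure's actual coordinates are not given:

```python
import math

def gaussian_kernel(x, l, sigma=1.0):
    # Similarity f = exp(-||x - l||^2 / (2 * sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def predict(x, landmarks, theta, sigma=1.0):
    # theta = (theta0, theta1, ..., thetan); f0 = 1 is the bias feature
    features = [1.0] + [gaussian_kernel(x, l, sigma) for l in landmarks]
    score = sum(t * f for t, f in zip(theta, features))
    return 1 if score >= 0 else 0

landmarks = [[2.0, 2.0], [4.0, 2.0], [3.0, 6.0]]  # hypothetical l1, l2, l3
theta = [-0.5, 1.0, 1.0, 0.0]                     # theta values quoted in the text

print(predict([2.0, 2.0], landmarks, theta))   # near l1: f1 ~ 1, score > 0 -> y = 1
print(predict([10.0, 10.0], landmarks, theta)) # far from every landmark -> y = 0
```

Note how θ3 = 0 makes l3 irrelevant: a point near l3 alone gets a score of about -0.5 and is predicted y=0, while points near l1 or l2 push the score above 0 and are predicted y=1, carving out exactly the kind of closed region shown in the figure.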