SVM (III): Support Vector Machines, the linearly non-separable case and kernel functions

Source: Internet
Author: User
3.1 The linearly non-separable case

The situations we discussed earlier all assume that the samples are linearly separable. When the samples are not linearly separable, we can try to use a kernel function to map the features to a higher dimension, where they may become separable. However, even after the mapping we cannot guarantee 100% separability. So what should we do? We need to adjust the model so that, even when the data cannot be perfectly separated, we can still try to find a separating hyperplane.

See the following two figures:

We can see that a single outlier (which may be noise) can cause the separating hyperplane to move and the margin to shrink, so the previous model is very sensitive to noise. Worse, if an outlier lies inside the region of the other class, the data is no longer linearly separable at all.

At this time, we should allow some points to be "free", that is, to violate the constraint in the model that the functional margin is at least 1. We therefore design a new model (also called the soft margin):

min_{w,b,ξ}  (1/2)||w||² + C Σ_i ξ_i
s.t.  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, m.

Here we introduce non-negative parameters ξ_i (called slack variables) and allow the functional margin of some sample points to be less than 1, that is, to fall inside the maximum margin, or even to be negative, that is, the sample point lies in the region of the other class. After relaxing the constraints, we need to re-adjust the objective function to penalize these outliers: the term C Σ_i ξ_i added to the objective means that the more, and the farther, points violate the margin, the larger the objective value becomes, whereas what we want is the smallest possible objective value. Here C is the weight of the outliers. A larger C means outliers have a greater influence on the objective function, that is, we are less willing to tolerate them. As we can see, the objective function controls the number and degree of the outliers, so that most of the sample points still satisfy the original constraint.
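
As a hypothetical illustration of the role of C (the original text names no library; this is a minimal sketch using scikit-learn's SVC), a larger C penalizes margin violations more heavily, while a smaller C tolerates more outliers in exchange for a wider margin:

```python
# Sketch: effect of the soft-margin weight C on a toy 2-D dataset with one outlier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])  # two separated blobs
y = np.array([-1] * 20 + [1] * 20)
X[0] = [2.5, 2.5]  # inject one outlier of the negative class into the positive region

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="linear").fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.support_.size}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```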

After the model is modified, the Lagrangian must be modified accordingly:

L(w, b, ξ, α, r) = (1/2)||w||² + C Σ_i ξ_i − Σ_i α_i [y_i (w^T x_i + b) − 1 + ξ_i] − Σ_i r_i ξ_i.

Here the α_i and r_i are all Lagrange multipliers. Recall the method we used for the Lagrange dual: first write out the Lagrangian (as above), then regard it as a function of the variables w, b and ξ, set the partial derivatives to zero, and obtain expressions for w and b; substituting these back in, we then maximize over α. The whole derivation is similar to that of the previous model, so only the final result is written here:

max_α  Σ_i α_i − (1/2) Σ_i Σ_j y_i y_j α_i α_j ⟨x_i, x_j⟩
s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0,  i = 1, …, m.

At this point we find that the slack variables ξ_i no longer appear, and the only difference from the previous model is the additional constraint α_i ≤ C. It should also be noted that the formula for evaluating b has changed; that change will be introduced with the SMO algorithm. First, let's look at how the KKT conditions change:
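
To make the dual concrete, here is a small sketch (my own illustration, not from the original) that evaluates the dual objective just written and checks the box constraint 0 ≤ α_i ≤ C and the equality constraint Σ_i α_i y_i = 0:

```python
# Sketch: evaluate the soft-margin dual objective and its constraints with NumPy.
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 * sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>."""
    K = X @ X.T                          # Gram matrix of pairwise inner products
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

def is_feasible(alpha, y, C, tol=1e-8):
    """Check 0 <= alpha_i <= C and sum_i alpha_i * y_i = 0."""
    return (np.all(alpha >= -tol) and np.all(alpha <= C + tol)
            and abs(alpha @ y) < tol)
```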

These conditions say that the coefficient α_i of a sample point outside the two margin lines is 0, the coefficient of an outlier (margin-violating) sample point is C, and the coefficient of a support vector, i.e. a sample point lying exactly on one of the maximum-margin lines on either side of the hyperplane, lies in (0, C). The KKT conditions also show that some sample points lying on the maximum-margin line are not support vectors in this strict sense; they may instead be outlier points with coefficient C.

Coordinate Ascent

Before we finally discuss the solution method, let's look at the basic principle of the coordinate ascent method. Assume we need to solve the following optimization problem:

max_α  W(α_1, α_2, …, α_m).

Here W is a function of the vector of variables α_1, …, α_m (not to be confused with the weight vector w). Previously we mentioned two methods for finding the optimum in regression: gradient descent (or ascent) and Newton's method. Now let's introduce another method called coordinate ascent (when solving a minimization problem it is called coordinate descent; the principle is the same).

The procedure is:

Loop until convergence: {
    For i = 1, …, m: {
        α_i := argmax over α̂_i of W(α_1, …, α_{i−1}, α̂_i, α_{i+1}, …, α_m)
    }
}

The innermost statement means: fix all the α_j with j ≠ i; then W can be regarded as a function of α_i alone, so we can optimize over α_i directly. Here the maximization is performed in the order i = 1 to m; we can also change the optimization order so that W increases and converges more quickly. If the inner single-variable maximization can be done very quickly, coordinate ascent is a very efficient method for finding the extremum.

The following figure illustrates the process:

The ellipses are the contour lines of a quadratic function of two variables, and the starting point is (2, -2). The straight-line segments in the figure trace the iterative optimization path: each step moves toward the optimum along a direction parallel to a coordinate axis, because each step optimizes only one variable.
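
A minimal sketch of coordinate ascent on a two-variable quadratic (my own toy example; the function and the starting point (2, -2) mirroring the figure are assumptions): each inner step maximizes over one coordinate in closed form while the other is held fixed, so the path moves parallel to the axes.

```python
# Sketch: coordinate ascent on f(x1, x2) = -(1.5*x1**2 + 2*x2**2 + x1*x2),
# whose maximum value 0 is attained at (0, 0).  Each step maximizes over a
# single coordinate while the other coordinate is held fixed.
def f(x1, x2):
    return -(1.5 * x1**2 + 2 * x2**2 + x1 * x2)

x1, x2 = 2.0, -2.0                 # starting point, as in the figure
for it in range(10):
    x1 = -x2 / 3.0                 # argmax over x1 with x2 fixed (df/dx1 = 0)
    x2 = -x1 / 4.0                 # argmax over x2 with x1 fixed (df/dx2 = 0)
    print(f"iter {it}: x = ({x1:+.4f}, {x2:+.4f}), f = {f(x1, x2):.6f}")
```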

3.2 Kernel functions (kernels)

Definition 3.1 (kernel function, or positive definite kernel). Let X be a subset of R^n and let K(x, z) be a symmetric function defined on X × X. K is called a kernel function if there exists a mapping from X to a Hilbert space H,

φ: X → H,    (1.1)

such that for all x, z in X,

K(x, z) = ⟨φ(x), φ(z)⟩

holds, where ⟨·, ·⟩ denotes the inner product in the Hilbert space H.

 

Consider the problem raised in the "linear regression" section: the feature is the area x of the house, where x is a real number, and y is the price of the house. Suppose that, judging from the distribution of the sample points, x and y roughly follow a cubic curve, so we want to use a cubic polynomial in x to approximate these sample points. First, we need to extend the feature x to three dimensions (x, x², x³) and then look for a model between these features and the result y. This feature transformation is called a feature mapping. The mapping function is denoted φ; in this example,

φ(x) = (x, x², x³)^T.
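
As a concrete sketch (my own illustration; the variable names and the sample area value are made up), the mapping φ(x) = (x, x², x³) turns the one-dimensional area into three features, and a linear model w^T φ(x) + b in the mapped space is exactly a cubic polynomial in x:

```python
# Sketch: the cubic feature mapping phi(x) = (x, x^2, x^3) for a scalar house area x.
import numpy as np

def phi(x):
    """Map a scalar feature x to the 3-dimensional feature vector (x, x^2, x^3)."""
    return np.array([x, x**2, x**3])

x = 120.0            # e.g. a house area (made-up value)
print(phi(x))        # the three mapped features: 120, 14400, 1728000
```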

We hope to apply SVM classification to the features after the mapping, rather than to the initial features. To do so, we only need to replace the inner product in the preceding formula, ⟨x_i, x⟩, with ⟨φ(x_i), φ(x)⟩.

As to why we map the features rather than computing with the original features, the reason mentioned above (to fit the data better) is one; another important reason is that the samples may be linearly inseparable, while mapping the features to a high-dimensional space often makes them separable. (The chapter on support vector machines in Introduction to Data Mining by Pang-Ning Tan et al. gives a good example of this.)

This motivates the formal definition of the kernel function: given a feature mapping φ of the original features, the corresponding kernel function is defined as

K(x, z) = φ(x)^T φ(z).

From this definition we might conclude that, to obtain the effect described at the beginning of this section, we only need to compute φ(x) and φ(z) first and then take their inner product. However, this way of computing is very inefficient: if the original feature is n-dimensional and we map it, say, to a space with on the order of n² dimensions, then computing the mapped features and their inner product takes O(n²) time. Can we reduce the computation time?

Let's look at an example. Assume x and z are both n-dimensional and

K(x, z) = (x^T z)².

Expanding,

K(x, z) = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_i Σ_j (x_i x_j)(z_i z_j).

In this case, we only need to compute the square of the inner product of the original features x and z (time complexity O(n)), and the result equals the inner product of the mapped features, whose components are all the pairwise products x_i x_j. That is to say, we do not need to spend O(n²) time computing the mapping explicitly.

Now let's look at the corresponding mapping function for n = 3. Based on the formula above, we get

φ(x) = (x₁x₁, x₁x₂, x₁x₃, x₂x₁, x₂x₂, x₂x₃, x₃x₁, x₃x₂, x₃x₃)^T.

That is to say, only when this particular mapping function is chosen does the kernel K(x, z) = (x^T z)² equal the inner product of the mapped features.
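
A small numerical check (my own sketch) that the O(n) kernel evaluation (x^T z)² equals the O(n²) inner product of the mapped features whose components are the pairwise products x_i x_j:

```python
# Sketch: verify that K(x, z) = (x^T z)^2 equals <phi(x), phi(z)>, where phi(x)
# has the n^2 components x_i * x_j.
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()          # all pairwise products x_i * x_j

rng = np.random.RandomState(0)
x, z = rng.randn(3), rng.randn(3)          # n = 3
lhs = (x @ z) ** 2                         # O(n) kernel evaluation
rhs = phi(x) @ phi(z)                      # O(n^2) explicit mapping
print(np.isclose(lhs, rhs))                # True
```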

Let's look at another kernel function,

K(x, z) = (x^T z + c)².

The corresponding mapping function (for n = 3) is

φ(x) = (x₁x₁, x₁x₂, x₁x₃, x₂x₁, x₂x₂, x₂x₃, x₃x₁, x₃x₂, x₃x₃, √(2c)x₁, √(2c)x₂, √(2c)x₃, c)^T.

More generally, the kernel (x^T z + c)^d corresponds to a mapped feature space of dimension C(n+d, d). (For the derivation, see http://zhidao.baidu.com/question/16706714.html.)

Because we are computing inner products, we can think of cosine similarity in IR: the smaller the angle between x and z, the larger the kernel value, and vice versa. So the kernel value can be viewed as a measure of how similar x and z are.

Let's look at one more kernel function,

K(x, z) = exp(−||x − z||² / (2σ²)).

If x and z are very similar (||x − z|| ≈ 0), the kernel value is close to 1; if they differ greatly (||x − z|| is large), the kernel value is close to 0. Because this function resembles the Gaussian distribution, it is also called the Gaussian kernel (the RBF kernel). It corresponds to mapping the original features into an infinite-dimensional space.
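
A minimal sketch (my own) of the Gaussian (RBF) kernel: the value approaches 1 when x and z are close and approaches 0 when they are far apart.

```python
# Sketch: the Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)).
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                        # 1.0  (identical points)
print(rbf_kernel(x, np.array([5.0, -3.0])))    # close to 0  (very different points)
```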

Since the Gaussian kernel measures the similarity between x and z and maps it into (0, 1], we can recall logistic regression: the sigmoid function does something similar, so there are also sigmoid kernel functions, among others.

The following figure shows that data which is not linearly separable in low dimensions can become separable after being mapped to a higher dimension. The Gaussian kernel is used here.

Slides from Eric Xing

Note: after introducing the kernel function, how do we classify a new sample? In the linear case, we use SVM to learn w and b; for a new sample x we compute w^T x + b and judge: if the value is greater than or equal to 1 it is the positive class, if it is less than or equal to -1 it is the negative class, and in between it is considered uncertain. If a kernel function is used, does this mean we need to compute φ(x) before prediction? The answer is no; that would be very troublesome. Recall what we derived earlier: w = Σ_i α_i y_i x_i (or, in the mapped space, Σ_i α_i y_i φ(x_i)), so w^T x + b = Σ_i α_i y_i ⟨x_i, x⟩ + b.

We only need to replace the inner product ⟨x_i, x⟩ with the kernel value K(x_i, x); there is no need to compute φ(x) explicitly, and only the support vectors (α_i > 0) contribute nonzero terms to the sum.
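
The prediction rule in the kernel case can be sketched as follows (my own illustration; all parameter names are hypothetical). It uses the decision function f(x) = Σ_i α_i y_i K(x_i, x) + b, summing only over the support vectors, and follows the text's convention that values between -1 and 1 are treated as uncertain.

```python
# Sketch: classify a new sample with a kernelized SVM decision function.
def decision_function(x_new, X_sv, y_sv, alpha_sv, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(a * y * kernel(x_i, x_new)
               for a, y, x_i in zip(alpha_sv, y_sv, X_sv)) + b

def predict(x_new, X_sv, y_sv, alpha_sv, b, kernel):
    """Return +1, -1, or 0 (uncertain), per the convention used in the text."""
    f = decision_function(x_new, X_sv, y_sv, alpha_sv, b, kernel)
    if f >= 1:
        return 1
    if f <= -1:
        return -1
    return 0  # between the margin lines: considered uncertain
```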

Determining the validity of kernel functions

Q: Given a function K, can we use K in place of computing φ(x)^T φ(z)? That is to say, can we find a mapping φ such that K(x, z) = φ(x)^T φ(z) for all x and z?

For example, given a candidate K, how can we tell whether it can be accepted as a valid kernel function?

Next we will answer this question. Given m training samples, each corresponding to a feature vector x^(i), we can plug any pair (x^(i), x^(j)) into K, with i from 1 to m and j from 1 to m, and thus compute an m × m kernel matrix (kernel matrix) with entries K_ij = K(x^(i), x^(j)). For convenience, we use K to denote both the kernel function and the kernel matrix.

If K is a valid kernel function, then by the definition of the kernel function,

K_ij = K(x^(i), x^(j)) = φ(x^(i))^T φ(x^(j)) = φ(x^(j))^T φ(x^(i)) = K(x^(j), x^(i)) = K_ji.

It can be seen that the matrix K is symmetric. Let's draw a stronger conclusion. First, let φ_k(x) denote the k-th coordinate of the mapping φ(x). For any vector z,

z^T K z = Σ_i Σ_j z_i K_ij z_j = Σ_i Σ_j z_i φ(x^(i))^T φ(x^(j)) z_j = Σ_k (Σ_i z_i φ_k(x^(i)))² ≥ 0.

The last step is similar to the earlier expansion of (x^T z)². From this formula we can see that if K is a valid kernel function (i.e. it equals an inner product of mapped features), then the kernel matrix K obtained on the training set must be positive semi-definite (K ⪰ 0).

In this way, we obtain a necessary condition for a kernel function:

K is a valid kernel function ==> the kernel matrix K is symmetric positive semi-definite.

Fortunately, this condition is also sufficient; this is the content of Mercer's theorem.

Mercer's Theorem:

Let K be a mapping from R^n × R^n to R (that is, K maps two n-dimensional vectors to the real number field). Then K is a valid kernel function (also called a Mercer kernel) if and only if, for any finite set of training samples {x^(1), …, x^(m)}, the corresponding kernel matrix is symmetric positive semi-definite.

Mercer's theorem tells us that to show K is a valid kernel function we do not need to construct the mapping φ. Instead, it suffices to compute the kernel matrix on the training set and check that it is symmetric positive semi-definite (for example, by checking that all of its eigenvalues are nonnegative, or that all of its principal minors are nonnegative).
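
A practical sketch of this check (my own illustration; it uses eigenvalues rather than principal minors, which is the usual numerical route): build the m × m kernel matrix on the training samples and verify that its smallest eigenvalue is nonnegative up to a tolerance.

```python
# Sketch: build K_ij = K(x_i, x_j) on the training set and check that the matrix
# is symmetric positive semi-definite (all eigenvalues >= 0 up to tolerance).
import numpy as np

def is_valid_on_sample(kernel, X, tol=1e-10):
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

X = np.random.RandomState(0).randn(10, 3)
print(is_valid_on_sample(lambda x, z: (x @ z) ** 2, X))           # True: valid kernel
print(is_valid_on_sample(lambda x, z: -np.sum((x - z) ** 2), X))  # False: not a kernel
```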

Many other textbooks use the concepts of norms and reproducing kernel Hilbert spaces in the proof of Mercer's theorem, but the proof sketched here is equivalent when the features are n-dimensional.

Kernel functions are not only used in SVM. Whenever an inner product appears in a later model or algorithm, we can try to replace it with a kernel function, which may greatly improve our algorithm.

