Ng Lecture 12: Support Vector Machines (SVMs) (II)

7 Kernel functions (kernels)

Consider the problem we originally raised in the linear regression lecture: the feature is the size of a house, x (a real number), and the result y is the price of the house. Suppose that, from the distribution of the sample points, x and y appear to follow a cubic curve, so we want to approximate the sample points with a cubic polynomial in x. To do that we first extend the feature x to three dimensions, (x, x^2, x^3), and then look for a model between these features and the result. We call this feature transformation a feature mapping; the mapping function is written φ, and in this case φ(x) = (x, x^2, x^3)^T.
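
As a concrete illustration (my own sketch, not part of the original notes), here is this cubic feature mapping in Python/NumPy; the name phi_cubic is just an illustrative choice:

```python
import numpy as np

def phi_cubic(x):
    """Cubic feature mapping: a scalar feature x is expanded to (x, x^2, x^3)."""
    return np.array([x, x ** 2, x ** 3])

# Example: the house-size feature 2.0 becomes a 3-dimensional feature vector.
print(phi_cubic(2.0))  # [2. 4. 8.]
```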

We want to apply the SVM classifier to the mapped features φ(x) rather than to the original features x. To do this, we only need to replace the inner products ⟨x, z⟩ in the earlier formulas with ⟨φ(x), φ(z)⟩.

As for why we compute with the mapped features rather than the original features, better fitting (mentioned above) is one reason. Another important reason is that the samples may be linearly inseparable in the original space, while after mapping them into a high-dimensional space they can often be separated. (The "Support Vector Machines" chapter of Introduction to Data Mining by Pang-Ning Tan et al. explains this with a very good example.)

Now define the kernel function formally: if the inner product of the original features is ⟨x, z⟩, and after mapping it becomes ⟨φ(x), φ(z)⟩, then the kernel function (kernel) is defined as K(x, z) = φ(x)^T φ(z).

From this it seems that, to achieve the effect described at the start of the section, we only need to compute φ(x) and φ(z) and then take their inner product. But that computation can be very inefficient: for example, if the original feature is n-dimensional and we map it into an n^2-dimensional space, computing the inner product of the mapped features takes O(n^2) time. Can we find a way to reduce the computation time?

Let's look at an example. Suppose x and z are both n-dimensional, and let K(x, z) = (x^T z)^2.

Expanding, we get K(x, z) = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_i Σ_j (x_i x_j)(z_i z_j).

That is, we only need to compute the square of the inner product of the original features x and z (which takes O(n) time), and this is exactly the inner product of the mapped features. So we do not need to spend O(n^2) time after all.

Now look at the corresponding mapping function (for n = 3). According to the expansion above, φ(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3)^T.

In other words, the kernel function K(x, z) = (x^T z)^2 equals the inner product of the mapped features only when this φ is chosen as the mapping function.
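
Here is a small numeric check (my own sketch, not from the lecture) that the O(n) kernel computation (x^T z)^2 agrees with the O(n^2) inner product of the explicitly mapped features, using the mapping above for n = 3; phi_quadratic is an illustrative name:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit mapping for K(x, z) = (x^T z)^2: all n^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

k_trick = (x @ z) ** 2                             # O(n): kernel trick
k_explicit = phi_quadratic(x) @ phi_quadratic(z)   # O(n^2): explicit mapping

print(k_trick, k_explicit)  # both are 20.25
assert np.isclose(k_trick, k_explicit)
```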

Now look at another kernel function, K(x, z) = (x^T z + c)^2.

The corresponding mapping function (for n = 3) is φ(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3, √(2c) x_1, √(2c) x_2, √(2c) x_3, c)^T.

More generally, the kernel function K(x, z) = (x^T z + c)^d corresponds to a feature mapping into a C(n + d, d)-dimensional space. (See http://zhidao.baidu.com/question/16706714.html for how to work this out.)
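
To make the dimensionality point concrete, here is a short sketch (my own, with illustrative values of n, c and d) comparing the size of the implicit feature space, C(n + d, d), with the cost of the kernel trick, which remains a single n-dimensional inner product:

```python
import math
import numpy as np

n, c, d = 100, 1.0, 5
print(math.comb(n + d, d))  # 96560646-dimensional implicit feature space

x = np.random.randn(n)
z = np.random.randn(n)
k = (x @ z + c) ** d        # still just an O(n) inner product plus a power
print(k)
```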

Since the kernel is computed from an inner product, we can think of cosine similarity in information retrieval: the smaller the angle between the vectors x and z, the larger the kernel value; conversely, the larger the angle, the smaller the value. So the kernel value can be read as a degree of similarity between x and z.

Now look at yet another kernel function, K(x, z) = exp(-‖x − z‖^2 / (2σ^2)).

Here, if x and z are very similar (‖x − z‖ close to 0), the kernel value is close to 1; if x and z are very far apart (‖x − z‖ large), the kernel value is approximately 0. Since this function looks like a Gaussian distribution, it is called the Gaussian kernel, also known as the radial basis function (RBF) kernel. It can map the original features into an infinite-dimensional space.
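
Here is a minimal sketch (not from the notes) of the Gaussian kernel, showing its value approaching 1 for nearby points and 0 for distant points; sigma is an illustrative bandwidth:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.0, 2.1])))   # similar points  -> ~0.995
print(gaussian_kernel(x, np.array([8.0, -5.0])))  # distant points  -> ~0.0
```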

Since the Gaussian kernel measures the similarity of x and z and maps it into the interval (0, 1], this is reminiscent of the sigmoid function from logistic regression, and indeed there is also a sigmoid kernel, among others.

The figure below illustrates data that is linearly inseparable in a low-dimensional space becoming separable after being mapped to a higher-dimensional space with the Gaussian kernel.

From Eric Xing's slides.

Note how a new sample is classified once a kernel function is used. In the linear case, after using the SVM to learn w and b, we classify a new sample x by computing w^T x + b: if the value is greater than or equal to 1, x belongs to the positive class; if it is less than or equal to −1, it belongs to the negative class; in between, we cannot decide. If a kernel function is used, w^T x + b becomes w^T φ(x) + b. Does that mean we first have to compute φ(x) and then predict? The answer is no; that would be a hassle. Recall what we said before:

Since w = Σ_i α_i y^(i) φ(x^(i)), we have w^T φ(x) + b = Σ_i α_i y^(i) ⟨φ(x^(i)), φ(x)⟩ + b = Σ_i α_i y^(i) K(x^(i), x) + b. So we simply replace the inner products with kernel values and judge the resulting value as above.
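
The prediction rule just described can be sketched as follows (my own illustration; alphas, support_x, support_y and b stand for quantities that would come out of SVM training, which is not shown here):

```python
import numpy as np

def svm_decision(x_new, support_x, support_y, alphas, b, kernel):
    """Kernelized decision value: sum_i alpha_i * y_i * K(x_i, x_new) + b."""
    return sum(a * y * kernel(x_i, x_new)
               for a, y, x_i in zip(alphas, support_y, support_x)) + b

# Toy values chosen only to show the computation; a real SVM would learn them.
support_x = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [1.0, -1.0]
alphas = [0.7, 0.7]
b = 0.0
poly_kernel = lambda x, z: (x @ z + 1.0) ** 2

print(svm_decision(np.array([0.9, 1.1]), support_x, support_y, alphas, b, poly_kernel))
# 5.6 >= 1, so this sample is classified as positive.
```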

8 Determining whether a kernel function is valid

Question: given a function K, can we use K in place of computing ⟨φ(x), φ(z)⟩? In other words, can we find a mapping φ such that K(x, z) = φ(x)^T φ(z) for all x and z?

For example, can we tell whether a given candidate K is a valid kernel function without exhibiting φ explicitly?

The following answers this question. Given m training samples {x^(1), ..., x^(m)}, each corresponding to a feature vector, we can take any two of them, x^(i) and x^(j), and plug them into K to compute K_ij = K(x^(i), x^(j)). Letting i run from 1 to m and j run from 1 to m, we obtain an m × m kernel matrix (Kernel matrix). For convenience, we use K to denote both the kernel function and the kernel matrix.

If we assume that K is a valid kernel function, then by the definition of the kernel, K_ij = K(x^(i), x^(j)) = φ(x^(i))^T φ(x^(j)) = φ(x^(j))^T φ(x^(i)) = K(x^(j), x^(i)) = K_ji.

So the matrix K must be symmetric. We can draw an even stronger conclusion. Let φ_k(x) denote the k-th coordinate of the mapped vector φ(x). Then for any vector z,

z^T K z = Σ_i Σ_j z_i K_ij z_j = Σ_i Σ_j z_i φ(x^(i))^T φ(x^(j)) z_j = Σ_i Σ_j Σ_k z_i φ_k(x^(i)) φ_k(x^(j)) z_j = Σ_k ( Σ_i z_i φ_k(x^(i)) )^2 ≥ 0.

The last step uses the same rearrangement as the expansion of (x^T z)^2 earlier. From this formula we can see that if K is a valid kernel function (that is, K(x, z) and φ(x)^T φ(z) are equivalent), then the kernel matrix K obtained on the training set must be positive semi-definite.

So we have obtained a necessary condition for a valid kernel function:

K is a valid kernel function ==> the kernel matrix K is symmetric positive semi-definite.

Fortunately, this condition is also sufficient, as expressed by Mercer's theorem.

Mercer's theorem:

Let K: R^n × R^n → R (that is, K maps a pair of n-dimensional vectors to a real number). Then K is a valid kernel function (also called a Mercer kernel) if and only if, for any finite set of training samples {x^(1), ..., x^(m)}, the corresponding kernel matrix is symmetric positive semi-definite.

Mercer's theorem says that to show K is a valid kernel function we do not need to find φ explicitly; we only need to compute K(x^(i), x^(j)) for every pair of samples in the training set and then check whether the resulting matrix K is positive semi-definite (for example, by checking that all of its eigenvalues are nonnegative).
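
This check can be sketched in a few lines (my own illustration): build the kernel matrix on some training samples and verify that it is symmetric with nonnegative eigenvalues:

```python
import numpy as np

def kernel_matrix(X, kernel):
    """m x m Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

X = np.random.randn(5, 3)  # 5 samples, 3 features
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
K = kernel_matrix(X, rbf)

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0 (up to round-off)
```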

Many other textbooks prove Mercer's theorem using the concepts of norms and reproducing kernel Hilbert spaces, but the proof is equivalent in the case where the features are n-dimensional.

Kernel functions are not used only in SVMs. Whenever an inner product ⟨x, z⟩ appears in a model or algorithm, we can often replace it with K(x, z), and doing so may improve the algorithm considerably.
