Support Vector Machine (III): Kernel Functions


7. Kernel functions (Kernels)

Consider the problem we raised when discussing linear regression: the feature is the area $x$ of a house, and $y$ is its price. Suppose the distribution of the sample points suggests that $x$ and $y$ follow a cubic curve, so we want to fit the samples with a cubic polynomial in $x$. To do this, we first extend the feature $x$ to three dimensions, $x, x^2, x^3$, and then look for a model between these features and the result. This process is called feature mapping, and the mapping function is written $\phi$. In this example,

$$\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}$$
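As a concrete illustration, here is a minimal sketch of this mapping (the function name and sample value are mine, not from the original text):

```python
import numpy as np

def phi(x):
    """Map the scalar feature x (house area) to [x, x^2, x^3]."""
    return np.array([x, x**2, x**3])

# A 50 m^2 house: the single feature becomes a 3-dimensional feature vector,
# so a model linear in phi(x) is a cubic polynomial in x.
print(phi(50.0))  # -> 50, 2500, 125000
```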

We want to apply SVM classification to the mapped features $\phi(x)$ rather than to the initial features. To do this, we need to replace the inner products $\langle x, z \rangle$ in the preceding formulas with $\langle \phi(x), \phi(z) \rangle$.

As for why we compute with the mapped features instead of the original ones: the reason mentioned above (a better fit) is one; another important reason is that the examples may be linearly inseparable, and mapping the features to a high-dimensional space often makes them separable. (The chapter on support vector machines in Introduction to Data Mining by Pang-Ning Tan et al. gives a good example of this.)

We now define the kernel function formally. Given a feature mapping $\phi$, the kernel function corresponding to it is defined as

$$K(x, z) = \phi(x)^T \phi(z)$$

From this we can draw a conclusion: to achieve the effect described at the beginning of this section, we only need to compute $\phi(x)$ and $\phi(z)$ first and then take their inner product. However, this way of computing is very inefficient. For example, if the initial feature is n-dimensional and we map it to roughly $n^2$ dimensions, computing the inner product afterwards takes $O(n^2)$ time. Can we reduce this computation time?

Let's look at an example. Assume that $x$ and $z$ are both n-dimensional, and consider

$$K(x, z) = (x^T z)^2$$

After expansion, this becomes

$$K(x, z) = \left( \sum_{i=1}^{n} x_i z_i \right) \left( \sum_{j=1}^{n} x_j z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i x_j)(z_i z_j) = \phi(x)^T \phi(z)$$

In this case, we only need to compute the square of the inner product of the original features $x$ and $z$ (time complexity $O(n)$), and this equals the inner product of the mapped features, where $\phi(x)$ contains all $n^2$ products $x_i x_j$. That is to say, we do not need to spend $O(n^2)$ time computing the mapping explicitly.
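A quick numerical check of this equivalence (a sketch with made-up data; `phi_quad` is my name for the mapping whose entries are the products $x_i x_j$):

```python
import numpy as np

def phi_quad(x):
    """Explicit mapping for K(x, z) = (x^T z)^2: all n^2 products x_i * x_j."""
    return np.outer(x, x).ravel()           # O(n^2) time and space

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

direct = np.dot(x, z) ** 2                  # kernel trick: O(n)
mapped = np.dot(phi_quad(x), phi_quad(z))   # explicit mapping: O(n^2)
print(direct, mapped)                       # both equal 20.25
```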

Now let's take a look at the mapping function itself (for n = 3). Based on the formula above, we get

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}$$

That is to say, the kernel function $K(x, z) = (x^T z)^2$ is equivalent to the inner product of the mapped features only when this particular mapping function is chosen.

Let's look at another kernel function:

$$K(x, z) = (x^T z + c)^2 = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} \left(\sqrt{2c}\, x_i\right)\left(\sqrt{2c}\, z_i\right) + c^2$$

The corresponding mapping function (when n = 3) is

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix}$$

More generally, the kernel $K(x, z) = (x^T z + c)^d$ corresponds to a feature mapping into a space of dimension $\binom{n+d}{d}$. (For the counting argument, see http://zhidao.baidu.com/question/16706714.html.)
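The payoff of the kernel trick is visible in the numbers. A sketch of the general polynomial kernel and the size of the feature space it implicitly works in (the function names are mine):

```python
import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (x^T z + c)^d, computed in O(n) time."""
    return (np.dot(x, z) + c) ** d

def implicit_dim(n, d):
    """Dimension of the implicit feature space for (x^T z + c)^d: C(n + d, d)."""
    return comb(n + d, d)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z, c=1.0, d=2))   # (4.5 + 1)^2 = 30.25
print(implicit_dim(100, 5))            # ~96.5 million dimensions, never built
```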

Since what we compute is an inner product, we can think of cosine similarity in information retrieval: the smaller the angle between $x$ and $z$, the larger the kernel value, and vice versa. The kernel value can therefore be read as a measure of the similarity between $\phi(x)$ and $\phi(z)$.

Let's look at one more kernel function:

$$K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right)$$

Here, if $x$ and $z$ are very similar ($\lVert x - z \rVert \approx 0$), the kernel value is close to 1; if $x$ and $z$ differ greatly ($\lVert x - z \rVert$ is large), the kernel value is approximately 0. Because this function resembles the Gaussian distribution, it is called the Gaussian kernel, also known as the radial basis function (RBF) kernel. It maps the original features to an infinite-dimensional space.

Since the Gaussian kernel measures the similarity between $x$ and $z$ and maps it into (0, 1], we are reminded of logistic regression, where the sigmoid function does something similar; accordingly there is also a sigmoid kernel, among others.
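A sketch of the Gaussian (RBF) kernel just described, showing the two extremes of the similarity scale ($\sigma$ is the bandwidth parameter; the sample vectors are made up):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                        # identical vectors -> 1.0
print(rbf_kernel(x, np.array([10.0, -4.0])))   # distant vectors  -> ~0.0
```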

The following figure, taken from Eric Xing's slides, shows a case where the data is not linearly separable in the low-dimensional space but becomes separable after being mapped to a higher-dimensional space; the Gaussian kernel is used here.

[Figure: slides from Eric Xing]

Note: after introducing a kernel function, how do we classify a new sample? In the linear case, the SVM learns $w$ and $b$, and for a new sample $x$ we compute $w^T x + b$: if the value is greater than or equal to 1, $x$ belongs to the positive class; if less than or equal to -1, the negative class; values in between are considered uncertain. With a kernel, $x$ becomes $\phi(x)$. Do we need to compute $\phi(x)$ before prediction? The answer is definitely no; that would be very troublesome. Let's look back at what we derived before:

$$w^T \phi(x) + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b$$

We only need to replace each $\langle \phi(x^{(i)}), \phi(x) \rangle$ with $K(x^{(i)}, x)$.
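In code, the kernelized decision rule looks like this (a sketch; `alphas`, `b`, and the support vectors would come from training an SVM, which is outside this snippet, so the values below are toy stand-ins, not real training output):

```python
import numpy as np

def decision(x_new, support_x, support_y, alphas, b, kernel):
    """Compute w^T phi(x) + b as sum_i alpha_i * y_i * K(x_i, x) + b,
    so phi(x) itself is never formed."""
    s = sum(a * y * kernel(xi, x_new)
            for a, y, xi in zip(alphas, support_y, support_x))
    return s + b

# Toy values standing in for a trained model:
support_x = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [1.0, -1.0]
alphas, b = [0.5, 0.5], 0.0
kernel = lambda x, z: (np.dot(x, z) + 1.0) ** 2   # polynomial kernel

# Positive value -> positive class (here: 4.0)
print(decision(np.array([0.9, 1.1]), support_x, support_y, alphas, b, kernel))
```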

8. Determining the validity of a kernel function

Question: given a function K, can we use K in place of computing $\phi(x)^T \phi(z)$? That is to say, can we find a mapping $\phi$ such that $K(x, z) = \phi(x)^T \phi(z)$ for all $x$ and $z$?

For example, given some particular candidate function K, can we decide whether it is a valid kernel function without constructing $\phi$ explicitly?

Next we will work toward solving this problem. Given m training samples $\{x^{(1)}, \ldots, x^{(m)}\}$, each with its feature vector, we can plug any pair $x^{(i)}$ and $x^{(j)}$ into K and compute $K(x^{(i)}, x^{(j)})$. Letting i range from 1 to m and j range from 1 to m, we obtain an m × m kernel matrix (Kernel Matrix). For convenience, we denote both the kernel matrix and the kernel function by K.

If K is a valid kernel function, then by the definition of the kernel,

$$K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)}) = \phi(x^{(j)})^T \phi(x^{(i)}) = K(x^{(j)}, x^{(i)}) = K_{ji}$$

It can be seen that the matrix K is symmetric. Let's draw a stronger conclusion. First, let $\phi_k(x)$ denote the k-th coordinate of the mapping $\phi(x)$. For any vector $z$,

$$z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i \, \phi(x^{(i)})^T \phi(x^{(j)}) \, z_j = \sum_i \sum_j z_i \sum_k \phi_k(x^{(i)}) \, \phi_k(x^{(j)}) \, z_j = \sum_k \left( \sum_i z_i \, \phi_k(x^{(i)}) \right)^2 \ge 0$$

The last step uses the same rearrangement as the expansion of $(x^T z)^2$ earlier. From this formula we can see that if K is a valid kernel function (i.e., it equals an inner product of mapped features), then the kernel matrix K obtained on the training set must be positive semi-definite ($K \succeq 0$).

In this way, we obtain a necessary condition for a valid kernel function:

K is a valid kernel function ==> the kernel matrix K is symmetric positive semi-definite.

Fortunately, this condition is also sufficient, as stated by Mercer's theorem.

Mercer's theorem:

Let $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ (that is, K maps two n-dimensional vectors to a real number). Then K is a valid kernel function (also called a Mercer kernel) if and only if, for any finite set of training samples $\{x^{(1)}, \ldots, x^{(m)}\}$, the corresponding kernel matrix is symmetric positive semi-definite.

Mercer's theorem tells us that to show K is a valid kernel, we do not need to find $\phi$ explicitly. We only need to compute the kernel matrix on the training set and check whether it is positive semi-definite (for example, by checking that all of its principal minors are non-negative, or that all of its eigenvalues are non-negative; checking only the upper-left leading minors is not sufficient for semi-definiteness).
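A sketch of this check (checking eigenvalues is a convenient numerical test of positive semi-definiteness; the sample data and function names are made up):

```python
import numpy as np

def kernel_matrix(xs, kernel):
    """Build the m x m matrix with K_ij = kernel(x_i, x_j)."""
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)]
                     for i in range(m)])

def looks_valid(K_mat, tol=1e-10):
    """Necessary condition per Mercer: symmetric and positive semi-definite."""
    symmetric = np.allclose(K_mat, K_mat.T)
    psd = np.all(np.linalg.eigvalsh(K_mat) >= -tol)
    return symmetric and psd

xs = [np.random.randn(3) for _ in range(5)]
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
print(looks_valid(kernel_matrix(xs, rbf)))                 # True: RBF is a valid kernel
print(looks_valid(kernel_matrix(xs, lambda x, z: -1.0)))   # False: constant -1 matrix has a negative eigenvalue
```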

Many other textbooks invoke the concepts of norms and reproducing kernel Hilbert spaces in their proofs of Mercer's theorem, but the argument given here is equivalent when the feature mapping is finite-dimensional.

Finally, kernel functions are not used only in SVMs. Whenever an inner product $\langle x, z \rangle$ appears in a model or algorithm, we can replace it with $K(x, z)$, which may well improve the algorithm.
