Recall the optimization problem mentioned at the end of the previous article.
Solving it requires computing inner products between samples. When the input samples are not linearly separable, the approach we take is to map them into another, higher-dimensional space through a mapping function and separate them linearly there.
Take Coulomb's law as an example (http://zh.wikipedia.org/zh-cn/%E9%9D%99%E7%94%B5%E5%8A%9B ):
The magnitude of the electrostatic force with which one point charge acts on another point charge can be expressed by the following equation:

$F = k \frac{q_1 q_2}{r^2}$,

where $r$ is the distance between the two point charges and $k$ is the electrostatic (Coulomb) constant.
Obviously, this law cannot be expressed directly by a linear learner. Taking the natural logarithm (ln) of both sides of the original formula gives:

$\ln F = \ln k + \ln q_1 + \ln q_2 - 2 \ln r$
The following figure shows a linear learner:
This process can be described as follows: a mapping $\phi: X \rightarrow F$ sends every input sample $x$ to $\phi(x)$ in the feature space $F$.
In this way, the inner product changes from $\langle x, z \rangle$ in the input space to $\langle \phi(x), \phi(z) \rangle$ in the feature space.
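To make the log-linearization concrete, here is a minimal numpy sketch (the sample sizes and value ranges are arbitrary choices): it generates synthetic charge/distance data, maps every sample into log space, and recovers the exponents of Coulomb's law with an ordinary linear least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8.99e9  # Coulomb's constant (N*m^2/C^2)

# Synthetic (q1, q2, r) samples and the resulting force F = k*q1*q2/r^2
q1 = rng.uniform(1e-6, 1e-3, 200)
q2 = rng.uniform(1e-6, 1e-3, 200)
r = rng.uniform(0.01, 1.0, 200)
F = k * q1 * q2 / r**2

# In log space the law is linear: ln F = ln k + 1*ln q1 + 1*ln q2 - 2*ln r
X = np.column_stack([np.ones_like(q1), np.log(q1), np.log(q2), np.log(r)])
y = np.log(F)

# Ordinary least squares recovers the coefficients of the linear model
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)        # approximately [ln k, 1.0, 1.0, -2.0]
print(np.log(k))   # compare the intercept with ln k
```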
There are two ways to solve this problem:
1. First find the mapping explicitly, map the samples from the input space into the new space, and then compute the inner product in the new space;
2. Find a method that does not need to map the samples explicitly into the new space, but instead computes the inner product of the mapped samples directly in the input space.
First, let's look at the first method. Take a polynomial transformation as an example: for a two-dimensional input $x = (x_1, x_2)$, map it to the four-dimensional vector

$\phi(x) = (x_1 x_1,\ x_1 x_2,\ x_2 x_1,\ x_2 x_2)$.

That is to say, after mapping the input space from two dimensions to four dimensions, the samples change from linearly inseparable to linearly separable. But the direct problem caused by this transformation is that the dimension increases. This means, first, that the computation may become more complex; second, the curse of dimensionality may occur: for the learner, the dimension of the feature space may become too large to handle, and its generalization ability (the adaptability of the learner to data outside the training sample) decreases greatly as the dimension increases, which also violates "Occam's razor"; in the end, the inner product may become impossible to compute, and the advantage of the transformation is lost. A minimal sketch of this explicit-mapping route is given below.
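Here is a small numpy sketch of this first route, using the 2D-to-4D degree-2 mapping above (the sample values are arbitrary): both samples are mapped explicitly and the inner product is computed in the four-dimensional space.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping from R^2 to R^4 (method 1)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Method 1: map both samples into the 4-dimensional space,
# then compute the inner product there.
print(phi(x) @ phi(z))   # 16.0
```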
Now let's look at the second method. It is in fact an implicit mapping of the input space into the high-dimensional space: it does not require the mapping to be given explicitly, and the inner product can be computed directly in the input space. This is the legendary kernel function method:
Definition 1: A kernel is a function $K$ such that for all $x, z \in X$,

$K(x, z) = \langle \phi(x), \phi(z) \rangle$,

where $\phi$ is a mapping from the input space $X$ to an (inner product) feature space $F$.
The kernel therefore generalizes the standard inner product of the input space (which corresponds to taking $\phi$ to be the identity mapping).
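Continuing the two-dimensional example, here is a small sketch showing that the kernel $K(x, z) = \langle x, z \rangle^2$ returns exactly the inner product that the explicit four-dimensional mapping gives, without ever constructing $\phi(x)$:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping from R^2 to R^4."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

def poly2_kernel(x, z):
    """The corresponding kernel, computed entirely in the input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))     # 16.0  (explicit mapping, 4-dimensional space)
print(poly2_kernel(x, z))  # 16.0  (kernel function, 2-dimensional space)
```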
When is a function a kernel function?
Suppose the input space is a finite set $X = \{x_1, \dots, x_n\}$, where $n$ is the number of samples, and $K(x, z)$ is a symmetric function on $X$. Evaluating it on all pairs of samples gives the matrix

$\mathbf{K} = \big(K(x_i, x_j)\big)_{i,j=1}^{n}$.

This is obviously a symmetric matrix, so there exists an orthogonal matrix $V$ such that $\mathbf{K} = V \Lambda V^{T}$, where $\Lambda$ is the diagonal matrix containing the eigenvalues $\lambda_t$ of $\mathbf{K}$ and the columns $v_t$ of $V$ are the corresponding eigenvectors. Now map the input space as follows (assuming all eigenvalues are non-negative, so that the square roots exist):

$\phi(x_i) = \big(\sqrt{\lambda_t}\, v_{ti}\big)_{t=1}^{n}$,

where $v_{ti}$ is the $i$-th component of $v_t$. Therefore

$\langle \phi(x_i), \phi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = \big(V \Lambda V^{T}\big)_{ij} = \mathbf{K}_{ij} = K(x_i, x_j)$

($V$ is the matrix whose columns are the eigenvectors and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues); that is, $\mathbf{K}$ is the Gram matrix of the kernel corresponding to the mapping $\phi$.
Example: take

$\mathbf{K} = \left[\begin{array}{ccc} 4 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 3 \end{array}\right]$.

Its eigenvalues are 4 (with multiplicity 2) and 2. Finding a fundamental system of solutions for each eigenvalue, then orthogonalizing and normalizing, gives the eigenvectors

$V = (v_1, v_2, v_3) = \left[\begin{array}{ccc} 1 & 0 & 0 \\ 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{array}\right]$,

so that

$V^{-1} \mathbf{K} V = \Lambda = \left[\begin{array}{ccc} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2 \end{array}\right]$.

Mapping all input samples as above gives

$\phi(x_1) = (2, 0, 0)$;
$\phi(x_2) = (0, \sqrt{2}, -1)$;
$\phi(x_3) = (0, \sqrt{2}, 1)$.

Take the inner product of any pair of them, for example $\langle \phi(x_2), \phi(x_3) \rangle = 0 + 2 - 1 = 1 = K(x_2, x_3)$.
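The same construction can be checked numerically with a short numpy sketch (for the repeated eigenvalue 4 the eigenvectors are not unique, so the mapped coordinates may differ from the hand computation by a rotation, while all inner products stay the same):

```python
import numpy as np

# Gram matrix of the example
K = np.array([[4.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 3.0]])

# Symmetric matrix: orthogonal eigendecomposition K = V diag(lam) V^T
lam, V = np.linalg.eigh(K)
print(lam)                     # eigenvalues: [2. 4. 4.]

# Feature mapping phi(x_i) = (sqrt(lam_t) * v_{t,i})_t ; row i is phi(x_i)
Phi = V * np.sqrt(lam)
print(Phi)

# The pairwise inner products of the mapped samples reproduce K exactly
print(np.allclose(Phi @ Phi.T, K))   # True
```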
It can be seen that $\mathbf{K}$ is the Gram matrix of the kernel corresponding to the feature mapping $\phi$. This leads to the following conclusion:
Theorem 1: Let $X$ be a finite input space and let $K(x, z)$ be a symmetric function on $X$. Then $K$ is a kernel function, i.e. it corresponds to an implicit mapping $\phi$ from the input space to a feature space with $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, if and only if the matrix $\mathbf{K} = \big(K(x_i, x_j)\big)_{i,j=1}^{n}$ is positive semi-definite. The mapping constructed above is exactly such a $\phi$.
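This criterion translates directly into a numerical test (the helper name `is_kernel` and the two candidate functions are just illustrative choices):

```python
import numpy as np

def is_kernel(func, X, tol=1e-10):
    """Finite-case check: build the Gram matrix of `func` on the samples X
    and test whether it is (numerically) positive semi-definite."""
    n = len(X)
    G = np.array([[func(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.linalg.eigvalsh(G).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# The squared inner product passes the test ...
print(is_kernel(lambda x, z: (x @ z) ** 2, X))           # True
# ... while the Euclidean distance itself is not a kernel.
print(is_kernel(lambda x, z: np.linalg.norm(x - z), X))  # False
```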
Theorem 3 (Mercer): Let $X$ be a compact subset of $\mathbb{R}^n$ (a closed and bounded subset) and let $K(x, z)$ be a continuous symmetric function such that the integral operator $T_K: L_2(X) \rightarrow L_2(X)$,

$(T_K f)(\cdot) = \int_X K(\cdot, x) f(x)\, dx$,

is positive semi-definite, i.e. satisfies

$\int_{X \times X} K(x, z) f(x) f(z)\, dx\, dz \ge 0$ for all $f \in L_2(X)$.

Here $L_2(X)$ refers to the space composed of all square-integrable functions on $X$.
Then $K(x, z)$ can be expanded into a uniformly convergent series in terms of the eigenfunctions $\phi_j \in L_2(X)$ of $T_K$, normalized so that $\|\phi_j\|_{L_2} = 1$, with non-negative eigenvalues $\lambda_j \ge 0$:

$K(x, z) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(z)$.
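Mercer's expansion can be illustrated numerically by discretizing the integral operator on sampled points (the Gaussian kernel, the interval and the bandwidth below are arbitrary choices): the eigenvalues decay quickly, and the truncated eigen-expansion converges to the kernel matrix.

```python
import numpy as np

# Sample points from a compact interval and build the Gaussian Gram matrix,
# which discretizes the integral operator T_K of Mercer's theorem.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1.0, 1.0, 200))
sigma = 0.3
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

lam, V = np.linalg.eigh(K)      # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]  # reorder: largest eigenvalues first

# Truncate K ~ sum_{j<=m} lam_j * phi_j(x) * phi_j(z) after m terms
for m in (5, 15, 40):
    K_m = (V[:, :m] * lam[:m]) @ V[:, :m].T
    print(m, np.abs(K - K_m).max())   # the error shrinks rapidly with m
```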
There is a very important concept in kernel methods (recorded here for later reference):
Definition 2: Let $H$ be a Hilbert space whose elements are real-valued or complex-valued functions defined on an abstract set $E$. If there is a function $K(x, y)$ on $E \times E$ such that for every fixed $y \in E$ the function $K(\cdot, y)$ is an element of $H$, and for every $f \in H$ and every $y \in E$ the inner product satisfies

$\langle f(\cdot), K(\cdot, y) \rangle_H = f(y)$,

then $H$ is called a reproducing kernel Hilbert space (RKHS) and $K$ is called its reproducing kernel (RK).
Theorem 4: For every kernel $K$ defined on a domain $X \times X$, there exists a reproducing kernel Hilbert space of functions defined on $X$ for which $K$ is the reproducing kernel. Conversely, for any Hilbert space of functions on which evaluation is a bounded linear functional, a reproducing kernel exists.
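A small numerical illustration of the reproducing property, representing an element of the RKHS as a finite kernel expansion (the Gaussian kernel and the particular points and coefficients are arbitrary choices):

```python
import numpy as np

def k(x, y, sigma=0.5):
    """Gaussian kernel used as the reproducing kernel in this sketch."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

# Represent f in the RKHS as a finite expansion f(.) = sum_i a_i * K(., x_i)
xs = np.array([-1.0, 0.2, 0.9])
a = np.array([0.5, -1.3, 2.0])

def f(t):
    return np.sum(a * k(xs, t))

# For such expansions <f, K(., y)>_H = sum_i a_i * K(x_i, y), which by the
# reproducing property equals the point evaluation f(y).
y = 0.4
print(np.sum(a * k(xs, y)), f(y))   # the two numbers coincide
```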
Of course, new kernel functions can also be constructed from existing ones, and such constructions are often an effective way to obtain a suitable kernel:
Suppose $K_1$ and $K_2$ are kernels on $X \times X$, $a \in \mathbb{R}^{+}$, $f(\cdot)$ is a real-valued function on $X$, $\phi: X \rightarrow \mathbb{R}^{N}$ with $K_3$ a kernel on $\mathbb{R}^{N} \times \mathbb{R}^{N}$, and $B$ is a symmetric positive semi-definite $n \times n$ matrix. Then the following functions are all kernels (a numerical check of a few of them follows the list):
1. $K(x, z) = K_1(x, z) + K_2(x, z)$;
2. $K(x, z) = a K_1(x, z)$;
3. $K(x, z) = K_1(x, z) K_2(x, z)$;
4. $K(x, z) = f(x) f(z)$;
5. $K(x, z) = K_3(\phi(x), \phi(z))$;
6. $K(x, z) = x^{T} B z$.
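A quick numerical check of closure properties 1, 3 and 4 on random data (the particular base kernels and the function $f$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))

def min_eig(G):
    """Smallest eigenvalue of a Gram matrix: >= 0 (up to round-off) for kernels."""
    return np.linalg.eigvalsh((G + G.T) / 2).min()

# Two base kernels (linear and Gaussian) and an arbitrary real-valued function f
K1 = X @ X.T
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq_dist / 2.0)
f = np.sin(X[:, 0])

# Sum, elementwise product and f(x)f(z): all Gram matrices stay PSD
for G in (K1 + K2, K1 * K2, np.outer(f, f)):
    print(min_eig(G))   # all (numerically) non-negative
```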
The selection of the kernel is crucial for an SVM. After choosing a kernel $K$, the original dual problem becomes:

$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$,
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
Does this optimization problem have an optimal solution? Remember that the kernel must satisfy the Mercer condition, i.e. the matrix $\big(K(x_i, x_j)\big)$ is positive semi-definite on every training set. This makes the optimization a convex optimization problem, so the condition guarantees that the maximum-margin optimization problem has a unique solution; the kernel condition and the convexity of the problem are a perfect match. The maximum-margin hyperplane obtained after the implicit mapping from the input space to the feature space then comes out as

$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{n} y_i \alpha_i^{*} K(x_i, x) + b^{*}\Big)$,

and its geometric margin is

$\gamma = \frac{1}{\|w^{*}\|} = \Big(\sum_{i \in \text{support vectors}} \alpha_i^{*}\Big)^{-1/2}$.
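As a sanity check of the margin formula, here is a sketch using scikit-learn (an assumed dependency; a very large C approximates the hard-margin case) on a linearly separable toy set: the margin computed from $\|w^{*}\|$ agrees with $(\sum_{i} \alpha_i^{*})^{-1/2}$.

```python
import numpy as np
from sklearn.svm import SVC   # assumed dependency

# A small linearly separable toy problem
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2.0, 0.3, size=(20, 2)),
               rng.normal(-2.0, 0.3, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin SVM; linear kernel for simplicity
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()   # alpha_i^* of the support vectors
w = clf.coef_.ravel()                    # w^* (available for the linear kernel)

print(1.0 / np.linalg.norm(w))           # geometric margin 1 / ||w*||
print(1.0 / np.sqrt(alpha.sum()))        # (sum of alpha_i^*)^(-1/2)
```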
Common kernel functions are summarized as follows (simple implementations are sketched after the list):
Linear kernel function: $K(x, z) = \langle x, z \rangle$
Polynomial kernel function: $K(x, z) = (\langle x, z \rangle + c)^{d}$
Gaussian kernel function: $K(x, z) = \exp\big(-\frac{\|x - z\|^{2}}{2\sigma^{2}}\big)$
Several more kernel functions are collected at the following link:
http://www.shamoxia.com/html/y2010/2292.html
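A minimal sketch of the three standard kernels listed above (the parameter defaults are arbitrary choices):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```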
The theory behind kernel methods involves functional analysis, calculus, and so on. I recommend the book Kernel Methods for Pattern Analysis by John Shawe-Taylor and Nello Cristianini.