Many machine learning algorithms make one assumption: the input data is linearly separable. The perceptron algorithm is only guaranteed to converge if the data is perfectly linearly separable. Allowing for noise, Adaline, logistic regression, and SVM do not require the data to be perfectly linearly separable.
But a lot of real-world data is non-linear, and linear transformation methods such as PCA and LDA are not a good choice in that case. In this section we look at the kernelized version of PCA, kernel PCA. The "kernel" here is the same idea as in kernel SVM. Using kernel PCA, we can transform data that is not linearly separable onto a new, lower-dimensional feature subspace and then separate it with a linear classifier.
Kernel functions and the kernel trick
Remember that with kernel SVM we talked about solving nonlinear problems by mapping them onto a new, high-dimensional feature space in which the data becomes linearly separable. To map the data onto this higher, k-dimensional space, we define a nonlinear mapping function:

$$\phi: \mathbb{R}^d \rightarrow \mathbb{R}^k \quad (k \gg d)$$
We can think of the kernel function as creating nonlinear combinations of the original features in order to map the original d-dimensional dataset onto a k-dimensional feature space, with d < k. For example, for a feature vector x (a column vector containing d features) with d = 2, it could be mapped onto a 3-dimensional feature space according to the following rule:

$$x = [x_1, x_2]^T \;\rightarrow\; z = \left[x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right]^T$$
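As a concrete illustration of this rule, here is a minimal sketch (the function name phi_2d_to_3d is made up for this example):

```python
import numpy as np

def phi_2d_to_3d(x):
    """Map a 2-dimensional sample onto a 3-dimensional feature space
    using the rule z = [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([3.0, 2.0])
print(phi_2d_to_3d(x))   # [9.  8.485...  4.]
```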
Kernel PCA works in the same spirit: through a non-linear mapping it transforms the data into a higher-dimensional space, then applies standard PCA in that higher-dimensional space to re-map the data onto a space of lower dimension than the original one, where the problem can finally be solved with a linear classifier. However, this approach involves two mapping transformations and the computational cost is very high, which is what leads to the kernel trick.
Using the kernel trick, we can compute the similarity of two high-dimensional feature vectors directly in the original feature space (there is no need to map the features explicitly and then compute the similarity).
Before introducing the kernel trick, let's review the standard PCA approach. We compute the covariance between two features k and j according to the following formula:

$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$$
Since we have standardized the data, the feature means are 0 and the formula above is equivalent to:

$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$
More generally, we can write the covariance matrix as:

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n} x^{(i)} {x^{(i)}}^T$$
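For standardized data this is simply a matrix product; a minimal NumPy sketch (the dataset below is made up for illustration):

```python
import numpy as np

# illustrative standardized dataset: 100 samples, 2 features, zero mean
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
X = X - X.mean(axis=0)

n = X.shape[0]
cov = X.T.dot(X) / n   # Sigma = (1/n) * sum_i x_i x_i^T
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))   # True
```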
Bernhard Scholkopf (B. Scholkopf, A. Smola, and K. R. Muller. Kernel Principal Component Analysis. pages 583-588, 1997) generalized this formula, replacing the dot products between samples of the original dataset with the nonlinear feature combinations computed by $\phi$:

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n} \phi\!\left(x^{(i)}\right) \phi\!\left(x^{(i)}\right)^T$$
To obtain the eigenvectors (the principal components) from this covariance matrix, we have to solve the following equation:

$$\Sigma v = \lambda v \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} \phi\!\left(x^{(i)}\right)\phi\!\left(x^{(i)}\right)^T v = \lambda v$$
Here, $\lambda$ and $v$ are the eigenvalues and eigenvectors of the covariance matrix $\Sigma$; as we will see in the following paragraphs, they can be recovered by working with the kernel (similarity) matrix K instead.
The kernel matrix is derived as follows.
First, we write the covariance matrix in matrix notation, where $\phi(X)$ is an n×k matrix:

$$\Sigma = \frac{1}{n}\phi(X)^T\phi(X)$$
We can then write the eigenvector as a linear combination of the mapped samples:

$$v = \frac{1}{n\lambda}\phi(X)^T\phi(X)\,v = \phi(X)^T a, \quad \text{where } a = \frac{1}{n\lambda}\phi(X)\,v$$
Since $\Sigma v = \lambda v$, it follows that:

$$\frac{1}{n}\phi(X)^T\phi(X)\phi(X)^T a = \lambda\,\phi(X)^T a$$
Multiplying both sides of the equation by $\phi(X)$ on the left gives:

$$\frac{1}{n}\phi(X)\phi(X)^T\phi(X)\phi(X)^T a = \lambda\,\phi(X)\phi(X)^T a \;\Rightarrow\; \frac{1}{n}Ka = \lambda a$$
Here, K is the similarity (kernel) matrix:

$$K = \phi(X)\phi(X)^T$$
Recall that with kernel SVM we used the kernel trick to avoid computing the pairwise dot products of the mapped samples explicitly:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = \phi\!\left(x^{(i)}\right)^T\phi\!\left(x^{(j)}\right)$$
Likewise, with kernel PCA we do not need to construct a transformation matrix explicitly as in standard PCA; the kernel function replaces that computation. So you can think of the kernel function (or simply, the kernel) as a function that computes the dot product of two vectors, and the result can be regarded as a measure of the similarity of the two vectors.
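To see the trick at work with the degree-2 mapping from earlier: for that mapping, the dot product of two mapped vectors equals the squared dot product of the original vectors, so the similarity can be computed without ever calling the mapping. A minimal sketch (the function names are illustrative, not from the book):

```python
import numpy as np

def phi(x):
    # explicit degree-2 mapping: [x1^2, sqrt(2)*x1*x2, x2^2]
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, y):
    # kernel trick: phi(x) . phi(y) == (x . y)**2, computed in the original space
    return x.dot(y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(phi(x).dot(phi(y)))   # ~16.0 (up to floating-point rounding)
print(poly2_kernel(x, y))   # 16.0 -- same value, no explicit mapping needed
```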
The most commonly used kernel functions are:
- Polynomial kernel:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = \left({x^{(i)}}^T x^{(j)} + \theta\right)^p$$

where $\theta$ is the threshold and $p$ is the power, both set by the user.
- Hyperbolic tangent (sigmoid) kernel:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = \tanh\!\left(\eta\,{x^{(i)}}^T x^{(j)} + \theta\right)$$
- Radial basis function (RBF) or Gaussian kernel:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = \exp\!\left(-\frac{\left\|x^{(i)} - x^{(j)}\right\|^2}{2\sigma^2}\right) = \exp\!\left(-\gamma\left\|x^{(i)} - x^{(j)}\right\|^2\right), \quad \gamma = \frac{1}{2\sigma^2}$$
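As a quick illustration, here is a minimal sketch of the RBF kernel for a single pair of samples (the function name and the choice of gamma are illustrative):

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Gaussian/RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0]), gamma=0.5))   # ~0.082
```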
Let's now summarize the steps of kernel PCA, taking the RBF kernel as an example (a code sketch combining these steps follows below):
1. Compute the kernel (similarity) matrix K, that is, compute the following for every pair of training samples:

$$\kappa\!\left(x^{(i)}, x^{(j)}\right) = \exp\!\left(-\gamma\left\|x^{(i)} - x^{(j)}\right\|^2\right)$$

to obtain the matrix K:

$$K = \begin{bmatrix} \kappa\!\left(x^{(1)}, x^{(1)}\right) & \cdots & \kappa\!\left(x^{(1)}, x^{(n)}\right) \\ \vdots & \ddots & \vdots \\ \kappa\!\left(x^{(n)}, x^{(1)}\right) & \cdots & \kappa\!\left(x^{(n)}, x^{(n)}\right) \end{bmatrix}$$

For example, if the training set has 100 samples, the symmetric kernel matrix K has dimensions 100×100.
2. Center the kernel matrix K:

$$K' = K - 1_n K - K 1_n + 1_n K 1_n$$

where $1_n$ is an n×n matrix (n is the number of training samples) in which every element is equal to 1/n.
3. Compute the eigenvalues of the centered kernel matrix K' and collect the eigenvectors corresponding to the k largest eigenvalues. Unlike in standard PCA, these eigenvectors are not the principal component axes; they are the samples already projected onto those axes.
Why do we center K in step 2? In PCA we always work with standardized data, that is, data whose features have zero mean. When we replace the explicit dot products with the nonlinear feature combinations computed by the kernel, we never construct the new feature space explicitly, so we cannot guarantee that the features have zero mean there; that is why we have to center K.
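Putting the three steps together, a compact sketch of an RBF kernel PCA might look as follows (the function name rbf_kernel_pca and the parameters gamma and n_components are illustrative, not taken from the book):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_pca(X, gamma, n_components):
    """Project the samples in X onto the top principal components
    computed in the RBF-kernel feature space."""
    # Step 1: pairwise squared Euclidean distances -> RBF kernel matrix
    sq_dists = squareform(pdist(X, metric='sqeuclidean'))
    K = np.exp(-gamma * sq_dists)

    # Step 2: center the kernel matrix
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Step 3: eigendecomposition; eigh returns eigenvalues in ascending
    # order, so the eigenvectors of the largest eigenvalues come last.
    # These columns are already the projected samples, not projection axes.
    eigvals, eigvecs = eigh(K)
    return np.column_stack([eigvecs[:, -i] for i in range(1, n_components + 1)])

# usage sketch on made-up random data
X = np.random.RandomState(0).randn(100, 2)
X_kpca = rbf_kernel_pca(X, gamma=15.0, n_components=2)
print(X_kpca.shape)   # (100, 2)
```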