Dimensionality Reduction (I)----The Source of Principal Component Analysis (PCA)
Dimensionality Reduction Series:
- Dimensionality Reduction (I)----The Source of Principal Component Analysis (PCA)
- Dimensionality Reduction (II)----Laplacian Eigenmaps
---------------------
Principal Component Analysis (PCA) is introduced in many tutorials, but why are the principal components of the data obtained from an eigenvalue decomposition of the covariance matrix? Why are the covariance matrix and its eigenvalues so magical? I had never really figured that out. Today I finally sorted out the whole derivation, both to aid my own learning and to share it with you.
Take two-dimensional features as an example. There may be a linear relationship between the two features (for instance, the two features might be movement speed and the distance moved per second, which are essentially the same quantity), which makes the second dimension redundant. The goal of PCA is to detect such linear relationships between the features, identify them, and remove them.
Take two-dimensional features again. There may not be a perfectly linear relationship between the features; perhaps there is only a strong positive correlation. If the x-y coordinate system is decomposed into a u1-u2 coordinate system, where the u1 axis reflects the main (intrinsic) variation of the features and the variation along u2 is small, then the u2 component can be regarded simply as noise and ignored. The task of PCA is to find u1 and u2.
Preprocessing: the features in each dimension are mean-centered and variance-normalized.
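As a concrete illustration (a minimal NumPy sketch with toy data of my own, not code from the original post), the preprocessing step might look like this:

```python
import numpy as np

# Minimal sketch of PCA preprocessing, assuming X is an (m, n) data matrix
# with m samples (rows) and n features (columns). Toy data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated features

X_centered = X - X.mean(axis=0)                 # mean-center each feature
X_scaled = X_centered / X_centered.std(axis=0)  # variance-normalize each feature
```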
- The mathematical goal of PCA:
The main direction of the features is the direction along which their values vary the most (the "major axis of variation"); it is important to understand this point. Seen from the other side, the direction of smallest variation is one along which the values hardly change at all, so it contributes the least and can be ignored. To find the direction of largest variation, suppose the unit direction vector is u; then the projection of a feature point x onto the u direction lies at a distance d = x^T u from the origin (the first version mistakenly wrote d = x; thanks to @zsfcg's comment, this has now been corrected). After all the sample points are projected onto one direction, they lie on the same line, and to compare how much they vary we only need to compare the variance of d along that line. The direction u with the largest variance is the main direction we are looking for (that is, the goal of PCA is to find the directions of large variance). Our objective function therefore becomes:
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^{2}=u^{T}\left(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\right)u \qquad (1)$$
Here the superscript (i) on x denotes the i-th sample in the dataset, and m is the total number of samples. (Because x has already been centered, the mean of x^T u is also 0, so the average of the squared values of x^T u is exactly the variance.)
The expression in parentheses is none other than the covariance matrix Σ! Now we finally know where the covariance matrix comes from. Looking at the equation above, the covariance matrix does not depend on the projection direction, only on the samples in the dataset, so the covariance matrix completely determines the distribution and variation of the data (be careful to distinguish it from the autocorrelation matrix).
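To make equation (1) concrete, here is a small numerical check (toy data and variable names are my own, not from the post) that the variance of the projections x^T u equals u^T Σ u:

```python
import numpy as np

# Numerical check of equation (1): for centered data, the variance of the
# projections d_i = x_i^T u equals u^T Sigma u for any unit direction u.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                     # centered data
m = X.shape[0]

Sigma = X.T @ X / m                        # covariance matrix (1/m) * sum_i x_i x_i^T

u = np.array([0.6, 0.8])                   # an arbitrary unit direction (||u|| = 1)
d = X @ u                                  # projections d_i = x_i^T u
print(np.isclose(d.var(), u @ Sigma @ u))  # True
```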
Adding the constraint that u is a unit vector, the objective function becomes:
$$\max_{u}\; u^{T}\Sigma u \qquad \text{s.t. } u^{T}u=1 \qquad (2)$$
Solving this maximization problem with the method of Lagrange multipliers (take the gradient of u^T Σ u - λ(u^T u - 1) with respect to u and set it to zero), it is easy to obtain:
$$\Sigma u=\lambda u \qquad (3)$$
See that? u is an eigenvector of Σ, and λ is the corresponding eigenvalue. Substituting (3) back into (2), the objective function becomes
$$u^{T}\Sigma u=\lambda u^{T}u=\lambda \qquad (4)$$
Evidently, the variance along u is measured directly by an eigenvalue of the covariance matrix (and the total variance by its trace): the largest eigenvalue λ, together with its eigenvector u, determines the direction in which the data varies most, and u is that unit direction. The PCA procedure is therefore an eigenvalue decomposition of the covariance matrix followed by picking out the largest eigenvalues.
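A minimal sketch of this step (my own toy example, not from the original post): eigendecompose the covariance matrix and take the eigenvector of the largest eigenvalue as the main direction.

```python
import numpy as np

# Sketch: the principal direction is the eigenvector of the covariance matrix
# with the largest eigenvalue, and the variance along it equals that eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X = X - X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigh: Sigma is symmetric
u1 = eigvecs[:, np.argmax(eigvals)]               # direction of maximum variance

print(np.isclose((X @ u1).var(), eigvals.max()))  # True, as in equation (4)
```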
The remaining step is to take the eigenvectors corresponding to the largest k eigenvalues as a new basis and project the original features onto it; the projections are the new, reduced-dimension features. Many tutorials explain this step very clearly, so I will not repeat it.
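A sketch of that projection step (again my own toy data and variable names): keep the eigenvectors of the k largest eigenvalues as the new basis and project the centered data onto them.

```python
import numpy as np

# Sketch of the reduction step: project the centered data onto the k
# eigenvectors with the largest eigenvalues (toy data with one nearly
# redundant feature, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)   # nearly redundant feature
X = X - X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]                 # largest eigenvalues first

k = 2
W = eigvecs[:, order[:k]]                         # new basis (n, k)
X_reduced = X @ W                                 # reduced features (m, k)
print(X_reduced.shape)                            # (200, 2)
```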
But I would like to add a note on the meaning of a matrix and its eigenvalues. A matrix should be understood as a transformation of a space (a mapping from one space to another). If a matrix M has dimension m x n, then when m = n the transformation does not change the dimension of the space, and when n < m it reduces the dimension. Consider the square case m = n. Eigendecomposing M yields a new set of basis vectors, which can be understood as a new set of coordinate axes for the m-dimensional space (introduced because this basis describes R^m more efficiently), and the corresponding eigenvalue measures how strongly a projection onto that basis direction responds. If an eigenvalue is 0, there is no response along that dimension at all: no matter how large the input, multiplying by 0 gives 0. If an eigenvalue is close to 0, anything projected onto that dimension becomes much smaller. From the point of view of dimensionality reduction, such a dimension can therefore be ignored (which is exactly why PCA keeps only the first k eigenvalues). Singular value decomposition works on the same principle.
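As a small check of that last remark (a sketch with toy data of my own): the right singular vectors of the centered data matrix are the eigenvectors of its covariance matrix, and the squared singular values divided by m are the eigenvalues.

```python
import numpy as np

# Relating SVD to the eigendecomposition of the covariance matrix: for
# centered X, Sigma = X^T X / m, so eigenvalues(Sigma) = singular_values(X)^2 / m.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)
m = X.shape[0]

Sigma = X.T @ X / m
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # descending order

_, s, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(s**2 / m, eigvals))                # True
```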
PCA is actually one of the simplest dimensionality reduction methods; its obvious limitation is that it only removes linear correlations between features. A common improvement is to extend it to the nonlinear case through the kernel trick. Moreover, the dimensionality reduction performed by PCA does not necessarily help classification; a dimensionality reduction method designed for classification is LDA. On the other hand, PCA is a linear projection that preserves Euclidean distances between data points: two points that are far apart in the original space should remain far apart in the reduced space (this is what making the variance large achieves). In reality, however, the data may lie on some kind of manifold, and after PCA reduces the dimension this manifold structure cannot be preserved. Commonly used nonlinear dimensionality reduction methods that address this are locally linear embedding (LLE) and Laplacian eigenmaps.
Another way to derive PCA is to minimize the loss after projection (viewing dimensionality reduction as compression and minimizing the error after compression); this article also gives a detailed introduction to that view, so I will not say much about it either.
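As a small numerical illustration of that equivalent view (toy data, my own sketch): projecting onto the top-k eigenvectors and reconstructing gives a mean squared error equal to the sum of the discarded eigenvalues, which is exactly what keeping the largest eigenvalues minimizes.

```python
import numpy as np

# Reconstruction-error view of PCA: the average squared reconstruction error
# after keeping k components equals the sum of the discarded eigenvalues.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
m = X.shape[0]

Sigma = X.T @ X / m
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]

k = 2
W = eigvecs[:, order[:k]]                           # top-k eigenvectors
X_rec = (X @ W) @ W.T                               # project and reconstruct
mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))     # per-sample squared error, averaged
print(np.isclose(mse, eigvals[order[k:]].sum()))    # True
```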
Having written this far, I realize that I have not said much of my own; mostly I have provided links to various references.
In addition, for a deeper understanding of eigenvalues and eigenvectors, see this article.
--------------------
jiang1st2010
Original address: http://blog.csdn.net/jiang1st2010/article/details/8935219