PCA Dimension Reduction
-- Maximum variance interpretation (PCA from a linear algebra perspective)
Note: compiled from material found online; discussion and corrections are welcome.
The complexity of many machine learning algorithms is closely tied to the dimensionality of the data, and can even grow exponentially with the number of dimensions, so we need to reduce the dimensionality of the data.
Dimensionality reduction inevitably means losing some information, but because the dimensions of real data are usually correlated with one another, we can look for ways to keep that loss as small as possible while reducing the dimensionality.
PCA is a widely used dimensionality reduction method with a rigorous mathematical foundation.
Covariance matrix and the optimization objective
Suppose we have a set of two-dimensional data points. If we must use a single dimension to represent this data while preserving as much of the original information as possible, how should we choose?
From the discussion of basis transformations in the previous section, we know that the problem amounts to choosing a direction in the two-dimensional plane, projecting all the data onto the line along that direction, and using the projected values to represent the original data. This turns an actual two-dimensional problem into a one-dimensional one.
So how do we choose this direction (or basis) so as to retain as much of the original information as possible? An intuitive criterion is that we want the projected values to be as spread out as possible.
Variance
As we said above, we want the projected values to be as spread out as possible, and this spread can be expressed mathematically by the variance. Formally: we look for a one-dimensional basis such that, when all the data is transformed to coordinates on this basis, the variance of those coordinates is maximized.
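To make this concrete, here is a minimal NumPy sketch (the random data and the helper name projection_variance are my own, not part of the original text) that measures the variance of the data after projecting it onto a candidate unit direction; a direction with larger projected variance preserves more of the original spread.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))          # 2 x m data matrix, one record per column
X = X - X.mean(axis=1, keepdims=True)  # center each field to zero mean

def projection_variance(X, w):
    """Variance of the data projected onto the unit vector w."""
    w = w / np.linalg.norm(w)
    proj = w @ X                       # one-dimensional projected coordinates
    return proj.var()

# Compare two candidate directions: the one with the larger projected
# variance keeps the data more spread out after the reduction to 1-D.
print(projection_variance(X, np.array([1.0, 0.0])))
print(projection_variance(X, np.array([1.0, 1.0])))
```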
Covariance
When we go beyond one dimension, picking directions of maximal variance alone is not enough: the next direction chosen this way would almost coincide with the first and carry redundant information. We therefore also want the new fields to be unrelated to each other, and the covariance between two fields measures exactly this linear relationship. When the covariance is 0, the two fields are uncorrelated. To make the covariance 0, the second basis vector can only be chosen among directions orthogonal to the first. Therefore the two directions we finally select must be orthogonal.
At this point, we have obtained the optimization objective for dimensionality reduction: to reduce a set of n-dimensional vectors to k dimensions (0 < k < n), choose k unit (modulus 1) orthogonal basis vectors such that, after the original data is transformed onto this basis, the pairwise covariance between the fields is 0 and the variance of each field is as large as possible (under the orthogonality constraint, take the k directions of largest variance).
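As a compact restatement in symbols (the notation below is introduced here and is not in the original text), the objective can be written as:

```latex
% Reduce n-dimensional data to k dimensions: choose k orthonormal basis
% vectors w_1, ..., w_k (taken as rows) so that the transformed fields are
% pairwise uncorrelated and their variances are as large as possible.
\[
  \max_{w_1,\dots,w_k} \; \sum_{i=1}^{k} \operatorname{Var}(w_i X)
  \quad \text{s.t.} \quad
  w_i w_j^{\mathsf{T}} = \delta_{ij}, \qquad
  \operatorname{Cov}(w_i X,\, w_j X) = 0 \ \ (i \neq j),
\]
% where X is the centered n-by-m data matrix and w_i X is the i-th field of
% the data after the basis transformation.
```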
Covariance matrix
We now work out a computational scheme for this objective mathematically.
Suppose we have m n-dimensional data records, arranged by column into an n-by-m matrix X, with each field centered to zero mean. Let C = (1/m)XXᵀ. Then C is a symmetric matrix: its diagonal elements are the variances of the individual fields, and the element in row i, column j (equal to the element in row j, column i) is the covariance between fields i and j.
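A small NumPy sketch of this construction (the example data is assumed, not from the text): build C = (1/m)XXᵀ from centered data and check that it is symmetric and that its diagonal holds the field variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))          # n = 3 fields, m = 500 records, one per column
X = X - X.mean(axis=1, keepdims=True)  # center each field so the formula applies

m = X.shape[1]
C = (X @ X.T) / m                      # n x n covariance matrix

# C is symmetric; its diagonal contains the variance of each field, and
# element (i, j) is the covariance between fields i and j.
print(np.allclose(C, C.T))
print(np.allclose(np.diag(C), X.var(axis=1)))
```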
Diagonalization of covariance matrices
Based on the above derivation, we find that reaching the optimization objective is equivalent to diagonalizing the covariance matrix: that is, making every element off the diagonal 0 and arranging the diagonal elements from largest to smallest from top to bottom. This may not be obvious yet, so let us look further at the relationship between the covariance matrix of the original data and that of the data after the basis transformation:
Let the original data matrix X have covariance matrix C, and let P be a matrix whose rows are a set of basis vectors. Set Y = PX; then Y is the data obtained by transforming X onto the basis P. Let the covariance matrix of Y be D. We now derive the relationship between D and C:
D = (1/m)YYᵀ = (1/m)(PX)(PX)ᵀ = (1/m)PXXᵀPᵀ = P((1/m)XXᵀ)Pᵀ = PCPᵀ
Now things are clear! The P we are looking for is nothing other than a P that diagonalizes the original covariance matrix. In other words, the optimization objective becomes: find a matrix P such that PCPᵀ is a diagonal matrix whose diagonal elements are arranged from largest to smallest. Then the first k rows of P are the basis we are looking for, and multiplying the matrix formed by the first k rows of P by X reduces X from n dimensions to k dimensions while satisfying the optimization conditions above.
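The derivation can be checked numerically; the following sketch (example data assumed, not from the text) confirms that the covariance matrix of Y = PX computed directly coincides with PCPᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X = X - X.mean(axis=1, keepdims=True)  # centered original data
m = X.shape[1]
C = (X @ X.T) / m                      # covariance matrix of X

P = rng.normal(size=(2, 3))            # any basis-by-rows matrix
Y = P @ X                              # data expressed in the new basis
D_direct = (Y @ Y.T) / m               # covariance of Y computed directly
D_formula = P @ C @ P.T                # covariance of Y via D = P C P^T
print(np.allclose(D_direct, D_formula))
```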
At this point, we are only one step away from "inventing" PCA!
Now all the focus is on diagonalizing the covariance matrix, and here we should be grateful to the mathematicians who came before us: diagonalizing a real symmetric matrix is a thoroughly worked-out topic in linear algebra, so this is not a problem mathematically at all.
As we saw above, the covariance matrix C is a real symmetric matrix, and in linear algebra real symmetric matrices have a series of very nice properties:
1) Eigenvectors of a real symmetric matrix corresponding to different eigenvalues are necessarily orthogonal.
2) If an eigenvalue λ has multiplicity r, there necessarily exist r linearly independent eigenvectors corresponding to λ, and these r eigenvectors can be made orthonormal.
From these two properties it follows that an n-by-n real symmetric matrix always has n orthonormal eigenvectors. Denote them e1, e2, ..., en and arrange them by column into a matrix:
E = (e1 e2 ... en)
Then for the covariance matrix C we have the following conclusion:
EᵀCE = Λ
where Λ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to the eigenvectors (possibly with repetitions).
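A quick numerical illustration of this conclusion (example data assumed, not from the text): NumPy's symmetric eigensolver returns orthonormal eigenvectors as the columns of E, and EᵀCE comes out diagonal with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)
C = (X @ X.T) / X.shape[1]             # real symmetric covariance matrix

eigvals, E = np.linalg.eigh(C)         # columns of E are orthonormal eigenvectors
Lambda = E.T @ C @ E                   # should equal diag(eigvals)
print(np.allclose(Lambda, np.diag(eigvals)))
print(np.allclose(E.T @ E, np.eye(4))) # E is orthogonal
```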
Here we find that we have found the matrix P we need: P = Eᵀ.
P is the matrix obtained by arranging the eigenvectors of the covariance matrix as rows, each row being an eigenvector of C. If the rows of P are ordered so that the corresponding eigenvalues in Λ go from largest to smallest (top to bottom), then multiplying the matrix formed by the first k rows of P by the original data matrix X gives the dimensionally reduced data matrix Y that we want.
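Putting the whole procedure together, here is a minimal sketch of the PCA described above (the function and variable names are my own): center the data, build the covariance matrix, diagonalize it, stack the eigenvectors as the rows of P ordered by eigenvalue from largest to smallest, and keep the first k rows.

```python
import numpy as np

def pca(X, k):
    """Reduce an n x m data matrix X (one record per column) to k x m."""
    X = X - X.mean(axis=1, keepdims=True)     # center each field
    m = X.shape[1]
    C = (X @ X.T) / m                         # covariance matrix C = (1/m) X X^T
    eigvals, E = np.linalg.eigh(C)            # columns of E are eigenvectors of C
    order = np.argsort(eigvals)[::-1]         # eigenvalues from largest to smallest
    P = E[:, order].T                         # P = E^T, eigenvectors as rows
    return P[:k] @ X                          # first k rows of P times X

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 100))                 # 5-dimensional data, 100 records
Y = pca(X, 2)                                 # reduced to 2 dimensions
print(Y.shape)                                # (2, 100)
```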
Further discussion
PCA essentially takes the directions of largest variance as the main features and "decorrelates" the data along the different orthogonal directions, that is, it makes the data uncorrelated across those directions.
PCA therefore has some limitations: for example, it removes linear correlation very well, but it can do nothing about higher-order dependence. For data with higher-order dependence, one can consider kernel PCA, which uses a kernel function to turn nonlinear dependence into linear correlation; we do not discuss this here.
In addition, PCA assumes that the main features of the data lie along orthogonal directions; if several high-variance directions are not orthogonal to each other, the effectiveness of PCA is greatly reduced.
PCA is an unsupervised learning method.