PCA (principal component analysis) is a common dimensionality reduction method. It aims to reduce the amount of computation by converting high-dimensional data to a lower dimension while losing as little "information" as possible, where "information" refers to the useful information contained in the data.
Main idea: construct from the original features a set of new features, ordered by "importance" from largest to smallest. Each new feature is a linear combination of the original features (equivalently, a projection of the original features onto some direction: a linear combination multiplies each feature by a coefficient, and projecting onto a direction is the inner product of the feature vector with that direction, so the two views are the same), and the new features are uncorrelated with each other.
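In symbols (a brief sketch using standard PCA notation, not from the original post): for a sample $x \in \mathbb{R}^d$ and unit direction vectors $w_1, \dots, w_k$, the new features are

$$ z_j = w_j^\top x = \sum_{i=1}^{d} w_{ji}\, x_i, \qquad j = 1, \dots, k, $$

ordered so that $\operatorname{Var}(z_1) \ge \operatorname{Var}(z_2) \ge \cdots$, with $\operatorname{Cov}(z_i, z_j) = 0$ for $i \ne j$.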
Therefore, the key questions are: 1. What is the "importance" of a feature, and how is it measured? 2. To how many dimensions should the original data be reduced so that little "information" is lost?
1. Measurement of the "importance" of data
The "importance" of the data refers to the variance of the sample after the feature transformation. The larger the variance, the greater the difference in the characteristics of the sample, and the more important the feature is. In the "Machine learning actual combat" in the figure description.
The figure contains three classes of data, and it is clear that the direction with the larger variance makes it easier to separate points of different classes. The variance of the samples projected onto the x-axis is large and the variance of the projection onto the y-axis is small, but the direction of maximum variance is the oblique direction running up through the middle of the figure. If the samples are projected onto that oblique direction, one-dimensional data is already enough to classify them, which amounts to reducing the original data by one dimension.
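As a rough numerical illustration of this point (a sketch with synthetic, made-up data, not from the original post): for 2-D data stretched along the diagonal, the variance of the projection onto the x-axis or y-axis is much smaller than the variance of the projection onto the top eigenvector of the covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# synthetic 2-D data stretched along the 45-degree diagonal
data = np.column_stack([base + 0.1 * rng.normal(size=500),
                        base + 0.1 * rng.normal(size=500)])

print(data[:, 0].var())    # variance of the projection onto the x-axis
print(data[:, 1].var())    # variance of the projection onto the y-axis

# the direction of maximum variance is the eigenvector of the covariance
# matrix with the largest eigenvalue
covmat = np.cov(data, rowvar=False)
eigvals, eigvects = np.linalg.eigh(covmat)
w = eigvects[:, -1]        # unit vector along the largest-variance direction
print((data @ w).var())    # roughly twice the variance of either axis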
When the original data has more dimensions, first find the direction along which the transformed data has the largest variance, then among the directions orthogonal to it select the one with the next largest variance, and so on, until as many new features as original features have been constructed, or until the first n features (which contain most of the information in the data) have been obtained. In short, PCA is a dimensionality-reduction process that maps the data onto new features, each new feature being a linear combination of the original features.
2. Calculation process (in the original post the formulas were inserted as images rather than typed out; they are not reproduced here)
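In place of the missing images, here is a sketch of the standard PCA computation steps, which the code below follows:

1. Center the data: subtract each column's mean, $\tilde{X} = X - \bar{x}$.
2. Compute the covariance matrix $\Sigma = \frac{1}{n-1}\tilde{X}^\top\tilde{X}$ of the centered data.
3. Compute the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and the corresponding eigenvectors $w_1, \dots, w_d$ of $\Sigma$.
4. Choose the smallest $k$ such that $(\lambda_1 + \cdots + \lambda_k) / (\lambda_1 + \cdots + \lambda_d)$ reaches the desired variance percentage.
5. Stack $w_1, \dots, w_k$ as the columns of $W$; the low-dimensional data is $Z = \tilde{X} W$, and the reconstruction in the original space is $\hat{X} = Z W^\top + \bar{x}$.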
3. Python implementation
from numpy import *

"""eigvalpct(eigvals, percentage) computes how many dimensions to keep for a given percentage of variance; the arguments are the eigenvalues eigvals and the percentage, and the number of dimensions to keep, num, is returned"""
def eigvalpct(eigvals, percentage):
    # use numpy's sort() to sort the eigenvalues from small to large
    sortarray = sort(eigvals)
    # reverse so the eigenvalues are arranged from large to small
    sortarray = sortarray[-1::-1]
    # total variance of the data
    arraysum = sum(sortarray)
    tempsum = 0
    num = 0
    for i in sortarray:
        tempsum += i
        num += 1
        if tempsum >= arraysum * percentage:
            return num

"""The pca function takes two arguments: datamat is the dataset already converted to a matrix, with columns representing features; percentage is the variance ratio the selected features must account for, defaulting to 0.9"""
def pca(datamat, percentage=0.9):
    # average each column, because the mean is subtracted when computing the covariance
    meanvals = mean(datamat, axis=0)
    meanremoved = datamat - meanvals
    # cov() computes the covariance matrix (rowvar=0 means each column is a feature)
    covmat = cov(meanremoved, rowvar=0)
    # use the eig() method in numpy's linalg module to find the eigenvalues and eigenvectors
    eigvals, eigvects = linalg.eig(mat(covmat))
    # to reach the variance percentage, the first k eigenvectors are required
    k = eigvalpct(eigvals, percentage)
    # sort the eigenvalues eigvals from small to large
    eigvalind = argsort(eigvals)
    # take the last k indices in reverse order, so the eigenvalues are arranged from large to small
    eigvalind = eigvalind[:-(k + 1):-1]
    # the eigenvectors (principal components) redeigvects corresponding to the sorted eigenvalues
    redeigvects = eigvects[:, eigvalind]
    # project the original data onto the principal components to get the new low-dimensional data lowddatamat
    lowddatamat = meanremoved * redeigvects
    # get the reconstructed data reconmat
    reconmat = (lowddatamat * redeigvects.T) + meanvals
    return lowddatamat, reconmat
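A quick usage sketch (with made-up random data, not from the original post); it relies on the from numpy import * at the top of the file for mat and random:

# reduce a 10 x 3 random data matrix while keeping at least 90% of the variance
datamat = mat(random.rand(10, 3))
lowddata, recon = pca(datamat, percentage=0.9)
print(lowddata.shape)   # (10, k) -- the reduced data, with k chosen by eigvalpct
print(recon.shape)      # (10, 3) -- the reconstruction in the original space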