[Machine Learning Notes] Introduction to PCA and Python implementations


PCA (principal component analysis) is a common dimensionality-reduction method. Its goal is to convert high-dimensional data to a lower dimension, reducing the amount of computation while losing as little "information" as possible. The "information" here refers to the useful information contained in the data.

Main idea: from the original features, construct a set of new features arranged by "importance" from largest to smallest. Each new feature is a linear combination of the original features (equivalently, a projection of the sample onto some direction: a linear combination multiplies each feature by a coefficient, and a projection onto a direction is the inner product of the sample with that direction, which amounts to the same thing), and the new features are mutually uncorrelated.
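
As a minimal sketch of this idea (the 2-D data and the direction below are made up for illustration and are not from the original notes), projecting a sample onto a direction is just an inner product, i.e. a linear combination of its features, and the features obtained from the eigenvectors of the covariance matrix come out uncorrelated:

import numpy as np

rng = np.random.default_rng(0)
# made-up 2-D data with two correlated features (purely illustrative)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])
data = data - data.mean(axis=0)                # center the data

w = np.array([0.6, 0.8])                       # an arbitrary unit-length direction
projection = data @ w                          # inner product = 0.6*feature1 + 0.8*feature2

cov = np.cov(data, rowvar=False)               # covariance matrix of the original features
eigvals, eigvecs = np.linalg.eigh(cov)         # its eigenvectors give the new directions
new_features = data @ eigvecs                  # each new feature is a linear combination of the old ones
print(np.round(np.cov(new_features, rowvar=False), 6))   # near-diagonal: the new features are uncorrelated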

The key points are therefore: 1. What is the "importance" of a feature, and how is it measured? 2. To how many dimensions should the original data be reduced so that little "information" is lost?

1. Measurement of the "importance" of data

The "importance" of the data refers to the variance of the sample after the feature transformation. The larger the variance, the greater the difference in the characteristics of the sample, and the more important the feature is. In the "Machine learning actual combat" in the figure description.

There are three classes of data in that figure, and it is clear that the larger the variance, the easier it is to separate points of different classes. The variance of the samples projected onto the x-axis is large and the variance projected onto the y-axis is small, but the direction of maximum variance is the oblique direction in the middle. If the samples are projected onto that oblique direction, one-dimensional data is enough to classify them, which amounts to reducing the original data by one dimension.
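
A small numerical sketch of this comparison (the data and directions are made up for illustration): the variance of the projection is largest along the oblique direction.

import numpy as np

rng = np.random.default_rng(1)
# made-up data stretched along the 45-degree diagonal (purely illustrative)
t = rng.normal(size=500)
data = np.column_stack([t + 0.2 * rng.normal(size=500),
                        t + 0.2 * rng.normal(size=500)])
data = data - data.mean(axis=0)

for name, direction in [("x-axis", np.array([1.0, 0.0])),
                        ("y-axis", np.array([0.0, 1.0])),
                        ("diagonal", np.array([1.0, 1.0]) / np.sqrt(2))]:
    projected = data @ direction
    print(name, projected.var())   # the diagonal projection has the largest variance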

When the original data has more dimensions, first find the direction along which the transformed data has the largest variance, then among the directions orthogonal to the first one pick the direction with the largest variance, then among the directions orthogonal to the first two pick the next, and so on, until as many new features as original features have been obtained, or until the first n features have been obtained (the first n features contain most of the data's variance). In short, PCA is a dimensionality-reduction process that maps the data to new features, where each new feature is a linear combination of the original features.
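
A short sketch of this successive choice of orthogonal directions, assuming made-up 3-D data: the eigenvectors of the covariance matrix give the directions, they are mutually orthogonal, and their variances (the eigenvalues) decrease from the first direction onward.

import numpy as np

rng = np.random.default_rng(2)
# made-up 3-D data with different spread along each axis (purely illustrative)
data = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
data = data - data.mean(axis=0)

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]              # reorder from largest to smallest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                                 # variance captured by each direction, decreasing
print(np.round(eigvecs.T @ eigvecs, 6))        # identity matrix: the directions are mutually orthogonal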

2. Calculation process (because inserting formulas is troublesome, the original notes gave the steps directly as an image, which is not reproduced here)
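
Since that image is missing, the standard textbook computation can be sketched as follows (the notation is an assumption, not taken from the original notes): given an n×d data matrix X whose rows are samples,

- Center the data: X_c = X - \mathbf{1}\bar{x}^{\top}, where \bar{x} is the vector of column means.
- Form the covariance matrix: C = \frac{1}{n-1} X_c^{\top} X_c.
- Solve the eigenproblem C v_i = \lambda_i v_i and sort so that \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d.
- Pick the smallest k with \left(\sum_{i=1}^{k} \lambda_i\right) / \left(\sum_{i=1}^{d} \lambda_i\right) \ge \text{percentage}.
- Project onto W_k = [v_1, \dots, v_k]: the low-dimensional data is Y = X_c W_k, and the reconstruction is \hat{X} = Y W_k^{\top} + \mathbf{1}\bar{x}^{\top}.

These steps correspond one-to-one with the Python implementation in the next section.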

3. Python implementation

from numpy import *

def eigValPct(eigVals, percentage):
    '''Calculate, from the percentage of variance to retain, how many dimensions
    to reduce the data to. The parameters are the eigenvalues eigVals and the
    percentage, and the number of dimensions num is returned.'''
    sortArray = sort(eigVals)            # use numpy's sort() to sort the eigenvalues from small to large
    sortArray = sortArray[-1::-1]        # reverse, so the eigenvalues run from large to small
    arraySum = sum(sortArray)            # total variance of the data
    tempSum = 0
    num = 0
    for i in sortArray:
        tempSum += i
        num += 1
        if tempSum >= arraySum * percentage:
            return num

def pca(dataMat, percentage=0.9):
    '''The pca function has two parameters: dataMat is the data set already
    converted to a matrix, with columns representing features; percentage is the
    share of the variance that the retained features must explain, 0.9 by default.'''
    meanVals = mean(dataMat, axis=0)             # mean of each column, subtracted because the covariance is computed on centered data
    meanRemoved = dataMat - meanVals
    covMat = cov(meanRemoved, rowvar=0)          # cov() computes the covariance matrix
    eigVals, eigVects = linalg.eig(mat(covMat))  # linalg.eig() in numpy finds the eigenvalues and eigenvectors
    k = eigValPct(eigVals, percentage)           # number of components k needed to reach the variance percentage
    eigValInd = argsort(eigVals)                 # indices that sort the eigenvalues from small to large
    eigValInd = eigValInd[:-(k+1):-1]            # take the last k indices, i.e. the eigenvalues from large to small
    redEigVects = eigVects[:, eigValInd]         # eigenvectors (principal components) for the k largest eigenvalues
    lowDDataMat = meanRemoved * redEigVects      # project the original data onto the principal components: the new low-dimensional data
    reconMat = (lowDDataMat * redEigVects.T) + meanVals   # reconstruct the data in the original space
    return lowDDataMat, reconMat
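
A quick usage sketch of the pca function above (the random data is made up for illustration; numpy is imported explicitly here so the snippet is self-contained):

import numpy as np

# made-up example: 100 samples with 5 features, two of them strongly correlated
raw = np.random.rand(100, 5)
raw[:, 1] = 0.9 * raw[:, 0] + 0.1 * raw[:, 1]
dataMat = np.mat(raw)

lowDDataMat, reconMat = pca(dataMat, percentage=0.9)
print(lowDDataMat.shape)    # (100, k): the reduced data
print(reconMat.shape)       # (100, 5): reconstruction in the original feature space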
