Principle of Principal Component Analysis and Its Python Implementation


Preface:

This article is mainly based on Andrew Ng's machine learning course handout, which I have translated, together with a Python demo to deepen understanding.

This article introduces a dimensionality reduction algorithm, Principal Component Analysis (PCA). The goal of PCA is to find a subspace in which the data approximately lies; how to find this subspace is explained in detail below. PCA is more direct than many other dimensionality reduction algorithms: it only requires computing eigenvectors, which can be done easily with the eig() function in MATLAB, Python, or R.
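As the article notes, the core computation is just an eigen-decomposition. Below is a minimal NumPy sketch (a toy example of my own, not from the original article) of finding the principal direction of zero-mean data with eig():

import numpy as np

# toy 2-D data with one dominant direction (hypothetical example)
X = np.random.randn(500, 2) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)                    # zero-mean the data
cov = (X.T @ X) / X.shape[0]              # empirical covariance matrix
eigvals, eigvecs = np.linalg.eig(cov)     # the eig() call mentioned above
u1 = eigvecs[:, np.argmax(eigvals)]       # principal component: eigenvector with the largest eigenvalue
print(u1)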
Suppose we are given a dataset describing m different kinds of vehicles, where each data point records attributes such as the vehicle's maximum speed, turning radius, and so on, and each point is an n-dimensional vector with n << m. Unknown to us, however, two of the attributes both record the maximum speed, one in miles per hour and one in kilometers per hour. These two attributes are therefore almost perfectly linearly related, with small discrepancies caused only by rounding in the mph/kph conversion, so the data actually lies approximately in an (n-1)-dimensional subspace. How can we automatically detect and remove this redundancy?

To give a more concrete example, suppose we have a dataset obtained from a survey of radio-controlled (RC) helicopter pilots, where the first component measures the pilot's flying skill and the second component measures how much he or she enjoys flying. Because RC helicopters are very difficult to fly, only those who truly enjoy the process become good pilots, so these two attributes are strongly correlated. In fact, we can assume that the data is concentrated along some diagonal axis, which we can intuitively think of as a direction u1 capturing a person's "affinity" for flying (this intuition helps in understanding PCA). How do we automatically compute the direction u1?

Indeed, computing the direction u1 is exactly what the PCA algorithm does. But before running PCA on a dataset, we usually first preprocess the data to normalize its mean and variance. The preprocessing steps are as follows:
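The four steps below are restated from Ng's lecture notes, since the original figure listing them is not reproduced here:

1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.
3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \bigl(x_j^{(i)}\bigr)^2$.
4. Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.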

Steps 1-2 make the mean of the entire dataset zero; if the dataset already has zero mean, these two steps can be skipped. Steps 3-4 rescale each attribute so that it has unit variance, giving different attributes the same scale so that they are "treated equally". For example, if attribute x1 is a car's top speed in mph, its values typically range from the tens to over a hundred, while x2, the number of seats, typically ranges from 2 to 4; the two attributes must be renormalized to make them comparable. Of course, if we already know that all attributes share the same scale, steps 3-4 can be skipped, for example when each data point is a grayscale image and every pixel takes a value in {0, 1, ..., 255}.
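A minimal NumPy sketch of these four preprocessing steps (the variable names and toy values are my own, not from the original article):

import numpy as np

# toy data: top speed in mph, number of seats
X = np.array([[120.0, 2.0], [95.0, 4.0], [150.0, 2.0], [110.0, 5.0]])
mu = X.mean(axis=0)                              # steps 1-2: subtract the per-attribute mean
X_centered = X - mu
sigma = np.sqrt((X_centered ** 2).mean(axis=0))  # steps 3-4: divide by the per-attribute standard deviation
X_norm = X_centered / sigma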

Now that we have normalized the data, the next step is to compute the "major axis of variation" u, that is, the direction along which the data is mainly concentrated. One way to pose this problem is to find a unit vector u such that, when the data is projected onto this direction, the variance of the projected data is maximized. Intuitively, the dataset carries a certain amount of variance, and hence information; we want to choose a direction u so that the data can be approximated in the direction or subspace represented by u while preserving as much of that variance as possible.

Consider the following dataset, on which we have already carried out the normalization steps:


Now suppose we pick the direction u shown by the line, with the black points representing the projections of the original data onto this line:


We see that the projected data still has a fairly large variance, and the points lie far from the origin. Now consider another scenario, in which the chosen u points in this direction:

In this diagram, the variance of the projections is clearly much smaller, and the projected points lie much closer to the origin.
So we want an algorithm that selects the direction u corresponding to the first of the two cases above. To state this in precise mathematical language: given a unit vector u and a point x, the length of the projection of x onto u is xᵀu. In other words, if x(i) is a point in our dataset, its projection onto u (a black dot in the figure) lies at distance x(i)ᵀu from the origin. Hence, to maximize the variance of the projections, we need to pick a unit vector u that maximizes the following expression:
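The expression to maximize is restated below from Ng's lecture notes, since the original equation image is not reproduced here:

$$\frac{1}{m}\sum_{i=1}^{m}\bigl(x^{(i)T}u\bigr)^{2} \;=\; u^{T}\Bigl(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\Bigr)u, \qquad \text{subject to } \lVert u\rVert_{2}=1.$$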

It is easy to show that the maximizer of this expression is the principal eigenvector of the matrix Σ = (1/m) Σᵢ x(i)x(i)ᵀ, and by inspection we find that, since the data has been zero-meaned, Σ is exactly the empirical covariance matrix of the original dataset.
To summarize: if we want to find a one-dimensional subspace that approximates the dataset, we only need to compute the principal eigenvector u of Σ. Generalizing to higher dimensions, if we want to project our dataset into a k-dimensional subspace (k < n), we should take the k eigenvectors of Σ with the largest eigenvalues as u1, u2, ..., uk; these ui now form a new basis for the dataset. Then, to represent x(i) in this new basis, we only need to compute the corresponding vector y(i); the new data vector y(i) lives in a k-dimensional subspace, with k lower than n, and can approximate or entirely replace x(i). This is why PCA is also called a dimensionality reduction algorithm, and the vectors u1, u2, ..., uk are called the first k principal components of the dataset.
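Concretely, the new coordinates are (again restated from Ng's notes, since the original formula image is missing):

$$y^{(i)} = \begin{bmatrix} u_{1}^{T}x^{(i)} \\ u_{2}^{T}x^{(i)} \\ \vdots \\ u_{k}^{T}x^{(i)} \end{bmatrix} \in \mathbb{R}^{k}.$$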
PCA has many application scenarios. First of all, compression is an obvious use: we store y(i) instead of x(i). If we reduce a high-dimensional dataset to k = 2 or 3 dimensions, we can also plot the y(i) for data visualization; for example, if we reduce the vehicle dataset to two dimensions, we can plot the data in the coordinate plane (one point per vehicle type) and then observe which models are similar and which models form clusters.

Other standard uses include reducing the dimensionality of a dataset before running a supervised learning algorithm; besides cutting the computational overhead, reducing the dimension of the data also reduces the complexity of the hypothesis class and helps avoid overfitting.

Below is a Python implementation of PCA. To describe the algorithm visually, a 1000x2 matrix is used as the test dataset, reduced to 1 dimension, and the result is plotted. First, look at the first 10 data points of our test data file. (If you need this test dataset, it can be sent to you.)

We then run the PCA algorithm implemented in Python and plot the result; the code is as follows:

from numpy import *

def loadDataSet(fileName, delim='\t'):
    fr = open(fileName)
    stringArr = [line.strip().split(delim) for line in fr.readlines()]
    datArr = [list(map(float, line)) for line in stringArr]  # list() so it also works under Python 3
    return mat(datArr)

def pca(dataMat, topNfeat=9999999):
    meanVals = mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals                  # zero-mean the data
    covMat = cov(meanRemoved, rowvar=0)               # covariance matrix (columns are variables)
    eigVals, eigVects = linalg.eig(mat(covMat))       # eigen-decomposition
    eigValInd = argsort(eigVals)
    eigValInd = eigValInd[:-(topNfeat + 1):-1]        # indices of the topNfeat largest eigenvalues
    redEigVects = eigVects[:, eigValInd]              # corresponding eigenvectors
    print(meanRemoved)                                # debug output from the original code
    print(redEigVects)
    lowDDataMat = meanRemoved * redEigVects           # project the data into the lower-dimensional space
    reconMat = (lowDDataMat * redEigVects.T) + meanVals  # reconstruct in the original space
    return lowDDataMat, reconMat

def plotPCA(dataMat, reconMat):
    import matplotlib.pyplot as plt
    dataArr = array(dataMat)
    reconArr = array(reconMat)
    n1 = shape(dataArr)[0]
    n2 = shape(reconArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n1):
        xcord1.append(dataArr[i, 0]); ycord1.append(dataArr[i, 1])
    for i in range(n2):
        xcord2.append(reconArr[i, 0]); ycord2.append(reconArr[i, 1])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=90, c='red', marker='^')     # original data points
    ax.scatter(xcord2, ycord2, s=50, c='yellow', marker='o')  # data reconstructed from 1 principal component
    plt.title('PCA')
    plt.show()

dataMat = loadDataSet('testSet.txt')
lowDMat, reconMat = pca(dataMat, 1)
plotPCA(dataMat, reconMat)
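For comparison (this is not part of the original article), the same one-dimensional reduction can be obtained with scikit-learn, assuming testSet.txt is the tab-separated 1000x2 file used above:

import numpy as np
from sklearn.decomposition import PCA

X = np.loadtxt('testSet.txt', delimiter='\t')  # 1000 x 2 test matrix
model = PCA(n_components=1)
lowD = model.fit_transform(X)                  # data projected onto the first principal component
recon = model.inverse_transform(lowD)          # reconstruction back in the original 2-D space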

The final dimensionality reduction result is shown in the plot produced by the code above.

