[Machine Learning Algorithm Implementation] Principal Component Analysis (PCA) Based on Python + NumPy



@ Author: wepon
@ Blog: http://blog.csdn.net/u012162613/article/details/42177327


1. Introduction to the PCA Algorithm

Principal Component Analysis (PCA) is a dimensionality reduction technique used for data preprocessing. The raw data we obtain is usually very high-dimensional. For example, a data set may have 1000 features, but those 1000 features may contain a lot of useless information or noise, and perhaps only 100 of them are truly useful. In that case we can use the PCA algorithm to reduce the 1000 features to 100 features. This not only removes unnecessary noise but also reduces the amount of computation.

How is the PCA algorithm implemented?

Simply put, the data is transformed from the original space into a new feature space. For example, suppose the original space is three-dimensional (x, y, z), where x, y, and z are the three basis vectors of the original space. We can find a way to represent the original data in a new coordinate system (a, b, c), where a, b, and c are the new basis vectors forming a new feature space. In the new feature space, the projection of all the data onto c may be close to 0 and can be ignored. We can then represent the data using only (a, b), and the data has been reduced from three dimensions (x, y, z) to two dimensions (a, b).
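To make the change-of-basis idea concrete, here is a minimal added sketch (not from the original article; the data and the basis vectors a, b, c are chosen by hand rather than found by PCA):

import numpy as np

# Made-up 3D points that, up to small noise, lie in the plane z = x + y.
rng = np.random.RandomState(0)
x = rng.randn(5)
y = rng.randn(5)
z = x + y + 0.01 * rng.randn(5)          # almost no variation off the plane
data = np.column_stack([x, y, z])        # one row per sample, columns are (x, y, z)

# A hand-picked orthonormal basis (a, b, c), with c perpendicular to the plane.
a = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
b = np.array([-1.0, 2.0, 1.0]) / np.sqrt(6)
c = np.array([1.0, 1.0, -1.0]) / np.sqrt(3)
basis = np.column_stack([a, b, c])

coords = data @ basis                    # coordinates of each sample in the new basis
print(coords[:, 2])                      # the c-coordinates are close to 0, so (a, b) suffices

Because almost all of the variation lies in the (a, b) plane, dropping the c coordinate loses very little information; PCA finds such a basis automatically.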

The question is: how do we obtain the new basis (a, b, c)?

The general steps are as follows: first, zero-mean the original data; then compute the covariance matrix; then compute the eigenvalues and eigenvectors of the covariance matrix. These eigenvectors form the new feature space. For details, I recommend Andrew Ng's web tutorial: UFLDL, Principal Component Analysis, which is very thorough.


2. PCA Algorithm Implementation (Language: Python; Library: NumPy)
>>> import numpy as np

Following the general steps above, we now implement the PCA algorithm. (1) Zero-mean. Suppose the original data set is a matrix dataMat, in which each row is a sample and each column is a feature. Zero-meaning means computing the mean of each column and subtracting that mean from every value in the column. In other words, the zero-mean step is applied per feature, so that the mean of each feature becomes 0. The implementation code is as follows:
def zeroMean(dataMat):
    meanVal = np.mean(dataMat, axis=0)  # compute the mean of each column, i.e. the mean of each feature
    newData = dataMat - meanVal
    return newData, meanVal

In this function, numpy's mean method computes the mean, and axis=0 means the mean is taken by column. The function returns two values: newData, the zero-mean data, and meanVal, the mean of each feature, which is needed later to reconstruct the data.
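As a quick sanity check (an illustrative addition, assuming numpy has been imported as np and zeroMean is defined as above; the numbers are made up):

demo = np.array([[1.0,  2.0,  3.0],
                 [3.0,  6.0,  9.0],
                 [5.0, 10.0, 15.0]])
newData, meanVal = zeroMean(demo)
print(meanVal)               # [ 3.  6.  9.] -- the mean of each feature (column)
print(newData.mean(axis=0))  # [ 0.  0.  0.] -- each column now averages to zero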
(2) Compute the covariance matrix.
newData, meanVal = zeroMean(dataMat)
covMat = np.cov(newData, rowvar=0)

numpy's cov function computes the covariance matrix. The rowvar parameter matters here: if rowvar is 0, each row of the input is treated as a sample; otherwise, each column is treated as a sample. Since each row of newData is a sample, rowvar is set to 0. covMat is the resulting covariance matrix.
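A small illustration of what rowvar=0 means (an added sketch with made-up data, not part of the original post): for zero-mean data with one sample per row, the covariance matrix equals X.T dot X divided by (m - 1).

m = 6
X = np.random.randn(m, 3)
X = X - X.mean(axis=0)                                               # zero-mean each feature (column)
print(np.allclose(np.cov(X, rowvar=0), np.dot(X.T, X) / (m - 1)))   # True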
(3) Compute the eigenvalues and eigenvectors. Call the eig function of numpy's linear algebra module linalg to obtain the eigenvalues and eigenvectors of covMat directly:
eigVals,eigVects=np.linalg.eig(np.mat(covMat))

eigVals holds the eigenvalues as a row vector; eigVects holds the eigenvectors, one per column. The eigenvalues correspond one-to-one to the eigenvector columns.
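One way to convince yourself of this pairing (an added check, assuming covMat, eigVals, and eigVects from the steps above): each column v of eigVects should satisfy covMat * v = eigenvalue * v.

i = 0
v = eigVects[:, i]                                        # i-th eigenvector (a column)
print(np.allclose(np.mat(covMat) * v, eigVals[i] * v))    # True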

(4) Retain the principal components, i.e. the eigenvectors corresponding to the n largest eigenvalues. Step 3 gives the eigenvalue vector eigVals; suppose it contains m eigenvalues. We can sort them and keep the eigenvectors corresponding to the n largest eigenvalues; these n eigenvectors, stored in n_eigVect, form the basis of the new feature space. Multiplying the zero-mean data by n_eigVect yields the dimensionality-reduced data. The code is as follows:
eigValIndice = np.argsort(eigVals)            # sort the eigenvalues from smallest to largest
n_eigValIndice = eigValIndice[-1:-(n + 1):-1] # indices of the n largest eigenvalues
n_eigVect = eigVects[:, n_eigValIndice]       # eigenvectors of the n largest eigenvalues
lowDDataMat = newData * n_eigVect             # low-dimensional data
reconMat = (lowDDataMat * n_eigVect.T) + meanVal  # reconstructed data
return lowDDataMat, reconMat

A few points in the code deserve explanation. First, argsort sorts the eigenvalues from smallest to largest, so the n largest eigenvalues sit at the end of the sorted order; eigValIndice[-1:-(n + 1):-1] therefore yields the indices of the n largest eigenvalues. In Python, list[a:b:c] means taking elements starting at index a, going up to (but not including) index b, with step c.
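A tiny added example of argsort combined with a negative-step slice (the values are made up):

vals = np.array([0.5, 3.0, 1.2, 7.4, 0.1])
idx = np.argsort(vals)        # [4 0 2 1 3] -- indices ordered from smallest to largest value
n = 2
top = idx[-1:-(n + 1):-1]     # [3 1] -- indices of the 2 largest values, largest first
print(vals[top])              # [ 7.4  3. ]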
reconMat is the reconstructed data: the low-dimensional data is multiplied by the transpose of n_eigVect, and the mean meanVal is added back.
OK. With these four steps we obtain the low-dimensional data lowDDataMat from the high-dimensional data dataMat. The program also returns reconMat, which is sometimes useful for data analysis.
The complete code is as follows:
# zero-mean
def zeroMean(dataMat):
    meanVal = np.mean(dataMat, axis=0)   # compute the mean of each column, i.e. the mean of each feature
    newData = dataMat - meanVal
    return newData, meanVal

def pca(dataMat, n):
    newData, meanVal = zeroMean(dataMat)
    covMat = np.cov(newData, rowvar=0)   # covariance matrix (an ndarray); rowvar=0 means each row is a sample
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))  # eigenvalues and eigenvectors; eigenvectors are stored by column
    eigValIndice = np.argsort(eigVals)                 # sort the eigenvalues from smallest to largest
    n_eigValIndice = eigValIndice[-1:-(n + 1):-1]      # indices of the n largest eigenvalues
    n_eigVect = eigVects[:, n_eigValIndice]            # eigenvectors of the n largest eigenvalues
    lowDDataMat = newData * n_eigVect                  # low-dimensional data
    reconMat = (lowDDataMat * n_eigVect.T) + meanVal   # reconstructed data
    return lowDDataMat, reconMat
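A hypothetical usage example of the pca function above (the data here is random and purely illustrative):

np.random.seed(0)
dataMat = np.random.randn(100, 10)      # 100 samples, 10 features (made-up data)
lowDDataMat, reconMat = pca(dataMat, 3)
print(lowDDataMat.shape)                # (100, 3)  -- data reduced to 3 dimensions
print(reconMat.shape)                   # (100, 10) -- reconstruction in the original space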



3. Selecting the Number of Principal Components

We are not done yet. When using PCA, how do we know, for 1000-dimensional data, how many dimensions it is reasonable to reduce to? That is, how large should n be so that we retain as much information as possible while removing as much noise as possible? Generally, n is chosen by the percentage of variance retained. This is clearly explained in the UFLDL tutorial, and there is a simple formula: let λ1 >= λ2 >= ... >= λm be the eigenvalues sorted in descending order, and choose the smallest n such that

    (λ1 + λ2 + ... + λn) / (λ1 + λ2 + ... + λm) >= percentage
Based on this formula, we can write a function whose input parameters are the eigenvalue vector and the percentage, and which determines n from the percentage. The code is as follows:
def percentage2n(eigVals, percentage):
    sortArray = np.sort(eigVals)    # sort in ascending order
    sortArray = sortArray[-1::-1]   # reverse, i.e. descending order
    arraySum = sum(sortArray)
    tmpSum = 0
    num = 0
    for i in sortArray:
        tmpSum += i
        num += 1
        if tmpSum >= arraySum * percentage:
            return num
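A quick illustrative check with made-up eigenvalues (not from the original article):

eigVals = np.array([1.5, 5.0, 0.5, 3.0])   # sorted descending: 5, 3, 1.5, 0.5; total 10
print(percentage2n(eigVals, 0.95))         # 3, since 5 + 3 + 1.5 = 9.5 >= 0.95 * 10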



The pca function can also be rewritten as a percentage-based version, with a default percentage of 99%:
def pca(dataMat, percentage=0.99):
    newData, meanVal = zeroMean(dataMat)
    covMat = np.cov(newData, rowvar=0)   # covariance matrix (an ndarray); rowvar=0 means each row is a sample
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))  # eigenvalues and eigenvectors; eigenvectors are stored by column
    n = percentage2n(eigVals, percentage)              # number of eigenvectors needed to reach the given variance percentage
    eigValIndice = np.argsort(eigVals)                 # sort the eigenvalues from smallest to largest
    n_eigValIndice = eigValIndice[-1:-(n + 1):-1]      # indices of the n largest eigenvalues
    n_eigVect = eigVects[:, n_eigValIndice]            # eigenvectors of the n largest eigenvalues
    lowDDataMat = newData * n_eigVect                  # low-dimensional data
    reconMat = (lowDDataMat * n_eigVect.T) + meanVal   # reconstructed data
    return lowDDataMat, reconMat
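And a hypothetical call to the percentage-based version (again with made-up random data):

np.random.seed(0)
dataMat = np.random.randn(200, 50)             # 200 samples, 50 features
lowDDataMat, reconMat = pca(dataMat, 0.9)
print(lowDDataMat.shape[1])                    # number of dimensions automatically kept to retain 90% of the variance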

