Principal Component Analysis (PCA) is an effective method for compressing and de-noising data based on the covariance matrix of its variables. The idea of PCA is to map n-dimensional features onto k dimensions (k < n), called the principal components. Each principal component is a linear combination of the original features; these combinations are chosen to maximize the sample variance while keeping the new k features uncorrelated with each other.
Related knowledge
Recommended introduction to PCA: A Tutorial on Principal Components Analysis -- Lindsay I Smith
1. Covariance
The covariance of variables X and Y is defined as

    cov(X, Y) = sum_{i=1..n} (X_i - mean(X)) * (Y_i - mean(Y)) / (n - 1)

Covariance measures the correlation between two variables: cov(X, Y) > 0 means X and Y are positively correlated, cov(X, Y) < 0 means they are negatively correlated, and cov(X, Y) = 0 means X and Y are uncorrelated.
Covariance is defined between two dimensions; for an n-dimensional data set, C(n, 2) = n(n-1)/2 pairwise covariances can be computed. The covariance matrix of n-dimensional data is the n x n matrix whose (i, j) entry is cov(Dim_i, Dim_j), where Dim_i denotes the i-th dimension of the data.
For three-dimensional data (x, y, z), the covariance matrix is

    C = | cov(x,x)  cov(x,y)  cov(x,z) |
        | cov(y,x)  cov(y,y)  cov(y,z) |
        | cov(z,x)  cov(z,y)  cov(z,z) |

which shows that the covariance matrix is symmetric and that its diagonal elements are the variances of the individual dimensions.
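As a quick illustration (not part of the original example), the covariance matrix of a small, made-up 3-dimensional data set can be computed with NumPy; the values and variable names below are assumptions for demonstration only:

import numpy as np

# five hypothetical samples with three dimensions (x, y, z)
data = np.array([[2.5, 2.4, 1.2],
                 [0.5, 0.7, 0.3],
                 [2.2, 2.9, 1.0],
                 [1.9, 2.2, 0.9],
                 [3.1, 3.0, 1.5]])

cov_matrix = np.cov(data, rowvar=False)   # rowvar=False: each column is one variable
print(cov_matrix)                         # 3 x 3 symmetric matrix; diagonal holds the variances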
2. Eigenvectors and eigenvalues
If A x = λ x, then λ is called an eigenvalue of the matrix A and x is the corresponding eigenvector. Intuitively, the effect of A on its eigenvector x is only to change the length of x, and the scaling factor is the corresponding eigenvalue. Eigenvectors are only defined for square matrices, and not every matrix has them; if an n x n matrix does have eigenvectors, there are n of them. The eigenvectors of a symmetric matrix (such as a covariance matrix) are mutually orthogonal, i.e. the dot product between distinct eigenvectors is 0. By convention, eigenvectors are normalized so that their length is 1.
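To make this concrete, here is a minimal sketch (an assumed example, not from the article) that computes the eigenvalues and eigenvectors of a symmetric matrix with NumPy and checks that A x = λ x and that the eigenvectors are orthogonal:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                     # a symmetric matrix

eig_vals, eig_vecs = np.linalg.eig(A)          # columns of eig_vecs are unit eigenvectors

v, lam = eig_vecs[:, 0], eig_vals[0]
print(np.allclose(A.dot(v), lam * v))          # True: A only rescales its eigenvector
print(np.isclose(eig_vecs[:, 0].dot(eig_vecs[:, 1]), 0.0))  # True: eigenvectors are orthogonal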
3. PCA process
The first step is to obtain the data. The original data set Data has two dimensions, so the points can be viewed on a two-dimensional plane.
Below is a scatter plot of the data on the two-dimensional coordinate plane:
The second step is to subtract the mean: average each dimension of Data and subtract that average from every value in the dimension, giving the mean-centered data set DataAdjust.
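A minimal sketch of this step, using hypothetical values rather than the article's Data set:

import numpy as np

data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9]])                  # made-up 2-D samples
data_adjust = data - data.mean(axis=0)         # DataAdjust: every column now has zero mean
print(data_adjust)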
The third step is to calculate the covariance matrix of DataAdjust.
The fourth step is to calculate the eigenvalues and eigenvectors of the covariance matrix and select the eigenvectors to keep.
The eigenvector corresponding to the eigenvalue 0.0490833989 is (-0.735178656, 0.677873399); the eigenvectors here are orthogonal and normalized, i.e. their length is 1.
The figure below shows the relationship between the DataAdjust data and the eigenvectors:
The plus signs are the mean-centered sample points, and the two diagonal lines are the orthogonal eigenvectors (the covariance matrix is symmetric, so its eigenvectors are orthogonal). The eigenvector with the larger eigenvalue is the principal component of the data set. In general, once the eigenvectors of the covariance matrix have been computed, the next step is to sort them by eigenvalue from largest to smallest, which orders the components by significance. Components with small eigenvalues carry little information and can therefore be discarded as appropriate.
If the data has n dimensions, then n eigenvectors and eigenvalues are computed; selecting the first k eigenvectors gives a final data set of only k dimensions. The matrix of selected eigenvectors is called FeatureVector.
Here there are only two eigenvalues; we choose the largest one, 1.28402771, and its corresponding eigenvector.
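Continuing the hypothetical data_adjust from the sketch above (the names are assumptions, not the article's code), steps three and four can be written as:

import numpy as np

cov_mat = np.cov(data_adjust, rowvar=False)    # step 3: covariance matrix of DataAdjust
eig_vals, eig_vecs = np.linalg.eig(cov_mat)    # step 4: eigenvalues and eigenvectors

k = 1
order = np.argsort(eig_vals)[::-1]             # indices sorted by eigenvalue, largest first
feature_vector = eig_vecs[:, order[:k]]        # FeatureVector: the k chosen eigenvectors (n x k)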
In the fifth step, the sample points are projected onto the selected eigenvectors to obtain the new data set.
Suppose the number of samples is m and the number of features is n. The mean-subtracted sample matrix is DataAdjust (m x n), the covariance matrix is n x n, and the matrix of the selected k eigenvectors is EigenVectors (n x k). The projected data FinalData is then

    FinalData (m x k) = DataAdjust (m x n) x EigenVectors (n x k)
Here, FinalData (10 x 1) = DataAdjust (10 x 2 matrix) x EigenVectors (2 x 1).
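In code, the projection for the hypothetical example above is a single matrix product (a sketch, not the article's result):

final_data = data_adjust.dot(feature_vector)   # (m x n) * (n x k) -> (m x k) projected data
print(final_data.shape)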
The result is:
FinalData is the data set expressed in terms of the eigenvector with the largest eigenvalue; it can be seen as the DataAdjust sample points projected onto the axis defined by that eigenvector:
If k = 2 is chosen instead, the result is:
It can be seen that if all eigenvectors are used, the new data set can be transformed back into exactly the same data as the original (only the axes are rotated).
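As a sketch of this observation (again using the hypothetical example): the reconstruction projects back onto the chosen eigenvectors and adds the mean; with k equal to the full dimensionality it reproduces the original data exactly.

recon = final_data.dot(feature_vector.T) + data.mean(axis=0)   # back to the original coordinate space
print(recon)            # equals data exactly when k = n; an approximation when k < n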
Implementing PCA in Python
The pseudo-code for transforming the data into the first k principal components is roughly as follows:
" " minus the mean calculates the covariance matrix, calculates the eigenvalues and eigenvectors of the covariance matrix, and the eigenvalues from large to small keep the largest k eigenvector to convert the data to the new space constructed by the K-eigenvectors above " "
The code is implemented as follows:
from numpy import *

def loadDataSet(fileName, delim='\t'):
    fr = open(fileName)
    stringArr = [line.strip().split(delim) for line in fr.readlines()]
    datArr = [map(float, line) for line in stringArr]
    return mat(datArr)

def pca(dataMat, topNfeat=999999):
    meanVals = mean(dataMat, axis=0)
    dataAdjust = dataMat - meanVals                      # subtract the mean
    covMat = cov(dataAdjust, rowvar=0)
    eigVals, eigVects = linalg.eig(mat(covMat))          # compute eigenvalues and eigenvectors
    # print eigVals
    eigValInd = argsort(eigVals)
    eigValInd = eigValInd[:-(topNfeat + 1):-1]           # keep the indices of the largest k eigenvalues
    redEigVects = eigVects[:, eigValInd]                 # the corresponding eigenvectors
    lowDDataMat = dataAdjust * redEigVects               # project the data into the lower-dimensional space
    reconMat = (lowDDataMat * redEigVects.T) + meanVals  # reconstruct the data (useful for debugging)
    return lowDDataMat, reconMat
The test data testSet.txt consists of 1000 data points. Below we reduce the dimensionality of the data and use the matplotlib module to plot the reduced (reconstructed) data together with the original data.
import matplotlib
import matplotlib.pyplot as plt

dataMat = loadDataSet('testSet.txt')
lowDMat, reconMat = pca(dataMat, 1)
print "shape(lowDMat):", shape(lowDMat)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dataMat[:, 0].flatten().A[0], dataMat[:, 1].flatten().A[0], marker='^', s=90)
ax.scatter(reconMat[:, 0].flatten().A[0], reconMat[:, 1].flatten().A[0], marker='o', s=50, c='red')
plt.show()
The result looks like this:
Python environment
1. Environment: Win8.1 (32-bit) + Python 2.7
2. Related Module Installation:
(1) NumPy and SciPy: NumPy is used for storing and processing large matrices in numerical computation. SciPy is a NumPy-based toolkit for scientific and engineering computing that handles multidimensional arrays, vectors, matrices, graphics (images as two-dimensional arrays of pixels), and tables. Download: http://www.scipy.org/scipylib/download.html
(2) Matplotlib: a Python plotting framework for data visualization. Installation files for Win32: http://matplotlib.org/downloads.html
(3) dateutil and pyparsing modules: required when installing and configuring the matplotlib package. Installation files for Win32: http://www.lfd.uci.edu/~gohlke/pythonlibs/
3. Problems encountered:
(1) Hint "No module name six", copy six.py Six.pyc six.pyo three files from \python27\lib\site-packages\scipy\lib to \python27\lib\ Site-packages directory.
(2) Hint "Importerror:six 1.3 or later is required; You have 1.2.0 ", stating that the six.py version is too low, download the latest version, replace the six.py in the \python27\lib\site-packages with the six.py in it: https://pypi.python.org/pypi/six/
Note: the six module is a compatibility tool for Python 2 and Python 3.
PCA identifies the principal features of the data by rotating the coordinate axes to align with the directions of maximum variance. I am still not very familiar with matrix operations, eigenvalues, and eigenvectors, and need to study them further.