1. Background

PCA (Principal Component Analysis) is mainly used to reduce the dimensionality of a data set and then extract its basic features. The main idea of PCA is to move the coordinate axes: find the directions of greatest variance and their eigenvalues. Why the direction with the greatest variance? Just like curve B in the figure: it covers the widest range of the data.
Basic steps: (1) compute the covariance matrix of the data set; (2) compute the eigenvalues and eigenvectors of the covariance matrix; (3) keep the n most important features, i.e. the eigenvectors with the largest eigenvalues.
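These three steps map directly onto a few NumPy calls. Below is a minimal sketch of my own (the toy data and variable names are hypothetical, not from the original post):

    import numpy as np

    # Toy data: 5 samples, 2 features (hypothetical example)
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

    X_centered = X - X.mean(axis=0)          # center the data first
    cov = np.cov(X_centered, rowvar=False)   # step (1): covariance matrix
    eigvals, eigvecs = np.linalg.eig(cov)    # step (2): eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues, largest first
    top_n = 1                                # step (3): keep the top n features
    W = eigvecs[:, order[:top_n]]            # projection matrix
    X_reduced = X_centered @ W               # data reduced to n dimensions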
What is the covariance matrix?

By definition: subtract the mean vector from the variable vector, multiply by the transpose of that same difference, and take the expectation. If X is a random vector and μ is its mean, the covariance matrix is Σ = E[(X − μ)(X − μ)^T]. The physical meaning is this: if X = (X1, X2, ..., Xn), then the entry in row i, column j of the covariance matrix is the covariance of Xi and Xj; when i = j it is the variance of Xi. If the components of X are mutually independent, the covariance matrix has nonzero values only on the diagonal, since independence makes the covariance of Xi and Xj zero for i ≠ j. In addition, the covariance matrix is symmetric.
For details, see the Wikipedia article: http://zh.wikipedia.org/wiki/%E5%8D%8F%E6%96%B9%E5%B7%AE%E7%9F%A9%E9%98%B5
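A quick numerical check of these properties (my own illustration, not part of the original post): for independent variables the off-diagonal entries come out near zero, and the matrix is symmetric.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two (approximately) independent variables, 10000 samples each
    x1 = rng.normal(0.0, 1.0, 10000)
    x2 = rng.normal(5.0, 2.0, 10000)
    X = np.vstack([x1, x2])

    mu = X.mean(axis=1, keepdims=True)
    # Sigma = E[(X - mu)(X - mu)^T], estimated from the samples
    sigma = (X - mu) @ (X - mu).T / (X.shape[1] - 1)

    print(sigma)                         # diagonal near [1, 4]; off-diagonal near 0
    print(np.allclose(sigma, sigma.T))   # True: the covariance matrix is symmetric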
2. Code implementation

The code below is adapted from Machine Learning in Action:
    '''
    @author: Garvin
    '''
    from numpy import *
    import matplotlib.pyplot as plt

    def loadDataSet(fileName, delim='\t'):
        fr = open(fileName)
        stringArr = [line.strip().split(delim) for line in fr.readlines()]
        datArr = [list(map(float, line)) for line in stringArr]  # list() for Python 3
        return mat(datArr)

    def pca(dataMat, topNfeat=9999999):
        meanVals = mean(dataMat, axis=0)
        meanRemoved = dataMat - meanVals               # remove mean
        covMat = cov(meanRemoved, rowvar=0)            # covariance matrix
        eigVals, eigVects = linalg.eig(mat(covMat))
        eigValInd = argsort(eigVals)                   # sort, smallest to largest
        eigValInd = eigValInd[:-(topNfeat + 1):-1]     # cut off unwanted dimensions
        redEigVects = eigVects[:, eigValInd]           # reorganize eig vects, largest to smallest
        lowDDataMat = meanRemoved * redEigVects        # transform data into new dimensions
        reconMat = (lowDDataMat * redEigVects.T) + meanVals
        return lowDDataMat, reconMat

    def plotBestFit(dataSet1, dataSet2):
        dataArr1 = array(dataSet1)
        dataArr2 = array(dataSet2)
        n = shape(dataArr1)[0]
        xcord1 = []; ycord1 = []
        xcord2 = []; ycord2 = []
        for i in range(n):
            xcord1.append(dataArr1[i, 0]); ycord1.append(dataArr1[i, 1])
            xcord2.append(dataArr2[i, 0]); ycord2.append(dataArr2[i, 1])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
        ax.scatter(xcord2, ycord2, s=30, c='green')
        plt.xlabel('X1'); plt.ylabel('X2')
        plt.show()

    if __name__ == '__main__':
        mata = loadDataSet('/users/hakuri/desktop/testset.txt')
        a, b = pca(mata, 2)
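As a sanity check on the implementation above (my own addition, assuming scikit-learn is available; none of this appears in the original post), the reduced coordinates should match scikit-learn's PCA up to a per-component sign flip:

    import numpy as np
    from numpy import mat
    from sklearn.decomposition import PCA as SkPCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 3))  # correlated 3-D data

    lowD, _ = pca(mat(X), 2)                     # the pca() defined above
    sk = SkPCA(n_components=2).fit_transform(X)  # scikit-learn reference

    # Principal components are only defined up to sign, so compare magnitudes
    print(np.allclose(np.abs(np.asarray(lowD)), np.abs(sk)))  # expected: True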
The loadDataSet function imports the data set. pca takes two parameters: the first is the input data set, and the second is the number of dimensions to keep. For example, if the second parameter is set to 1, pca returns a matrix reduced to one dimension. pca returns two values: the first is the low-dimensional matrix, corresponding to the reduced input data; the second is the matrix after the coordinate axes have been moved, i.e. the reduced data mapped back into the original space.
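As a hedged usage sketch (the synthetic data set below is my own assumption, standing in for the post's testSet.txt), the following drives the functions above and plots the raw data in green against the reduced-then-reconstructed data in red:

    # Continuing from the pca/plotBestFit definitions above.
    import numpy as np
    from numpy import mat

    rng = np.random.default_rng(7)
    # 200 correlated 2-D points (hypothetical stand-in for testSet.txt)
    raw = mat(rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=200))

    lowD, recon = pca(raw, 1)        # reduce to one dimension
    print(lowD.shape, recon.shape)   # (200, 1) and (200, 2)

    # dataSet1 is drawn in red (the reconstruction), dataSet2 in green (the raw data)
    plotBestFit(recon, raw)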
In the figure above, the green points are the raw data and the red points are the extracted 2-dimensional features.
3. Code Download

Please click on my [download link].
/********************************
* This article is from the blog "Bo Li Garvin"
* For reprints, please indicate the source: http://blog.csdn.net/buptgshengod
******************************************/
"Machine Learning Algorithm-python realization" PCA principal component analysis, dimensionality reduction