Principal Component Analysis (PCA) is a simple machine learning algorithm. Its main idea is to reduce the dimensionality of high-dimensional data, removing redundant information and noise from the data.
Algorithm:
Input: sample set D = {x₁, x₂, ⋯, x_m}; the dimension d′ of the low-dimensional space.
Process:
1. Center all samples: xᵢ ← xᵢ − (1/m) Σᵢ₌₁ᵐ xᵢ;
2. Compute the covariance matrix of the samples: XXᵀ;
3. Perform eigenvalue decomposition on the covariance matrix XXᵀ;
4. Take the eigenvectors w₁, w₂, ⋯, w_{d′} corresponding to the d′ largest eigenvalues.
Output: the projection matrix W = (w₁, w₂, ⋯, w_{d′})
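The steps above can be sketched directly in NumPy. This is a minimal sketch: the toy data and variable names are my own, and with samples stored as rows the product X_centered.T @ X_centered plays the role of XXᵀ in the algorithm.

```python
import numpy as np

# Toy data: m = 6 samples in d = 2 dimensions (one sample per row).
X = np.array([[-1., -1.], [-2., -1.], [-3., -2.], [1., 1.], [2., 1.], [3., 2.]])

# Step 1: center all samples.
X_centered = X - X.mean(axis=0)

# Steps 2-3: covariance matrix of the samples and its eigendecomposition.
cov = X_centered.T @ X_centered
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Step 4: keep the eigenvectors of the d' largest eigenvalues.
d_prime = 1
W = eigvecs[:, ::-1][:, :d_prime]        # projection matrix, shape (d, d')

# Project the centered data onto the low-dimensional space.
Z = X_centered @ W
print(Z.shape)  # (6, 1)
```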
The PCA algorithm is mainly used in image compression, image fusion, and face recognition.
Python's sklearn package provides an interface for PCA:
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# pca = PCA(n_components=2)
pca = PCA(n_components='mle')   # 'mle' chooses the number of components automatically
pca.fit(X)
print(pca.explained_variance_ratio_)
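Besides 'mle', n_components also accepts an explicit integer, or a float in (0, 1) meaning "keep enough components to explain that fraction of the variance". A small sketch, assuming scikit-learn is installed; the 0.95 threshold and the toy array are illustrative choices of my own:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep enough components to explain at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(pca.n_components_)              # number of components actually kept
print(pca.explained_variance_ratio_)

# The reduced data can be mapped back to the original space.
X_approx = pca.inverse_transform(X_reduced)
print(X_approx.shape)
```

For this toy data the first component already explains over 99% of the variance, so a single component is kept.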
Testing with a self-made data set: the program keeps one principal component to reduce the two-dimensional data to one dimension.
Using the PCA algorithm to reduce the dimensionality of the testSet.txt data set:
import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(fileName, delim='\t'):
    fr = open(fileName)
    stringArr = [line.strip().split(delim) for line in fr.readlines()]
    datArr = [list(map(float, line)) for line in stringArr]
    return np.mat(datArr)

def pca(dataMat, topNfeat=9999999):
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals              # remove mean
    covMat = np.cov(meanRemoved, rowvar=0)        # find the direction of maximum variance: var(a'X) = a' Cov(X) a
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))
    eigValInd = np.argsort(eigVals)               # sort, smallest to largest
    eigValInd = eigValInd[:-(topNfeat + 1):-1]    # cut off unwanted dimensions
    redEigVects = eigVects[:, eigValInd]          # reorganize eig vects largest to smallest
    lowDDataMat = meanRemoved * redEigVects       # transform data into new dimensions
    reconMat = (lowDDataMat * redEigVects.T) + meanVals
    return lowDDataMat, reconMat

dataMat = loadDataSet('testSet.txt')
print(dataMat)
lowDMat, reconMat = pca(dataMat, 1)
print('low-dimensional data:')
print(lowDMat)
print('reconstructed data:')
print(reconMat)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(np.array(dataMat[:, 0]), np.array(dataMat[:, 1]), marker='^', s=90)
ax.scatter(np.array(reconMat[:, 0]), np.array(reconMat[:, 1]), marker='o', s=50, c='red')
plt.show()

def replaceNanWithMean():
    datMat = loadDataSet('secom.data', ' ')
    numFeat = np.shape(datMat)[1]
    for i in range(numFeat):
        meanVal = np.mean(datMat[np.nonzero(~np.isnan(datMat[:, i].A))[0], i])
        datMat[np.nonzero(np.isnan(datMat[:, i].A))[0], i] = meanVal
    return datMat

dataMat = replaceNanWithMean()
meanVals = np.mean(dataMat, axis=0)
meanRemoved = dataMat - meanVals                  # remove mean
covMat = np.cov(meanRemoved, rowvar=0)
eigVals, eigVects = np.linalg.eig(np.mat(covMat))
eigValInd = np.argsort(eigVals)                   # sort, smallest to largest
eigValInd = eigValInd[::-1]                       # reverse: largest first
sortedEigVals = eigVals[eigValInd]
total = sum(sortedEigVals)
varPercentage = sortedEigVals / total * 100       # percentage of variance per principal component
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(1, 21), varPercentage[:20], marker='^')
plt.xlabel('Principal Component Number')
plt.ylabel('Percentage of Variance')
plt.show()
Results:
The blue triangles are the original data and the red circles are the data projected onto the principal direction; you can see that the PCA algorithm finds the main direction of the data well.
Human Face Recognition:
The att_faces data set contains 40 people, with 10 grayscale photos of 92×112 pixels for each person.
Here is an example using the att_faces data set:
import os
import operator
from numpy import *
import matplotlib.pyplot as plt
import cv2

# define PCA
def pca(data, k):
    data = float32(data)
    rows, cols = data.shape                      # fetch size
    data_mean = mean(data, 0)
    data_mean_all = tile(data_mean, (rows, 1))
    Z = data - data_mean_all                     # center the data
    T1 = Z * Z.T                                 # compute the sample covariance
    D, V = linalg.eig(T1)                        # eigenvalues and eigenvectors
    V1 = V[:, 0:k]                               # take the first k eigenvectors
    V1 = Z.T * V1
    for i in range(k):                           # normalize the eigenvectors
        L = linalg.norm(V1[:, i])
        V1[:, i] = V1[:, i] / L
    data_new = Z * V1                            # data after dimensionality reduction
    return data_new, data_mean, V1               # training result

# convert image to vector
def img2vector(filename):
    img = cv2.imread(filename, 0)                # read the image as grayscale
    rows, cols = img.shape
    imgVector = zeros((1, rows * cols))          # create an empty vector to raise speed
    imgVector = reshape(img, (1, rows * cols))   # change img from 2D to 1D
    return imgVector

# load dataSet
def loadDataSet(k):  # choose k (0-10) photos per person as the training set
    # step 1: getting data set
    print("--Getting data set---")
    # note: use '/' not '\'
    dataSetDir =