Principal Component Analysis (PCA) Study Notes

Source: Internet
Author: User
Tags: python

Principal Component Analysis (PCA) is a simple machine learning algorithm. Its main idea is to reduce the dimensionality of high-dimensional data, removing redundant information and noise in the process.
Algorithm:
Input: sample set D = {x_1, x_2, ..., x_m}; the dimension d' of the low-dimensional target space.

Process:
1: Center all samples: x_i ← x_i − (1/m) Σ_{i=1}^{m} x_i;
2: Compute the covariance matrix XX^T of all samples;
3: Perform eigenvalue decomposition on the covariance matrix XX^T;
4: Take the eigenvectors w_1, w_2, ..., w_{d'} corresponding to the d' largest eigenvalues.
Output: projection matrix W = (w_1, w_2, ..., w_{d'})
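The four steps above can be sketched directly in NumPy (a minimal illustration on a toy 5×2 sample matrix; the data and variable names here are my own, not from the original notes):

```python
import numpy as np

# toy samples: m = 5 points in 2 dimensions, one sample per row
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# step 1: center all samples
Xc = X - X.mean(axis=0)

# step 2: covariance matrix of the centered samples
C = Xc.T @ Xc / (Xc.shape[0] - 1)

# step 3: eigenvalue decomposition (eigh, since C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# step 4: keep the eigenvectors of the d' = 1 largest eigenvalues
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:1]]   # projection matrix, shape (2, 1)

Z = Xc @ W                  # projected 1-D data, shape (5, 1)
print(W.shape, Z.shape)
```

Note that the projection of centered data always sums to zero, which is a quick sanity check on steps 1 and 4.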
The PCA algorithm is mainly used in image compression, image fusion, and face recognition.

Python's sklearn package provides an interface for PCA:

from sklearn.decomposition import PCA
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# pca = PCA(n_components=2)
pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
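Beyond explained_variance_ratio_, the fitted PCA object can also project and reconstruct data with transform and inverse_transform. A small sketch on the same six points, using n_components=1 for illustration (this variant is my addition, not part of the original notes):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)          # project onto the first principal component
X_hat = pca.inverse_transform(Z)  # map back into the original 2-D space

print(Z.shape)                    # (6, 1)
print(pca.explained_variance_ratio_)
```

For this data set the first component captures almost all of the variance, so the reconstruction X_hat lies very close to the original points.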

Testing with a self-made data set:
The program extracts one principal component to reduce the two-dimensional data to one dimension.

Using the PCA algorithm to reduce the dimensionality of the testSet.txt data set:

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(fileName, delim='\t'):
    fr = open(fileName)
    stringArr = [line.strip().split(delim) for line in fr.readlines()]
    datArr = [list(map(float, line)) for line in stringArr]
    return np.mat(datArr)

def pca(dataMat, topNfeat=9999999):
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals  # remove mean
    covMat = np.cov(meanRemoved, rowvar=0)  # find the direction of maximum variance: var(a'x) = a' cov(x) a
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))
    eigValInd = np.argsort(eigVals)  # sort goes smallest to largest
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # cut off unwanted dimensions
    redEigVects = eigVects[:, eigValInd]  # reorganize eig vects largest to smallest
    lowDDataMat = meanRemoved * redEigVects  # transform data into new dimensions
    reconMat = (lowDDataMat * redEigVects.T) + meanVals
    return lowDDataMat, reconMat

dataMat = loadDataSet('testSet.txt')
print(dataMat)
lowDMat, reconMat = pca(dataMat, 1)
print('Low-dimensional data:')
print(lowDMat)
print('Reconstructed data:')
print(reconMat)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(np.array(dataMat[:, 0]), np.array(dataMat[:, 1]), marker='^', s=90)
ax.scatter(np.array(reconMat[:, 0]), np.array(reconMat[:, 1]), marker='o', s=50, c='red')
plt.show()

def replaceNanWithMean():
    datMat = loadDataSet('secom.data', ' ')
    numFeat = np.shape(datMat)[1]
    for i in range(numFeat):
        meanVal = np.mean(datMat[np.nonzero(~np.isnan(datMat[:, i].A))[0], i])  # mean of the non-NaN values
        datMat[np.nonzero(np.isnan(datMat[:, i].A))[0], i] = meanVal  # set NaN values to the mean
    return datMat

dataMat = replaceNanWithMean()
meanVals = np.mean(dataMat, axis=0)
meanRemoved = dataMat - meanVals  # remove mean
covMat = np.cov(meanRemoved, rowvar=0)
eigVals, eigVects = np.linalg.eig(np.mat(covMat))
eigValInd = np.argsort(eigVals)  # sort goes smallest to largest
eigValInd = eigValInd[::-1]  # reverse
sortedEigVals = eigVals[eigValInd]
total = sum(sortedEigVals)
varPercentage = sortedEigVals / total * 100  # percentage of variance explained by each principal component
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(1, 21), varPercentage[:20], marker='^')
plt.xlabel('Principal Component Number')
plt.ylabel('Percentage of Variance')
plt.show()

Results:
The blue triangles are the original data and the red circles are the data reconstructed along the principal direction; you can see that the PCA algorithm finds the main direction of the data well.
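The replaceNanWithMean idea used for the secom.data set (fill each NaN with the mean of the non-NaN values in its column before running PCA) can be illustrated on a tiny array (the array here is my own example):

```python
import numpy as np

A = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

for i in range(A.shape[1]):
    col = A[:, i]                                  # view into column i
    col[np.isnan(col)] = np.mean(col[~np.isnan(col)])  # fill NaN with column mean

print(A)
```

Column 0 has mean 2.0 over its non-NaN entries and column 1 has mean 6.0, so those values replace the NaNs in place.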
Human Face Recognition:

The att_faces data set contains 40 subjects, with 10 grayscale photos of 92*112 pixels for each face.
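Before PCA, each 92*112 photo is flattened into a 10304-dimensional row vector, which is what the img2vector step below does. A minimal sketch of that flattening, using a synthetic zero array standing in for a real photo:

```python
import numpy as np

img = np.zeros((112, 92), dtype=np.uint8)  # stand-in for one grayscale photo
vec = img.reshape(1, 112 * 92)             # 2-D image -> 1-D row vector

print(vec.shape)
```

With 400 photos in total, the training matrix therefore has up to 400 rows of 10304 features each.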

Here is an example using the att_faces data set:

import os
import operator
from numpy import *
import matplotlib.pyplot as plt
import cv2

# define PCA
def pca(data, k):
    data = float32(data)
    rows, cols = data.shape  # get size
    data_mean = mean(data, 0)
    data_mean_all = tile(data_mean, (rows, 1))
    Z = data - data_mean_all  # center the data
    T1 = Z * Z.T  # compute the sample covariance
    D, V = linalg.eig(T1)  # eigenvalues and eigenvectors
    V1 = V[:, 0:k]  # take the first k eigenvectors
    V1 = Z.T * V1
    for i in range(k):  # normalize the eigenvectors
        L = linalg.norm(V1[:, i])
        V1[:, i] = V1[:, i] / L
    data_new = Z * V1  # data after dimensionality reduction
    return data_new, data_mean, V1  # training result

# convert image to vector
def img2vector(filename):
    img = cv2.imread(filename, 0)  # read the image as grayscale
    rows, cols = img.shape
    imgVector = zeros((1, rows * cols))  # create an empty vector to raise speed
    imgVector = reshape(img, (1, rows * cols))  # change img from 2D to 1D
    return imgVector

# load dataSet
def loadDataSet(k):  # choose k (0-10) photos of each person for training
    # step 1: getting the data set
    print("--Getting data set---")
    # note: use '/' not '\'
    dataSetDir =
