http://blog.csdn.net/jerr__y/article/details/53188573
This article mainly refers to the following two articles; the code in the text is essentially a hand-written re-implementation of the code from the second one.
- PCA explanation: http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html
- Python implementation: http://blog.csdn.net/u012162613/article/details/42177327
Overall code
"" "The Total code. Func: The original characteristic matrix is reduced to dimension, and Lowdatamat is returned to the new feature matrix after descending Koriyuki. Usage:lowddatamat = PCA (Datamat, K) "" "# 0 is the value ofDefZeromean(Datamat):# Find the average of each column feature meanval = Np.mean (Datamat, axis=0) NewData = datamat-meanvalReturn NewData, MeanvalDefPca (datamat,k): Newdata,meanval=zeromean (Datamat) Covmat=np.cov (NewData,rowvar=0) #求协方差矩阵, return ndarray; if Rowvar is not 0, a column represents a sample, 0, and one row represents a sample eigvals, Eigvects=np.linalg.eig (Np.mat (Covmat)) #求特征值和特征向量, eigenvectors are placed in columns, i.e. one column represents a eigenvectors eigvalindice= Np.argsort (eigvals) #对特征值从小到大排序 k_eigvalindice=eigvalindice[- 1:-(k+1):-< span class= "Hljs-number" >1] #最大的k个特征值的下标 K_eigvect=eigvects[:,k_eigvalindice] # The largest k eigenvalues correspond to the eigenvector lowddatamat=newdata*k_eigvect #低维特征空间的数据 return Lowddatamat# reconmat= (lowddatamat*k_eigvect.t) +meanval #重构数据 # return lowddatamat,reconmat
Next, implement PCA step by step.
(0) Prepare the data first.
import numpy as np
# n-dimensional raw data; in this example, n = 2
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
print data
[[ 2.5  2.4]
 [ 0.5  0.7]
 [ 2.2  2.9]
 [ 1.9  2.2]
 [ 3.1  3. ]
 [ 2.3  2.7]
 [ 2.   1.6]
 [ 1.   1.1]
 [ 1.5  1.6]
 [ 1.1  0.9]]
(1) Zero-mean the data
# (1) zero-mean centering
def zeroMean(dataMat):
    # mean of each column (feature)
    meanVal = np.mean(dataMat, axis=0)
    newData = dataMat - meanVal
    return newData, meanVal

newData, meanVal = zeroMean(data)
print 'the newData is \n', newData
print 'the meanVal is \n', meanVal
the newData is 
[[ 0.69  0.49]
 [-1.31 -1.21]
 [ 0.39  0.99]
 [ 0.09  0.29]
 [ 1.29  1.09]
 [ 0.49  0.79]
 [ 0.19 -0.31]
 [-0.81 -0.81]
 [-0.31 -0.31]
 [-0.71 -1.01]]
the meanVal is 
[ 1.81  1.91]
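As a quick sanity check on the centering step (a minimal sketch, assuming the newData variable just computed): after subtracting the column means, each column of newData should average to zero up to floating-point error.

# the column means of the centered data should be (numerically) zero
print(np.mean(newData, axis=0))                   # roughly [ 0.  0.]
print(np.allclose(np.mean(newData, axis=0), 0))   # expected: True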
(2) Compute the covariance matrix of the features
# (2) covariance matrix; rowvar=0 means each column corresponds to one feature
covMat = np.cov(newData, rowvar=0)
print covMat
# rowvar=1 would mean each row is one feature and each column one sample,
# which is clearly not how our data is laid out
# covMat2 = np.cov(newData, rowvar=1)
# print covMat2
[[ 0.61655556  0.61544444]
 [ 0.61544444  0.71655556]]
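For intuition, np.cov with rowvar=0 on already-centered data is the same as the hand-computed unbiased estimate newData.T * newData / (n - 1); a minimal verification sketch, assuming the newData and covMat variables above (covManual and n are illustrative local names):

n = newData.shape[0]                               # number of samples (10 here)
covManual = np.dot(newData.T, newData) / (n - 1)   # unbiased sample covariance
print(covManual)                                   # should match covMat printed above
print(np.allclose(covManual, covMat))              # expected: True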
(3) Compute the eigenvalues and eigenvectors of the covariance matrix from (2)
# (3) eigenvalues and eigenvectors of the covariance matrix,
#     using the eig function in numpy's linear algebra module linalg
eigVals, eigVects = np.linalg.eig(np.mat(covMat))
print 'the eigenvalues are:\n', eigVals
print 'the eigenvectors are:\n', eigVects
the eigenvalues are:
[ 0.0490834   1.28402771]
the eigenvectors are:
[[-0.73517866 -0.6778734 ]
 [ 0.6778734  -0.73517866]]
In the above results:
The eigenvalues are:
[ 0.0490834 1.28402771]
The eigenvectors are:
[[-0.73517866 -0.6778734 ]
[0.6778734 -0.73517866]]
The eigenvalue 0.0490834 corresponds to the eigenvector in the first column, (-0.73517866, 0.6778734)^T.
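Each eigenvalue/eigenvector pair can be checked against the defining relation covMat · v = λ · v; a minimal sketch, assuming the covMat, eigVals and eigVects variables from step (3) (v and i are illustrative loop names):

# eigenvectors are stored as the columns of eigVects
for i in range(len(eigVals)):
    v = eigVects[:, i]                                        # the i-th eigenvector (a column)
    print(np.allclose(np.dot(covMat, v), eigVals[i] * v))     # expected: True for every pair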
(4) Reduce the dimensionality to k dimensions (k < n)
# (4) Keep the principal components: sort the eigenvalues from large to small, select the
#     largest k, and use the corresponding k eigenvectors as column vectors to form the
#     projection matrix.
#     In this example we keep the eigenvector (-0.6778734, -0.73517866)^T that corresponds
#     to the eigenvalue 1.28402771.
k = 1                                            # this example takes k = 1
eigValIndice = np.argsort(eigVals)               # sort the eigenvalues from small to large
n_eigValIndice = eigValIndice[-1:-(k+1):-1]      # indices of the k largest eigenvalues
n_eigVect = eigVects[:, n_eigValIndice]          # take the corresponding k eigenvectors
print n_eigVect
print n_eigVect.shape
lowDataMat = newData * n_eigVect                 # data in the low-dimensional feature space
reconMat = (lowDataMat * n_eigVect.T) + meanVal  # reconstruct the data after dimensionality reduction
print 'The samples projected onto the selected eigenvectors, which now serve as the new features:\n', lowDataMat
print 'The samples after dimensionality reduction:\n', reconMat
[[-0.6778734 ]
 [-0.73517866]]
(2L, 1L)
The samples projected onto the selected eigenvectors, which now serve as the new features:
[[-0.82797019]
 [ 1.77758033]
 [-0.99219749]
 [-0.27421042]
 [-1.67580142]
 [-0.9129491 ]
 [ 0.09910944]
 [ 1.14457216]
 [ 0.43804614]
 [ 1.22382056]]
The samples after dimensionality reduction:
[[ 2.37125896  2.51870601]
 [ 0.60502558  0.60316089]
 [ 2.48258429  2.63944242]
 [ 1.99587995  2.11159364]
 [ 2.9459812   3.14201343]
 [ 2.42886391  2.58118069]
 [ 1.74281635  1.83713686]
 [ 1.03412498  1.06853498]
 [ 1.51306018  1.58795783]
 [ 0.9804046   1.01027325]]
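The index slicing eigValIndice[-1:-(k+1):-1] used in step (4) can look opaque; it simply walks the ascending argsort result backwards to pick out the indices of the k largest eigenvalues. A minimal sketch for this example, assuming the eigVals array from step (3) (idx is an illustrative name):

idx = np.argsort(eigVals)     # indices sorted by ascending eigenvalue, here [0 1]
print(idx[-1:-(1+1):-1])      # for k = 1: the index of the largest eigenvalue, i.e. [1]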
Samples after dimensionality reduction:
[[2.37125896 2.51870601]
[0.60502558 0.60316089]
[2.48258429 2.63944242]
[1.99587995 2.11159364]
[2.9459812 3.14201343]
[2.42886391 2.58118069]
[1.74281635 1.83713686]
[1.03412498 1.06853498]
[1.51306018 1.58795783]
[0.9804046 1.01027325]]
The original samples:
[[2.5 2.4]
[0.5 0.7]
[2.2 2.9]
[1.9 2.2]
[3.1 3.]
[2.3 2.7]
[2.  1.6]
[1.  1.1]
[1.5 1.6]
[1.1 0.9]]
By comparison, we can see that we have successfully reduced the features from two dimensions to one, and that the data reconstructed after dimensionality reduction differs somewhat from the original data.
We can think of this as removing part of the noise (although, of course, some of the real information may be lost as well).
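To put a number on that change, one can look at the reconstruction error and at how much variance the discarded component carried; a minimal sketch, assuming the data, reconMat and eigVals variables from the steps above (mse is an illustrative name):

# mean squared error between the original samples and the reconstructed ones
mse = np.mean(np.square(np.asarray(data) - np.asarray(reconMat)))
print(mse)                            # small but nonzero

# fraction of the total variance carried by the discarded component
print(eigVals.min() / eigVals.sum())  # about 0.037, i.e. roughly 3.7% of the variance is dropped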
——————————————-Split Line ———————————————————
Using Sklearn to implement PCA
- Reference post: http://blog.csdn.net/u012162613/article/details/42192293
# raw data
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
# print data
# OK, it really is this simple
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
new_feature = pca.fit_transform(data)
print new_feature
[[-0.82797019]
 [ 1.77758033]
 [-0.99219749]
 [-0.27421042]
 [-1.67580142]
 [-0.9129491 ]
 [ 0.09910944]
 [ 1.14457216]
 [ 0.43804614]
 [ 1.22382056]]
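As a follow-up sketch (assuming the fitted pca object above), sklearn's PCA also reports how much variance the kept component explains, and exposes the principal axis itself; note that the sign of a principal component is arbitrary, so in general the projection may differ from the hand-rolled result by a sign flip.

print(pca.explained_variance_ratio_)   # fraction of variance explained by the kept component (about 0.96)
print(pca.components_)                 # the principal axis; should match the manual eigenvector up to sign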
Reposted: the Python implementation of PCA