A little learning summary of the PCA algorithm
Source: http://blog.csdn.net/xizhibei
=============================
PCA, i.e. Principal Components Analysis, is a very good algorithm. According to the book:
Find the projection that best represents the original data in the least mean square sense.
And in my own words: it is mainly used for dimensionality reduction of features.
In addition, the algorithm has a classic application: face recognition. Briefly, each face image is turned into a single feature vector by concatenating its rows, and then PCA is used to reduce the dimension.
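For example, flattening an image into such a feature vector is a one-liner in NumPy (a minimal sketch; the file name and image size are made up for illustration):

import numpy as np

img = np.load("face.npy")   # hypothetical 112 x 92 grayscale face image
x = img.reshape(-1)         # concatenate the rows into one 10304-dimensional feature vector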
The main idea of PCA is to find the principal axes of the data, which form a new coordinate system whose dimension can be lower than the original one; the data is then projected from the original coordinate system onto the new one, and this projection is the dimensionality reduction.
I won't drag you through the derivation here; instead I recommend a tutorial that explains it very concretely: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Then the steps of the algorithm:
1. Compute the mean m and scatter matrix S of all samples; the so-called scatter matrix is just the covariance matrix, up to a scale factor;
2. Compute the eigenvalues of S and sort them from largest to smallest;
3. Take the eigenvectors corresponding to the first n' eigenvalues to form a transformation matrix E = [e1, e2, ..., en'];
4. Finally, each original n-dimensional feature vector x can be converted to a new n'-dimensional feature vector y (a worked sketch follows below):
y = transpose(E) * (x - m)
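To make the four steps concrete, here is a minimal NumPy sketch on a tiny made-up 2-D dataset (the numbers are only for illustration; n = 2, n' = 1):

import numpy as np

# Toy data: 6 samples, 2 features each (made-up numbers)
x = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
m = x.mean(axis=0)                    # step 1: mean
s = (x - m).T @ (x - m) / len(x)      # step 1: scatter (covariance) matrix
d, e = np.linalg.eigh(s)              # step 2: eigenvalues and eigenvectors (columns)
order = np.argsort(d)[::-1]           # step 2: sort eigenvalues from large to small
E = e[:, order[:1]]                   # step 3: transformation matrix with n' = 1
y = (x - m) @ E                       # step 4: each row is transpose(E) * (x - m) for one sample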
Finally, you have to get your hands dirty to really remember it. I did it with Python and NumPy; doing it in C would be asking for trouble, far too much hassle. Since I am not familiar with NumPy, the following may contain mistakes; corrections are welcome:
import numpy as np

mat = np.load("data.npy")        # each row: a class label followed by a feature vector
data = mat[:, 1:].astype(float)
avg = np.average(data, 0)
means = data - avg               # centered data
n = means.shape[0]               # n is the number of samples
tmp = means.T @ means / n        # scatter (covariance) matrix
d, v = np.linalg.eigh(tmp)       # eigenvalues and eigenvectors (as columns); eigh suits the symmetric matrix,
idx = np.argsort(d)[::-1]        # but it does not sort descending, so order the eigenvalues explicitly
E = v[:, idx[:100]]              # simply take the first 100 dimensions here; in practice consider keeping e.g. the top 80%
y = E.T @ means.T                # feature vectors after dimensionality reduction, one per column
np.save("final", y)
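About "take the first 80%": I read that as keeping enough of the largest eigenvalues to explain 80% of the total variance. A minimal sketch, reusing d, v and idx from the code above:

ratio = np.cumsum(d[idx]) / d.sum()        # cumulative fraction of variance explained
k = int(np.searchsorted(ratio, 0.8)) + 1   # smallest k whose top-k eigenvalues reach 80%
E = v[:, idx[:k]]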
In addition, it is worth mentioning that OpenCV (the omnipotent OpenCV, OTL) has a PCA implementation:
void cvCalcPCA(const CvArr* data,     // input data
               CvArr* avg,            // mean (output)
               CvArr* eigenvalues,    // eigenvalues (output)
               CvArr* eigenvectors,   // eigenvectors (output)
               int flags);            // how the feature vectors are laid out in the input data, e.g. CV_PCA_DATA_AS_ROW
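That is the old C interface; in newer OpenCV versions the same functionality is exposed to Python through cv2.PCACompute and cv2.PCAProject. A minimal sketch (the array shapes are made up for illustration):

import numpy as np
import cv2

data = np.random.rand(100, 64).astype(np.float32)                  # 100 samples as rows (CV_PCA_DATA_AS_ROW layout)
mean, eigvecs = cv2.PCACompute(data, mean=None, maxComponents=10)  # mean and top-10 eigenvectors
y = cv2.PCAProject(data, mean, eigvecs)                            # 100 x 10 reduced feature vectors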
Finally, the disadvantage of PCA: it treats all samples (the whole set of feature vectors) as a single whole and looks for the optimal linear projection in the minimum mean square error sense, ignoring class labels; the projection directions it discards may contain exactly the information that matters for classification.
Well, finally... no, really, this is the end.
Highly recommended: an article that can help you understand PCA very thoroughly, "The physical meaning of eigenvectors": http://blog.sina.com.cn/s/blog_49a1f42e0100fvdu.html