PCA (principal component analysis) MATLAB implementation

Source: Internet
Author: User

First, Introduction

PCA (principal component analysis) is a dimensionality reduction method that is widely used in image processing. Consider a common task in digital image processing: image retrieval, that is, finding images similar to a query image in a database of tens of thousands, millions, or even more images. The usual approach is to extract features from every image in the library (color, texture, SIFT, SURF, VLAD, and so on), store them, and build a corresponding index. At query time we extract the same features from the query image and compare them with the features in the database to find the nearest images.

To improve retrieval accuracy we usually extract more complex features such as SIFT or SURF. One image contains many such feature points, and each feature point is described by a vector (128-dimensional for SIFT). If an image has 300 feature points, it is represented by 300 vectors of 128 dimensions each. With 1 million images in the database, storage becomes very large and indexing becomes very time-consuming. If we apply PCA to each vector and reduce it to 64 dimensions, the storage needed for the descriptors is roughly halved.

Most people who study image processing know that PCA reduces dimensionality, but many do not know exactly how it works. That is why I wrote this article, which explains PCA and its calculation process step by step.

Second, PCA in detail

1. Raw data:

For convenience, we assume that the data is two-dimensional and use a data set found online:

x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]^T
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]^T

2. Calculate the covariance matrix

What is a covariance matrix? Readers of this article have probably studied mathematical statistics and know the basics, but it may have been a while and much of it may be forgotten. To make the rest easier to follow, we first briefly review the relevant concepts. If you already know how to compute a covariance matrix, you can skip this part.

(1) Covariance matrix:

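Given a sample of n values $X_1, \dots, X_n$, the basic concepts from mathematical statistics are:

Mean value: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$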
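As a quick check of these three statistics, here is a minimal MATLAB sketch. It assumes the n-1 (sample) divisor, which is what the later code in this article uses, and it borrows the x values from the raw data above. The variable names mx, vx and sx are only illustrative.

```matlab
% Example data: the x values used in this article.
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
n = numel(x);

mx = sum(x) / n;                   % mean
vx = sum((x - mx).^2) / (n - 1);   % sample variance (n-1 divisor)
sx = sqrt(vx);                     % sample standard deviation

% Compare the hand-written formulas with the built-in functions.
fprintf('mean: %.4f (built-in %.4f)\n', mx, mean(x));
fprintf('var:  %.4f (built-in %.4f)\n', vx, var(x));
fprintf('std:  %.4f (built-in %.4f)\n', sx, std(x));
```

MATLAB's var and std use the n-1 divisor by default, so the two columns printed should agree.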

Since we already have these statistics, why do we need covariance? Standard deviation and variance describe one-dimensional data, but in real life we often meet data sets with several dimensions. The simplest example is a school recording each student's test scores in several subjects. For such a data set we can of course calculate the variance of each dimension independently, but usually we also want to know how the dimensions relate to each other, and that is where covariance comes in. Covariance is a statistic that measures the relationship between two random variables, and it is defined as:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y.

From the definition of covariance we can also see some obvious properties, such as:

$$\mathrm{cov}(X, X) = \mathrm{var}(X)$$ (variance of X)
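The covariance of two variables can be sketched in MATLAB as follows. This is a minimal illustration, assuming the n-1 divisor used by the later code in this article and reusing the x and y values from the raw data. The names cxy and cxx are only illustrative.

```matlab
% Example data: the x and y values used in this article.
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
n = numel(x);

% Covariance of x and y: mean-adjusted products, summed, divided by n-1.
cxy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);

% Covariance of x with itself, which should equal the variance of x.
cxx = sum((x - mean(x)) .* (x - mean(x))) / (n - 1);

fprintf('cov(x,y) = %.4f\n', cxy);
fprintf('cov(x,x) = %.4f, var(x) = %.4f\n', cxx, var(x));
```

A positive cov(x,y) means that x and y tend to increase together, which is the case for this data set.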

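Note that covariance only handles two dimensions at a time, so a data set with more dimensions needs several covariances. For an n-dimensional data set (n here counts dimensions, not samples) there are n(n-1)/2 distinct pairs of dimensions, so it is natural to organize the values in a matrix. The covariance matrix is defined as:

$$C_{n \times n} = (c_{i,j}), \quad c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$$

This definition is easy to understand with a simple example. Assuming the data set has three dimensions x, y and z, the covariance matrix is

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$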

As can be seen, the covariance matrix is a symmetric matrix, and the diagonal is the variance on each dimension.
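To make the structure of the covariance matrix concrete, here is a small MATLAB sketch that fills the matrix entry by entry from the pairwise covariance. The 5-by-3 data matrix is a made-up example (rows are samples, columns are dimensions), and the loop is written for readability, not speed.

```matlab
% Hypothetical data set: 5 samples (rows), 3 dimensions (columns).
data = [2 4 1; 3 7 0; 5 6 2; 6 9 1; 9 8 4];
[n, d] = size(data);

C = zeros(d, d);
for i = 1:d
    for j = 1:d
        di = data(:, i) - mean(data(:, i));   % mean-adjusted dimension i
        dj = data(:, j) - mean(data(:, j));   % mean-adjusted dimension j
        C(i, j) = sum(di .* dj) / (n - 1);    % cov(Dim_i, Dim_j)
    end
end

disp(C);            % symmetric, with the variances on the diagonal
disp(cov(data));    % the built-in function should give the same matrix
```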

(2) The method for finding covariance matrices:

The covariance matrix calculates the covariance between different dimensions, rather than between different samples. Below we will use an example in MATLAB to explain in detail:

First, we randomly generate a 10-by-3 integer matrix as the sample set: 10 is the number of samples and 3 is the number of dimensions. Because the matrix is random, the numbers shown below come from one particular run, and yours will differ.
mysample = fix(rand(10,3)*50)

According to the formula, we need the mean to compute the covariance. Should the mean be taken by row or by column? This question bothered me at first. As emphasized above, the covariance matrix holds the covariance between different dimensions, so always keep this in mind. Each row of the sample matrix is a sample and each column is a dimension, so we calculate the mean by column. For convenience, we first assign the three dimensions to separate variables:

dim1 = mysample(:,1);
dim2 = mysample(:,2);
dim3 = mysample(:,3);

Calculate the covariance of dim1 and dim2, dim1 and dim3, and dim2 and dim3:

sum((dim1-mean(dim1)) .* (dim2-mean(dim2))) / (size(mysample,1)-1) % gets 74.5333
sum((dim1-mean(dim1)) .* (dim3-mean(dim3))) / (size(mysample,1)-1) % gets -10.0889
sum((dim2-mean(dim2)) .* (dim3-mean(dim3))) / (size(mysample,1)-1) % gets -10***000

With the off-diagonal entries done, the rest is easy. The diagonal of the covariance matrix holds the variance of each dimension, which we calculate in turn:

std(dim1)^2 % gets 108.3222
std(dim2)^2 % gets 260.6222
std(dim3)^2 % gets 94.1778

We now have all the values needed to fill in the covariance matrix, and we can call MATLAB's cov function to verify them:

cov(mysample)

The results are the same as our calculations, which confirms that they are correct. In practice, however, we usually do not compute the entries one at a time, but use the following simplified method:

Center the sample matrix, that is, subtract from each dimension the mean of that dimension. Then multiply the transpose of the centered matrix by the centered matrix itself and divide by (N-1), where N is the number of samples. This method is derived from the formula above, but it is less intuitive. Try it on a small matrix of your own to convince yourself. The MATLAB code is as follows:

X = mysample - repmat(mean(mysample), 10, 1); % center the sample matrix
C = (X' * X) ./ (size(X,1) - 1)
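The following sketch checks the simplified method against the built-in cov function. It is only an illustration: the sample is random, so the printed matrix changes from run to run, and the row count is read from the data instead of being hard-coded as 10.

```matlab
% Random 10-by-3 integer sample, as in the article (rows = samples).
mysample = fix(rand(10, 3) * 50);
n = size(mysample, 1);

% Center each column (dimension) by subtracting its mean.
X = mysample - repmat(mean(mysample), n, 1);

% Covariance matrix from the centered data.
C = (X' * X) / (n - 1);

disp(C);
% Largest absolute difference from the built-in result (should be about 0).
disp(max(max(abs(C - cov(mysample)))));
```

Note that the product is X' * X (dimensions by dimensions), not X * X', because each row of the sample matrix is a sample.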

(For readers who are not familiar with MATLAB, here is a short explanation of the functions used above. If you already know MATLAB, skip ahead.

B = repmat(A, m, n) % Tiles the matrix A into an m-by-n arrangement of blocks: B consists of m*n copies of A, and the size of B is [size(A,1)*m, size(A,2)*n].

B = mean(A) works as follows:

Suppose you have the matrix A = [1 2 3; 3 3 6; 4 6 8; 4 7 7];
mean(A) (default dim = 1) computes the mean of each column:
ans =
3.0000 4.5000 6.0000
mean(A, 2) computes the mean of each row:
ans =
2.0000
4.0000
6.0000
6.0000

size(A, n) % If you pass a second argument n equal to 1 or 2, size returns the number of rows or the number of columns of the matrix. r = size(A,1) returns the number of rows of A, and c = size(A,2) returns the number of columns of A.)
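The three helper functions can be tried together in a short MATLAB sketch. It reuses the matrix A from the explanation above. The variable names colMeans, rowMeans and centered are only illustrative.

```matlab
A = [1 2 3; 3 3 6; 4 6 8; 4 7 7];

% repmat: tile A into a 2-by-3 arrangement of blocks.
B = repmat(A, 2, 3);
disp(size(B));               % [size(A,1)*2, size(A,2)*3]

% mean: by column (default) and by row.
colMeans = mean(A);          % 1-by-3 row vector
rowMeans = mean(A, 2);       % 4-by-1 column vector

% size + repmat + mean together: center every column of A.
centered = A - repmat(colMeans, size(A, 1), 1);
disp(mean(centered));        % every column mean is now (numerically) zero
```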

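Having briefly covered the covariance matrix and how to compute it, we now use the simplified method above to find the covariance matrix of our sample data (x and y). Rounded to four decimal places, it is:

C = [0.6166 0.6154
     0.6154 0.7166]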
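For the two-dimensional sample data of this article, the simplified method can be sketched like this. The layout of the data matrix (one sample per row, x in the first column and y in the second) is an assumption made for this example.

```matlab
% The article's raw data, one sample per row.
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
data = [x y];
n = size(data, 1);

% Center each dimension, then form the covariance matrix.
X = data - repmat(mean(data), n, 1);
C = (X' * X) / (n - 1);

disp(C);
```

The off-diagonal entries are positive, so x and y increase together.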

3. Calculating the eigenvectors and eigenvalues of the covariance matrix

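Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues as follows:

[eigenvectors, eigenvalues] = eig(C) % C is the covariance matrix computed above

The columns of eigenvectors are the eigenvectors, and the diagonal of eigenvalues holds the corresponding eigenvalues. These eigenvectors are unit vectors, that is, their length is 1, which is important for PCA.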
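The properties of the eigenvectors can be checked numerically. The sketch below uses the article's sample data. The names lens, ortho and resid are only illustrative.

```matlab
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
C = cov([x y]);

% Columns of V are eigenvectors; D is diagonal with the eigenvalues.
[V, D] = eig(C);

lens  = sqrt(sum(V.^2, 1));    % length of each eigenvector (expected: 1)
ortho = V(:, 1)' * V(:, 2);    % dot product of the two (expected: 0)
resid = norm(C * V - V * D);   % how well C*V = V*D holds (expected: 0)

fprintf('lengths: %.4f %.4f\n', lens(1), lens(2));
fprintf('dot product: %.2e, residual: %.2e\n', ortho, resid);
```

The sign of each eigenvector is arbitrary, so different MATLAB versions may return a vector or its negative.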

4. Select components and form the pattern vector

After finding the eigenvalues and eigenvectors of the covariance matrix, sort the eigenvalues from large to small. This orders the components by importance. Now, if you like, you can ignore the less important components. This loses some information, but if the corresponding eigenvalues are small, you do not lose much. If you drop some components, the final data set has fewer dimensions: if the original data is n-dimensional and you keep the top p components, the resulting data is only p-dimensional. The next step is to form a pattern vector, which is just a fancy name for a matrix of vectors: it is built from all the eigenvectors you keep, each one forming a column of the matrix.

Our data set has two eigenvectors, so we have two choices. We can use both eigenvectors, which gives a pattern vector with two columns, or we can drop the eigenvector with the smaller eigenvalue, which gives a pattern vector with a single column.
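Sorting the components and building the pattern vector might look like the following in MATLAB. This is a sketch under assumptions: the variable names (vals, explained, featureVector) are made up for this example, and p = 1 is just the choice that matches dropping the smaller component.

```matlab
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
[V, D] = eig(cov([x y]));

% Sort eigenvalues from large to small and reorder the eigenvectors to match.
[vals, idx] = sort(diag(D), 'descend');
V = V(:, idx);

% Fraction of the total variance carried by each component.
explained = vals / sum(vals);

% Keep the top p components; each kept eigenvector is one column.
p = 1;
featureVector = V(:, 1:p);

fprintf('eigenvalues: %.4f %.4f\n', vals(1), vals(2));
fprintf('variance kept with p = %d: %.1f%%\n', p, 100 * sum(explained(1:p)));
disp(featureVector);
```

eig does not guarantee any particular ordering for PCA purposes, which is why the explicit sort is included.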

5. Get the data after dimensionality reduction

The reduced data is obtained as

$$\mathrm{FinalData} = \mathrm{RowFeatureVector} \times \mathrm{RowDataAdjust}$$

where RowFeatureVector is the transpose of the matrix whose columns are the pattern vectors, so its rows are the original pattern vectors, with the eigenvector of the largest eigenvalue in the top row. RowDataAdjust is the transpose of the mean-adjusted data matrix, so each column is a data item and each row is a dimension. For our sample, the first row holds the x-dimension data and the second row holds the y-dimension data. FinalData is the resulting data set, with data items in columns and dimensions along rows.
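The projection step can be sketched in MATLAB as below, keeping only the component with the largest eigenvalue. The variable names mirror the ones in the text (rowFeatureVector, rowDataAdjust, finalData) but are otherwise an assumption of this example.

```matlab
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
data = [x y];
n = size(data, 1);

% Mean-adjusted data, one sample per row.
dataAdjust = data - repmat(mean(data), n, 1);

% Eigenvectors sorted by eigenvalue, largest first.
[V, D] = eig(cov(data));
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);

rowFeatureVector = V(:, 1)';     % 1-by-2: kept eigenvector as a row
rowDataAdjust    = dataAdjust';  % 2-by-n: one dimension per row

finalData = rowFeatureVector * rowDataAdjust;   % 1-by-n reduced data
disp(finalData);
```

The variance of finalData equals the largest eigenvalue, which is a handy sanity check. The overall sign of finalData depends on the sign eig chose for the eigenvector.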

What does this give us? It gives us the original data expressed solely in terms of the vectors we chose. Our raw data has two axes (x and y), so it is expressed in terms of those two axes. We can express the data in terms of any two axes we like, and the expression is most efficient when the axes are orthogonal. This is why it matters that the eigenvectors of a covariance matrix, which is symmetric, can always be chosen orthogonal to each other. When we keep only one eigenvector, we have transformed our data from the original x-y representation to a representation along that single eigenvector.

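(Note: If you want to restore the original data, simply reverse the calculation, that is:

$$\mathrm{RowDataAdjust} = \mathrm{RowFeatureVector}^{T} \times \mathrm{FinalData}$$

and then add the mean of each dimension back. The transpose can be used in place of the inverse because the eigenvectors are orthogonal unit vectors. The recovery is exact only if all eigenvectors were kept; otherwise it is an approximation.)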
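A sketch of the reverse calculation in MATLAB, comparing the recovery with both components kept against the recovery with only the first component. The names recAll, rec1, errAll and err1 are made up for this example.

```matlab
x = [2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1]';
y = [2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9]';
data = [x y];
n = size(data, 1);
mu = mean(data);

rowDataAdjust = (data - repmat(mu, n, 1))';    % 2-by-n, mean-adjusted

[V, D] = eig(cov(data));
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);

% Forward and reverse with both components: exact recovery.
finalAll = V' * rowDataAdjust;
recAll = (V * finalAll)' + repmat(mu, n, 1);

% Forward and reverse with the first component only: approximation.
final1 = V(:, 1)' * rowDataAdjust;
rec1 = (V(:, 1) * final1)' + repmat(mu, n, 1);

errAll = max(max(abs(recAll - data)));
err1   = max(max(abs(rec1 - data)));
fprintf('max error, 2 components: %.2e\n', errAll);
fprintf('max error, 1 component:  %.4f\n', err1);
```

With one component, the recovered points all lie on a line through the mean along the first eigenvector. What is lost is the variation along the dropped eigenvector.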

By now, I hope you have a good grasp of PCA and its principles.

Http://www.360doc.com/content/14/0526/06/15831056_380900310.shtml


Code implementation:


function [y, P, V, mx] = getpca(x)
% GETPCA  PCA of the two-dimensional matrix x.
%   x : MxN matrix (M dimensions, N trials); each row is a dimension, each column is a sample
%   y : the result of the transformation, y = P*x (with x centered)
%   P : the transform matrix; its rows are the eigenvectors, ordered by eigenvalue from large to small
%   V : the variance vector (the eigenvalues sorted from large to small)
%   mx: the mean of each row, replicated into a matrix of the same size as x
[M, N] = size(x);
mx = mean(x, 2);          % mean of each row, i.e. of each dimension
mx = repmat(mx, 1, N);    % tile the mean column into 1-by-N blocks
x = x - mx;               % center: subtract from each dimension its mean
covx = x * x' / (N - 1);  % covariance: centered matrix times its transpose, divided by (N-1)
[P, V] = eig(covx);       % eigenvalues on the diagonal of V, eigenvectors in the columns of P
V = diag(V);              % extract the eigenvalues from the diagonal matrix into a vector
[t, ind] = sort(-V);      % ascending sort of the negated values, i.e. descending order; indices go to ind
V = V(ind);               % eigenvalues sorted from large to small
P = P(:, ind);            % eigenvectors reordered to match their eigenvalues
P = P';                   % one eigenvector per row
y = P * x;                % project the centered data
return;



