Machine Learning in Action with MATLAB (II): The PCA Algorithm


PCA (Principal Component Analysis) is an algorithm used mainly for data dimensionality reduction.

Why reduce the dimensionality of data? Because training data often has too many features, or features that are redundant, which makes it cumbersome to work with. For example:

    • In a dataset of car samples, one feature might be the maximum speed in km/h and another the maximum speed in mph; these two features are obviously strongly correlated.
    • A dataset may have very many features but very few samples; fitting, say, a regression model to such data is difficult and easily leads to overfitting.

The PCA algorithm addresses this problem. Its core idea is to map the n-dimensional features onto k dimensions (k < n), where the k new features are mutually orthogonal. Rather than simply keeping k of the original n features and discarding the remaining n − k, PCA constructs k entirely new features, the principal components, from the original n.
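In matrix form this can be stated compactly; the symbols below (X, W, Y) are introduced here only for illustration and are not part of the original article. Let X be the m × n matrix of mean-centered samples. PCA looks for an n × k matrix W with orthonormal columns and maps

\[ Y = XW, \qquad W^{\mathsf T}W = I_k , \]

so that each sample is described by the k new, mutually orthogonal features in Y instead of the original n features.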

The computational process of PCA

Suppose we have the following 2-dimensional data:

    x      y
    2.5    2.4
    0.5    0.7
    2.2    2.9
    1.9    2.2
    3.1    3.0
    2.3    2.7
    2.0    1.6
    1.0    1.1
    1.5    1.6
    1.1    0.9

Each row is a sample and each column is a feature: there are 10 samples, each with 2 features. We assume the two features are strongly correlated, and we want to reduce the dimensionality.

Step 1: compute the mean of x and of y, and subtract the corresponding mean from every sample.

Here the mean of x is 1.81 and the mean of y is 1.91. Subtracting the means gives:

    x        y
     0.69    0.49
    -1.31   -1.21
     0.39    0.99
     0.09    0.29
     1.29    1.09
     0.49    0.79
     0.19   -0.31
    -0.81   -0.81
    -0.31   -0.31
    -0.71   -1.01

Note that at this point we would normally also normalize each feature by its variance, so that every feature carries the same weight. Because the values of our two features are already on similar scales, this normalization step can be skipped here.

The preprocessing steps of the algorithm are:

  1. Compute the mean \( \mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \) of the samples.
  2. Replace every sample \( x^{(i)} \) with \( x^{(i)} - \mu \).
  3. Compute the variance of each feature, \( \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_j^{(i)}\big)^2 \).
  4. Replace every \( x_j^{(i)} \) with \( x_j^{(i)} / \sigma_j \).

In this example, steps 3 and 4 are not performed.
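A minimal MATLAB sketch of this preprocessing, using the 10 × 2 sample data from the table above (the variable names data, meanValue and normData are chosen here for illustration):

    % the sample data: one row per sample, one column per feature
    data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0;
            2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    meanValue = mean(data);                                 % 1 x 2 row of feature means: [1.81 1.91]
    normData  = data - repmat(meanValue, size(data,1), 1);  % steps 1-2: subtract the mean
    % steps 3-4 (variance normalization) would be:
    % sigma    = std(data, 1);                              % per-feature standard deviation
    % normData = normData ./ repmat(sigma, size(data,1), 1);
    % they are skipped in this example.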

Step 2: compute the covariance matrix of the features.

The formula is as follows:

\[ \mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}\big(X_i-\bar X\big)\big(Y_i-\bar Y\big)}{n-1} \]

For the mean-centered data above, the resulting covariance matrix is approximately

    0.6166    0.6154
    0.6154    0.7166
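In MATLAB the same matrix can be obtained with the built-in cov function or directly from the formula; a sketch, reusing normData from the preprocessing snippet above:

    covMat    = cov(normData);                     % built-in, uses the 1/(n-1) convention
    n         = size(normData, 1);
    covManual = (normData' * normData) / (n - 1);  % same result, computed from the formula
    % both are approximately [0.6166 0.6154; 0.6154 0.7166] for this data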


Step 3: compute the eigenvalues and eigenvectors of the covariance matrix.

For the covariance matrix above, the eigenvalues are approximately

    0.0490833989    and    1.28402771

with corresponding unit eigenvectors

    (-0.735178656, 0.677873399)ᵀ    and    (-0.677873399, -0.735178656)ᵀ

Step 4: sort the eigenvalues from largest to smallest, select the k largest, and use the corresponding k eigenvectors as column vectors to form the feature-vector (projection) matrix.

There are only two eigenvalues here, and we choose the larger one, 1.28402771; its corresponding eigenvector is

    (-0.677873399, -0.735178656)ᵀ
Note: when MATLAB's eig function is applied to the covariance matrix, the eigenvalues are returned as a diagonal matrix (the eigenvalues lie on the diagonal), and the i-th eigenvalue corresponds to the i-th column of the eigenvector matrix.
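A minimal sketch of steps 3 and 4 in MATLAB, continuing from the snippets above (W and order are illustrative names):

    [eigVect, eigVal] = eig(covMat);                       % eigenvectors as columns, eigenvalues on the diagonal
    [sortedVals, order] = sort(diag(eigVal), 'descend');   % largest eigenvalue first
    k = 1;                                                 % keep only the largest component
    W = eigVect(:, order(1:k));                            % the selected eigenvectors, as column vectors
    % sortedVals(1) is about 1.2840 here; note that eig may return an eigenvector
    % with either sign, so W can differ from the vector above by a factor of -1.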

Step 5: project the sample points onto the selected eigenvectors.

Let m be the number of samples and n the number of features. The mean-centered sample matrix is DataAdjust (m × n), the covariance matrix is n × n, and the matrix formed by the k selected eigenvectors is EigenVectors (n × k). The projected data FinalData is then:

FinalData (m × k) = DataAdjust (m × n) × EigenVectors (n × k)

The resulting projected values (one per sample) are approximately:

    -0.828
     1.778
    -0.992
    -0.274
    -1.676
    -0.913
     0.099
     1.145
     0.438
     1.224
In this way the n-dimensional features have been reduced to k dimensions; the k columns of FinalData are exactly the projections of the original features onto those k directions.
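The projection itself is a single matrix multiplication. The sketch below also shows how to map the k-dimensional data back into the original coordinate system, which is what the plotting script at the end of the article does (finalData and reconData are illustrative names; normData, W and meanValue come from the earlier snippets):

    finalData = normData * W;                                           % m x k projected data
    reconData = finalData * W' + repmat(meanValue, size(normData,1), 1); % back in the original space
    % reconData lies on the straight line through the data mean in the
    % direction of the chosen eigenvector.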

The PCA procedure looks very simple: compute the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But why are the eigenvectors of the covariance matrix the ideal k-dimensional basis? That question is answered by the theoretical foundations of PCA.

The theoretical basis of PCA

There are three main arguments for why the eigenvectors of the covariance matrix give the ideal k-dimensional features:

  1. The maximum variance theory
  2. The minimum error theory
  3. The coordinate-axis correlation theory

Here is a brief description of the maximum variance theory:

Maximum variance theory

In signal processing it is assumed that the signal has a large variance and the noise a small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is the better. We therefore consider the best k-dimensional representation to be the one for which, after converting the n-dimensional sample points to k dimensions, the sample variance along each new dimension is as large as possible.

The PCA projection can be illustrated with two figures (not reproduced here). In each one, the line drawn through the mean-centered sample points is a candidate eigenvector direction, and the dimensionality reduction consists of projecting the 2-dimensional points onto that line.

So the question arises: both figures show the data projected onto a line, so which projection is better?

According to the maximum variance theory the answer is the left figure: after projection the samples are more spread out, and therefore easier to tell apart.

From another perspective, the sum of the absolute distances from the points to the line is smaller in the left figure than in the right one; doesn't that feel a bit like least-squares regression? Viewed this way, this is the minimum error theory: choose the line that minimizes the projection error.

Returning to the left figure: the best u we are looking for is the direction of that best line, the one that maximizes the variance of the projected samples, or equivalently minimizes the projection error.
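The equivalence of the two criteria can be checked numerically. For mean-centered data, the average squared projection length onto a unit direction plus the average squared distance to the corresponding line is constant (Pythagoras), so maximizing one is the same as minimizing the other. A small sketch, reusing normData and W from the earlier snippets (u1, u2 and the other names are illustrative):

    u1 = W;                                     % the principal direction found above
    u2 = [-W(2); W(1)];                         % a unit vector orthogonal to it
    p1 = normData * u1;   p2 = normData * u2;   % projections onto each direction
    var1 = var(p1, 1);    var2 = var(p2, 1);    % projected variance (1/m normalization)
    err1 = mean(sum((normData - p1 * u1').^2, 2));   % mean squared distance to line u1
    err2 = mean(sum((normData - p2 * u2').^2, 2));   % mean squared distance to line u2
    % var1 > var2 while err1 < err2, and var1 + err1 == var2 + err2 up to rounding:
    % the direction of maximum projected variance is also the direction of minimum error.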

In addition, in the first step of the PCA algorithm we already subtracted each dimension's mean from the sample data, so every feature now has mean 0, and the projections onto a unit vector u also have mean 0. The variance of the projections is therefore:

\[ \frac{1}{m}\sum_{i=1}^{m}\Big({x^{(i)}}^{\mathsf T}u\Big)^2 = \frac{1}{m}\sum_{i=1}^{m} u^{\mathsf T}x^{(i)}{x^{(i)}}^{\mathsf T}u = u^{\mathsf T}\Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}{x^{(i)}}^{\mathsf T}\Big)u \]

The bracketed middle part of the last expression is exactly the covariance matrix Σ of the samples (recall that the mean of the \( x^{(i)} \) is 0):

\[ \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}{x^{(i)}}^{\mathsf T} \]

Since u is a unit vector, i.e. \( u^{\mathsf T}u = 1 \), we can write the projected variance as

\[ \lambda = u^{\mathsf T}\Sigma u \]

Maximizing λ over all unit vectors u (for example by introducing a Lagrange multiplier for the constraint \( u^{\mathsf T}u = 1 \) and setting the gradient with respect to u to zero) gives:

\[ \Sigma u = \lambda u \]

So we get: the best projection direction u is an eigenvector of the covariance matrix, and the eigenvalue λ it belongs to is exactly the variance of the projected data. The best line is therefore the eigenvector corresponding to the largest eigenvalue, followed by the eigenvector corresponding to the second-largest eigenvalue, and so on (the eigenvectors obtained in this way are mutually orthogonal). This matches the maximum variance theory above: we look for the projection line along which the samples are spread out the most.
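This conclusion is easy to verify numerically for the running example; a sketch using the variables from the earlier snippets (Sigma is an illustrative name):

    Sigma  = cov(normData);          % covariance matrix of the centered samples
    lambda = W' * Sigma * W;         % variance of the data projected onto W
    % Sigma * W equals lambda * W up to floating-point error, and lambda is
    % approximately 1.2840, the largest eigenvalue found in step 3.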

MATLAB implementation

    function [lowData, reconMat] = PCA(data, K)
    [row, ~] = size(data);
    meanValue = mean(data);                          % mean of each feature
    % varData = var(data, 1, 1);                     % (unused in this version)
    normData  = data - repmat(meanValue, [row, 1]);  % subtract the mean
    covMat = cov(normData);                          % obtain the covariance matrix
    [eigVect, eigVal] = eig(covMat);                 % extract eigenvalues and eigenvectors
    [~, IX] = sort(diag(eigVal), 'descend');         % eigenvalue indices, largest first
    len = min(K, length(IX));
    selVect = eigVect(:, IX(1:len));                 % the K selected eigenvectors
    lowData  = normData * selVect;                   % project onto the K directions
    reconMat = lowData * selVect' + repmat(meanValue, [row, 1]); % map the reduced-dimension data back to the original space
    end

Invocation example

    function testPCA
    %%
    clc
    clear
    close all
    %%
    filename = 'testSet.txt';
    K = 1;
    data = load(filename);
    [lowData, reconMat] = PCA(data, K);
    figure
    scatter(data(:,1), data(:,2), 5, 'r')       % original samples in red
    hold on
    scatter(reconMat(:,1), reconMat(:,2), 5)    % reconstructed samples on the principal direction
    hold off
    end
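As a usage note: eig does not fix the sign of an eigenvector, so the lowData returned by PCA may be the negative of the FinalData column computed by hand in step 5; the reconstructed points reconMat are unaffected. If testSet.txt is not at hand, the function can also be tried directly on the 10-sample data from this article (a hypothetical inline check):

    data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0;
            2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    [lowData, reconMat] = PCA(data, 1);
    % abs(lowData) should match the absolute values of the projections listed in
    % step 5 (0.828, 1.778, ...), and reconMat lies on the principal-direction line.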


