Machine Learning in Action with MATLAB (II): The PCA Algorithm


PCA (Principal Component Analysis) is an algorithm used mainly for data dimensionality reduction.

Why reduce the dimensionality of data? Because training data often has too many features, or features that are redundant, which makes it cumbersome to work with. For example:

    • In a dataset of car samples, one feature might be the maximum speed in km/h and another the maximum speed in mph; these two features are obviously strongly correlated.
    • A dataset may have very many features but very few samples; fitting, say, a regression model to such data is difficult and easily leads to overfitting.

The PCA algorithm addresses this problem. Its core idea is to map the n-dimensional features onto k dimensions (k < n), where the k new features are mutually orthogonal. Rather than simply keeping k of the original n features and discarding the remaining n − k, PCA constructs k entirely new features, the principal components, from the original n.
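In matrix form this can be stated compactly; the symbols below (X, W, Y) are introduced here only for illustration and are not part of the original article. Let X be the m × n matrix of mean-centered samples. PCA looks for an n × k matrix W with orthonormal columns and maps

\[ Y = XW, \qquad W^{\mathsf T}W = I_k , \]

so that each sample is described by the k new, mutually orthogonal features in Y instead of the original n features.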

The computational process of PCA

Suppose we have the following 2-dimensional data:

    x      y
    2.5    2.4
    0.5    0.7
    2.2    2.9
    1.9    2.2
    3.1    3.0
    2.3    2.7
    2.0    1.6
    1.0    1.1
    1.5    1.6
    1.1    0.9

Each row is a sample and each column is a feature: there are 10 samples, each with 2 features. We assume the two features are strongly correlated, and we want to reduce the dimensionality.

Step 1: compute the mean of x and of y, and subtract the corresponding mean from every sample.

Here the mean of x is 1.81 and the mean of y is 1.91. Subtracting the means gives:

    x        y
     0.69    0.49
    -1.31   -1.21
     0.39    0.99
     0.09    0.29
     1.29    1.09
     0.49    0.79
     0.19   -0.31
    -0.81   -0.81
    -0.31   -0.31
    -0.71   -1.01

Note that at this point we would normally also normalize each feature by its variance, so that every feature carries the same weight. Because the values of our two features are already on similar scales, this normalization step can be skipped here.

The preprocessing steps of the algorithm are:

  1. Compute the mean \( \mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \) of the samples.
  2. Replace every sample \( x^{(i)} \) with \( x^{(i)} - \mu \).
  3. Compute the variance of each feature, \( \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_j^{(i)}\big)^2 \).
  4. Replace every \( x_j^{(i)} \) with \( x_j^{(i)} / \sigma_j \).

In this example, steps 3 and 4 are not performed.
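A minimal MATLAB sketch of this preprocessing, using the 10 × 2 sample data from the table above (the variable names data, meanValue and normData are chosen here for illustration):

    % the sample data: one row per sample, one column per feature
    data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0;
            2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    meanValue = mean(data);                                 % 1 x 2 row of feature means: [1.81 1.91]
    normData  = data - repmat(meanValue, size(data,1), 1);  % steps 1-2: subtract the mean
    % steps 3-4 (variance normalization) would be:
    % sigma    = std(data, 1);                              % per-feature standard deviation
    % normData = normData ./ repmat(sigma, size(data,1), 1);
    % they are skipped in this example.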

Step 2: compute the covariance matrix of the features.

The formula is as follows:

\[ \mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}\big(X_i-\bar X\big)\big(Y_i-\bar Y\big)}{n-1} \]

For the mean-centered data above, the resulting covariance matrix is approximately

    0.6166    0.6154
    0.6154    0.7166
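In MATLAB the same matrix can be obtained with the built-in cov function or directly from the formula; a sketch, reusing normData from the preprocessing snippet above:

    covMat    = cov(normData);                     % built-in, uses the 1/(n-1) convention
    n         = size(normData, 1);
    covManual = (normData' * normData) / (n - 1);  % same result, computed from the formula
    % both are approximately [0.6166 0.6154; 0.6154 0.7166] for this data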


Step 3: compute the eigenvalues and eigenvectors of the covariance matrix.

For the covariance matrix above, the eigenvalues are approximately

    0.0490833989    and    1.28402771

with corresponding unit eigenvectors

    (-0.735178656, 0.677873399)ᵀ    and    (-0.677873399, -0.735178656)ᵀ

Step 4: sort the eigenvalues from largest to smallest, select the k largest, and use the corresponding k eigenvectors as column vectors to form the feature-vector (projection) matrix.

There are only two eigenvalues here, and we choose the larger one, 1.28402771; its corresponding eigenvector is

    (-0.677873399, -0.735178656)ᵀ
Note: when MATLAB's eig function is applied to the covariance matrix, the eigenvalues are returned as a diagonal matrix (the eigenvalues lie on the diagonal), and the i-th eigenvalue corresponds to the i-th column of the eigenvector matrix.
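A minimal sketch of steps 3 and 4 in MATLAB, continuing from the snippets above (W and order are illustrative names):

    [eigVect, eigVal] = eig(covMat);                       % eigenvectors as columns, eigenvalues on the diagonal
    [sortedVals, order] = sort(diag(eigVal), 'descend');   % largest eigenvalue first
    k = 1;                                                 % keep only the largest component
    W = eigVect(:, order(1:k));                            % the selected eigenvectors, as column vectors
    % sortedVals(1) is about 1.2840 here; note that eig may return an eigenvector
    % with either sign, so W can differ from the vector above by a factor of -1.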

Step 5: project the sample points onto the selected eigenvectors.

Let m be the number of samples and n the number of features. The mean-centered sample matrix is DataAdjust (m × n), the covariance matrix is n × n, and the matrix formed by the k selected eigenvectors is EigenVectors (n × k). The projected data FinalData is then:

FinalData (m × k) = DataAdjust (m × n) × EigenVectors (n × k)

The resulting projected values (one per sample) are approximately:

    -0.828
     1.778
    -0.992
    -0.274
    -1.676
    -0.913
     0.099
     1.145
     0.438
     1.224
In this way the n-dimensional features have been reduced to k dimensions; the k columns of FinalData are exactly the projections of the original features onto those k directions.
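The projection itself is a single matrix multiplication. The sketch below also shows how to map the k-dimensional data back into the original coordinate system, which is what the plotting script at the end of the article does (finalData and reconData are illustrative names; normData, W and meanValue come from the earlier snippets):

    finalData = normData * W;                                           % m x k projected data
    reconData = finalData * W' + repmat(meanValue, size(normData,1), 1); % back in the original space
    % reconData lies on the straight line through the data mean in the
    % direction of the chosen eigenvector.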

The PCA procedure looks very simple: compute the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But why are the eigenvectors of the covariance matrix the ideal k-dimensional basis? That question is answered by the theoretical foundations of PCA.

The theoretical basis of PCA

There are three main arguments for why the eigenvectors of the covariance matrix give the ideal k-dimensional features:

  1. The maximum variance theory
  2. The minimum error theory
  3. The coordinate-axis correlation theory

Here is a brief description of the maximum variance theory:

Maximum variance theory

In signal processing it is assumed that the signal has a large variance and the noise a small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is the better. We therefore consider the best k-dimensional representation to be the one for which, after converting the n-dimensional sample points to k dimensions, the sample variance along each new dimension is as large as possible.

The PCA projection can be illustrated with two figures (not reproduced here). In each one, the line drawn through the mean-centered sample points is a candidate eigenvector direction, and the dimensionality reduction consists of projecting the 2-dimensional points onto that line.

So the question arises: both figures show the data projected onto a line, so which projection is better?

According to the maximum variance theory the answer is the left figure: after projection the samples are more spread out, and therefore easier to tell apart.

From another perspective, the sum of the absolute distances from the points to the line is smaller in the left figure than in the right one; doesn't that feel a bit like least-squares regression? Viewed this way, this is the minimum error theory: choose the line that minimizes the projection error.

Returning to the left figure: the best u we are looking for is the direction of that best line, the one that maximizes the variance of the projected samples, or equivalently minimizes the projection error.
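The equivalence of the two criteria can be checked numerically. For mean-centered data, the average squared projection length onto a unit direction plus the average squared distance to the corresponding line is constant (Pythagoras), so maximizing one is the same as minimizing the other. A small sketch, reusing normData and W from the earlier snippets (u1, u2 and the other names are illustrative):

    u1 = W;                                     % the principal direction found above
    u2 = [-W(2); W(1)];                         % a unit vector orthogonal to it
    p1 = normData * u1;   p2 = normData * u2;   % projections onto each direction
    var1 = var(p1, 1);    var2 = var(p2, 1);    % projected variance (1/m normalization)
    err1 = mean(sum((normData - p1 * u1').^2, 2));   % mean squared distance to line u1
    err2 = mean(sum((normData - p2 * u2').^2, 2));   % mean squared distance to line u2
    % var1 > var2 while err1 < err2, and var1 + err1 == var2 + err2 up to rounding:
    % the direction of maximum projected variance is also the direction of minimum error.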

In addition, in the first step of the PCA algorithm we already subtracted each dimension's mean from the sample data, so every feature now has mean 0, and the projections onto a unit vector u also have mean 0. The variance of the projections is therefore:

\[ \frac{1}{m}\sum_{i=1}^{m}\Big({x^{(i)}}^{\mathsf T}u\Big)^2 = \frac{1}{m}\sum_{i=1}^{m} u^{\mathsf T}x^{(i)}{x^{(i)}}^{\mathsf T}u = u^{\mathsf T}\Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}{x^{(i)}}^{\mathsf T}\Big)u \]

The bracketed middle part of the last expression is exactly the covariance matrix Σ of the samples (recall that the mean of the \( x^{(i)} \) is 0):

\[ \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}{x^{(i)}}^{\mathsf T} \]

Since u is a unit vector, i.e. \( u^{\mathsf T}u = 1 \), we can write the projected variance as

\[ \lambda = u^{\mathsf T}\Sigma u \]

Maximizing λ over all unit vectors u (for example by introducing a Lagrange multiplier for the constraint \( u^{\mathsf T}u = 1 \) and setting the gradient with respect to u to zero) gives:

\[ \Sigma u = \lambda u \]

So we get: the best projection direction u is an eigenvector of the covariance matrix, and the eigenvalue λ it belongs to is exactly the variance of the projected data. The best line is therefore the eigenvector corresponding to the largest eigenvalue, followed by the eigenvector corresponding to the second-largest eigenvalue, and so on (the eigenvectors obtained in this way are mutually orthogonal). This matches the maximum variance theory above: we look for the projection line along which the samples are spread out the most.
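This conclusion is easy to verify numerically for the running example; a sketch using the variables from the earlier snippets (Sigma is an illustrative name):

    Sigma  = cov(normData);          % covariance matrix of the centered samples
    lambda = W' * Sigma * W;         % variance of the data projected onto W
    % Sigma * W equals lambda * W up to floating-point error, and lambda is
    % approximately 1.2840, the largest eigenvalue found in step 3.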

MATLAB implementation

    function [lowData, reconMat] = PCA(data, K)
    [row, ~] = size(data);
    meanValue = mean(data);                          % mean of each feature
    % varData = var(data, 1, 1);                     % (unused in this version)
    normData  = data - repmat(meanValue, [row, 1]);  % subtract the mean
    covMat = cov(normData);                          % obtain the covariance matrix
    [eigVect, eigVal] = eig(covMat);                 % extract eigenvalues and eigenvectors
    [~, IX] = sort(diag(eigVal), 'descend');         % eigenvalue indices, largest first
    len = min(K, length(IX));
    selVect = eigVect(:, IX(1:len));                 % the K selected eigenvectors
    lowData  = normData * selVect;                   % project onto the K directions
    reconMat = lowData * selVect' + repmat(meanValue, [row, 1]); % map the reduced-dimension data back to the original space
    end

Invocation example

    function testPCA
    %%
    clc
    clear
    close all
    %%
    filename = 'testSet.txt';
    K = 1;
    data = load(filename);
    [lowData, reconMat] = PCA(data, K);
    figure
    scatter(data(:,1), data(:,2), 5, 'r')       % original samples in red
    hold on
    scatter(reconMat(:,1), reconMat(:,2), 5)    % reconstructed samples on the principal direction
    hold off
    end
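As a usage note: eig does not fix the sign of an eigenvector, so the lowData returned by PCA may be the negative of the FinalData column computed by hand in step 5; the reconstructed points reconMat are unaffected. If testSet.txt is not at hand, the function can also be tried directly on the 10-sample data from this article (a hypothetical inline check):

    data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0;
            2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    [lowData, reconMat] = PCA(data, 1);
    % abs(lowData) should match the absolute values of the projections listed in
    % step 5 (0.828, 1.778, ...), and reconMat lies on the principal-direction line.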


