Principal Component Analysis (PCA)


Principal component analysis is mainly used for dimensionality reduction of data. The raw data may have many feature dimensions, but these features are not all equally important; if we can streamline the features, we can reduce storage space and possibly also reduce noise interference in the data.


For example, here is a set of data, shown in Table 1 below:

Table 1:

x:  2.5   1.2  -2.3  -2.8  -1.0   0.3
y:  3.3   0.8  -1.8  -2.5  -0.8   0.2

Each column represents one data sample and each row represents one feature, so there are 2 features; that is, the data is 2-dimensional. We want to reduce the complexity of the data as much as possible while ensuring that the information in the data is not lost.

Plot the 2-dimensional data in a coordinate system. If we take a new axis along the direction in which the points spread out (the red axis in the original figure) as the x-axis, with the y-axis perpendicular to it, then in the new coordinates the values in the x-axis direction vary widely while the values in the y-axis direction stay close to zero. Described in terms of variance: the variance in the x-axis direction is large, and the variance in the y-axis direction is small. So if a small loss of information is acceptable, we can keep only the values in the x-axis direction, and the data goes from 2 dimensions to 1. This is the main principle of principal component analysis: an "axis transformation" that "projects" the original 2-dimensional data onto a 1-dimensional axis. The key question, then, is how to find this new axis.

Intuitively, since we want to keep as much information as possible, the distribution of the data points in the direction perpendicular to the "new axis" should be as "flat" as possible; in other words, the points should be spread out as much as possible along the "new axis". In mathematical language, the variance of the projections of the data points onto the "new axis" should be as large as possible. That way, distinct points do not overlap after projection, and the maximum amount of information is preserved. Dimensions with large variance are kept and dimensions with smaller variance are dropped, which reduces the dimensionality. The variance formula is given below:

$\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
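As a small illustration (my own sketch, not from the original post; the angle below is just an arbitrary candidate), this MATLAB code projects the Table 1 points onto one unit direction and measures the variance of the projections. The best "new axis" is the direction that maximizes this variance:

    % Table 1 data: rows = features (x and y), columns = samples
    X = [2.5 1.2 -2.3 -2.8 -1.0 0.3;
         3.3 0.8 -1.8 -2.5 -0.8 0.2];
    A = X - mean(X, 2);              % center each feature (implicit expansion, R2016b+)

    theta = pi / 4;                  % candidate axis direction (arbitrary example)
    u = [cos(theta); sin(theta)];    % unit vector along the candidate axis
    proj = u' * A;                   % projections of the data points onto the axis
    disp(var(proj))                  % variance of the projections (n-1 normalization)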

Is maximizing the variance alone enough? Consider reducing a data set from 3 dimensions to 2. That means finding 2 "new axes". How should we choose them? From the discussion above, the projections of the data onto each new axis should be as spread out as possible, that is, the variance should be as large as possible. But in addition, the two axes must be orthogonal (you can think of this as perpendicular).

Why does it have to be this way?

Because if the two axes are not perpendicular to each other, one axis has a component along the other, so the "information" expressed by the two axes overlaps. To make the information expressed by the two axes independent of each other, the "new axes" must be orthogonal.

Expressed in mathematical language, the projections of all the data onto the two "new axes" have covariance 0; that is, the covariance of the two feature row vectors is 0. Such "new axes" are also called a basis, and this axis transformation is also called a change of basis.

The covariance formula is given below:

$\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

(x and y denote the data of two dimensions, corresponding to the two rows in Table 1.)

To summarize, there are two requirements:

1) The variance of each row vector is as large as possible: Cov(x, x) and Cov(y, y) (the variances).

2) The covariance between different row vectors is 0: Cov(x, y) = Cov(y, x) = 0.

Here each row vector is the vector of projected values on one "new axis" after the data has been reduced in dimension.

We combine these two requirements into one matrix, called the covariance matrix. Our goal is to make all its off-diagonal entries (the covariances) 0, and on the diagonal to keep the dimensions with large variance while ignoring the dimensions with smaller variance, thereby reducing the dimensionality.

So how can this covariance matrix be obtained quickly?

From the covariance formula we can see how: subtract from each row of the original matrix the mean of that row to obtain a matrix A, then multiply A by its transpose, and finally divide by n - 1 (where n is the number of columns, i.e. the number of samples):

$C = \frac{1}{n-1} A A^{\mathsf{T}}$
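As a quick sanity check (my own sketch, not from the original post), the following MATLAB code computes the covariance matrix of the Table 1 data this way and compares it against MATLAB's built-in cov:

    % Table 1 data: each row is a feature, each column a sample
    X = [2.5 1.2 -2.3 -2.8 -1.0 0.3;
         3.3 0.8 -1.8 -2.5 -0.8 0.2];

    n = size(X, 2);                  % number of samples (columns)
    A = X - mean(X, 2);              % subtract each row's mean (implicit expansion)
    C = A * A' / (n - 1);            % covariance matrix

    % cov expects samples in rows, so pass the transpose; difference should be ~0
    disp(norm(C - cov(X')))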

So how do we obtain the "new axes" (a set of basis vectors) and the representation of the original data on those new axes?
(The following largely quotes Zhang Yang's blog, because his explanation is excellent.)
Let the original data matrix X (with each row already centered, i.e. its mean subtracted) have covariance matrix C, and let P be a matrix whose rows are a set of basis vectors. Set Y = PX; then Y is the data after changing X to the basis P (Y is the compressed data). Writing the covariance matrix of Y as D, we can derive the relationship between D and C:

$D = \frac{1}{n-1} Y Y^{\mathsf{T}} = \frac{1}{n-1}(PX)(PX)^{\mathsf{T}} = P\left(\frac{1}{n-1} X X^{\mathsf{T}}\right)P^{\mathsf{T}} = P C P^{\mathsf{T}}$

Now things are clear! The P we are looking for is nothing other than a matrix that diagonalizes the original covariance matrix. In other words, the optimization goal becomes: find a matrix P such that $P C P^{\mathsf{T}}$ is a diagonal matrix whose diagonal elements are arranged from largest to smallest. Then the first k rows of P are the basis we are after, and multiplying the matrix formed by the first k rows of P with X reduces X from N dimensions to k dimensions while satisfying the optimization conditions.

(To see more detailed explanations, please visit http://blog.codinglabs.org/articles/pca-tutorial.html)

As for how to find the matrix P: because the covariance matrix C is a real symmetric matrix, we only need to compute the eigenvalues and corresponding eigenvectors of C, arrange the eigenvectors as rows from top to bottom in decreasing order of their eigenvalues, and take the first k rows to form the matrix P.
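Here is a minimal sketch of this step (my own illustration; the matrix values are made up): for a real symmetric C, MATLAB's eig returns orthonormal eigenvectors, and stacking them as rows of P makes P*C*P' diagonal:

    C = [4 2; 2 3];                      % an arbitrary real symmetric matrix
    [V, D] = eig(C);                     % columns of V: orthonormal eigenvectors
    [~, idx] = sort(diag(D), 'descend'); % order by eigenvalue, largest first
    P = V(:, idx)';                      % rows of P are the eigenvectors
    disp(P * C * P')                     % diagonal matrix, descending eigenvalues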

Having introduced the principle of principal component analysis, let us now walk through the computational process (using MATLAB):

1) Initialize the matrix.

2) Subtract from each row of the matrix the mean of that row.

3) Compute the covariance matrix C.

4) Solve for the eigenvalues and eigenvectors of the covariance matrix.

5) Form P from the eigenvectors, one eigenvector per row.

6) Since 38.0823 > 0.5460, keep the second eigenvector (-0.7175, -0.6966), the one corresponding to the larger eigenvalue.

7) Compute the compressed data Y (a code sketch of all these steps follows below).
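The original post showed each step as a screenshot, which is not reproduced here; the eigenvalues 38.0823 and 0.5460 quoted above came from its own example matrix. As a substitute, here is a minimal MATLAB sketch (my own, not the original code) that applies the same steps to the Table 1 data, so the printed numbers will differ from those quoted above:

    % Step 1: initialize the matrix (Table 1: rows = features, columns = samples)
    X = [2.5 1.2 -2.3 -2.8 -1.0 0.3;
         3.3 0.8 -1.8 -2.5 -0.8 0.2];

    % Step 2: subtract from each row the mean of that row
    A = X - mean(X, 2);              % implicit expansion (R2016b+)

    % Step 3: covariance matrix C
    n = size(X, 2);                  % number of samples (columns)
    C = A * A' / (n - 1);

    % Step 4: eigenvalues and eigenvectors of C
    [V, D] = eig(C);

    % Steps 5-6: keep the eigenvector with the largest eigenvalue as the row of P
    [~, idx] = sort(diag(D), 'descend');
    P = V(:, idx(1))';               % 1 x 2 row vector: first principal direction

    % Step 7: compressed data (2-D reduced to 1-D)
    Y = P * A;
    disp(Y)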

Reference blogs:

http://blog.codinglabs.org/articles/pca-tutorial.html

http://www.360doc.com/content/14/0526/06/15831056_380900310.shtml

