Principal Component Analysis (PCA)


Factor analysis is based on a probabilistic model whose parameters are estimated iteratively with the EM algorithm. Principal component analysis (PCA), by contrast, uses only a linear transformation: it approximates all of the variables with a few principal components in order to reduce the dimensionality of the data.

I. Normalization

The goal of normalization is to convert data from different scales to the same scale. The steps for normalization are as follows:

(1) Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$;

(2) Replace each $x^{(i)}$ with $x^{(i)} - \mu$;

(3) Let $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_j^{(i)}\big)^2$;

(4) Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.

Steps (1) and (2) shift the data so that its mean is zero; steps (3) and (4) rescale each attribute to unit variance, so that attributes measured on different scales are treated uniformly.

Steps (3) and (4) are not strictly necessary, and the results may differ depending on whether they are executed. Andrew Ng argues that if we have clear prior knowledge that all attributes are already on the same scale, there is no need for steps (3) and (4); for example, every pixel of a grayscale image takes a value in the set {0, 1, ..., 255}. A small code sketch of these steps follows.
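
As a minimal sketch (my own, using NumPy; not code from the original article), the normalization steps above might look like this:

```python
import numpy as np

def normalize(X):
    """Zero-center each attribute and scale it to unit variance.

    X is an (m, n) array: m samples, n attributes.
    """
    mu = X.mean(axis=0)                              # steps (1)-(2): subtract the mean
    X_centered = X - mu
    sigma = np.sqrt((X_centered ** 2).mean(axis=0))  # step (3): per-attribute scale
    sigma[sigma == 0] = 1.0                          # guard against constant attributes
    return X_centered / sigma                        # step (4): unit variance

# Example usage on synthetic data with very different attribute scales
X = np.random.rand(100, 3) * [1.0, 10.0, 100.0]
X_norm = normalize(X)
print(X_norm.mean(axis=0), X_norm.std(axis=0))       # roughly zeros and ones
```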

II. Examples

For an intuitive understanding of principal component analysis, we take two-dimensional data as an example. Assume that Figure 1 shows data that has already been normalized:


Figure 1

Suppose we project the data onto the straight line shown in Figure 2. We can see that the projections of the original points onto this line have a large variance.


Figure 2

Suppose instead we project the data onto the straight line shown in Figure 3. The projections of the original points onto this line have a much smaller variance.


Figure 3

Therefore the data vary greatly along the direction of the line in Figure 2, but only slightly along the direction of the line in Figure 3. One can think of the data cloud as a flattened shape stretched along the line in Figure 2: its width is very small and can be approximately ignored. The direction in Figure 2 therefore better represents the trend of the original points, as the sketch below illustrates numerically.
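
As a small numeric illustration (my own sketch; the data and the two candidate directions are stand-ins for Figures 1-3, not taken from the article), we can compare the variance of the projections along the two lines:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, already-normalized 2-D data stretched along the 45-degree direction
t = rng.normal(size=500)
noise = rng.normal(scale=0.1, size=500)
X = np.column_stack([t + noise, t - noise])
X -= X.mean(axis=0)

u_long = np.array([1.0, 1.0]) / np.sqrt(2)    # direction of the line in Figure 2
u_short = np.array([1.0, -1.0]) / np.sqrt(2)  # direction of the line in Figure 3

# Variance of the projections x^T u along each direction
print("variance along the Figure 2 direction:", np.var(X @ u_long))
print("variance along the Figure 3 direction:", np.var(X @ u_short))
```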

III. The Algorithm

The purpose of principal component analysis is to find the vector onto which the projected data points have maximum variance; that vector is what we are solving for.

Assume that each sample $x^{(i)}$ has been normalized so that the data have zero mean, and let $u$ be a unit vector. The projection of a point $x$ onto $u$ can be expressed as $x^{T}u$. We want to choose $u$ so that the variance of the projections of all the points onto $u$ is maximized:

$$\frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)T}u\big)^{2} = u^{T}\Big(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\Big)u = u^{T}\Sigma u,$$

where $\Sigma = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}$ is the sample covariance matrix. Equivalently, we can regard this as the following optimization problem:

$$\max_{u}\ u^{T}\Sigma u \quad \text{subject to} \quad u^{T}u = 1.$$

Form the Lagrangian:

$$\mathcal{L}(u,\lambda) = u^{T}\Sigma u - \lambda\,(u^{T}u - 1).$$

Setting the derivative with respect to $u$ to zero:

$$\nabla_{u}\mathcal{L} = 2\Sigma u - 2\lambda u = 0 \;\;\Longrightarrow\;\; \Sigma u = \lambda u.$$

From the above, the problem reduces to finding the eigenvalues of the covariance matrix. Because the covariance matrix is symmetric and positive semi-definite, it has $n$ non-negative eigenvalues, which we sort from largest to smallest. If an eigenvalue is 0, the corresponding component carries no variance and need not be considered; if an eigenvalue is very small, its influence can likewise be ignored. Once the eigenvalues are found, the eigenvector corresponding to each eigenvalue can be obtained.
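
A minimal sketch of this eigendecomposition step (my own, using NumPy's eigh routine for symmetric matrices):

```python
import numpy as np

def pca_eig(X_norm):
    """Eigen-decompose the sample covariance matrix of normalized data.

    X_norm is an (m, n) array with zero-mean columns.
    Returns the eigenvalues (descending) and the matching eigenvectors as columns.
    """
    m = X_norm.shape[0]
    cov = (X_norm.T @ X_norm) / m           # Sigma = (1/m) * sum of x x^T
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
    return eigvals[order], eigvecs[:, order]
```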

The cumulative contribution rate is introduced below:

Suppose the covariance matrix has $n$ eigenvalues sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. The contribution rate of the $i$-th principal component is $\lambda_i / \sum_{j=1}^{n}\lambda_j$, and the sum of the contribution rates of the first $k$ components, $\sum_{i=1}^{k}\lambda_i \big/ \sum_{j=1}^{n}\lambda_j$, is called the cumulative contribution rate of the first $k$ principal components. It indicates how well the first $k$ principal components can represent the original data; the corresponding eigenvectors are called the first $k$ principal components. We determine the number of principal components by setting a threshold on the cumulative contribution rate.

Once the value of $k$ is determined, a new vector $y = (u_1^{T}x, \dots, u_k^{T}x)^{T}$ is obtained from $x$ by a linear transformation; its dimension $k$ is smaller than the dimension $n$ of $x$.
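
A sketch of choosing $k$ by the cumulative contribution rate and performing the linear transformation (again my own code, reusing the hypothetical pca_eig helper above):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose cumulative contribution rate reaches the threshold."""
    ratios = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(eigvals))

def project(X_norm, eigvecs, k):
    """Linear transformation y = U_k^T x for every sample (rows of X_norm)."""
    return X_norm @ eigvecs[:, :k]

# Example usage with the pca_eig sketch above:
# eigvals, eigvecs = pca_eig(X_norm)
# k = choose_k(eigvals, threshold=0.95)
# Y = project(X_norm, eigvecs, k)   # (m, k) reduced-dimension data
```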

IV. Applications

There are three main functions of principal component analysis:

(1) Data compression

By compressing high-dimensional data into two or three dimensions, the data can be visualized, helping its users grasp the characteristics and patterns in the data more clearly and intuitively.

(2) Dimension reduction

When the dimensionality of the data is very large, computing on the high-dimensional data may consume a great deal of computational resources. PCA reduces the dimensionality, which lowers the computational cost and also helps to avoid overfitting.

(3) Noise reduction

PCA can also be seen as a noise reduction algorithm: through PCA we keep the principal features that represent the data as a whole and discard the minor components, which are often dominated by noise. A sketch of this idea follows.
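
As an illustration (my own sketch, reusing the hypothetical helpers defined earlier), compression and noise reduction both amount to reconstructing the data from only the first $k$ components:

```python
import numpy as np

def reconstruct(Y, eigvecs, k):
    """Map reduced data y back to the original space: x_approx = U_k y."""
    return Y @ eigvecs[:, :k].T

# Example usage with the earlier sketches:
# Y = project(X_norm, eigvecs, k)         # compress to k dimensions
# X_approx = reconstruct(Y, eigvecs, k)   # denoised / compressed approximation of X_norm
```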

V. Summary

Principal component analysis finds the eigenvalues and eigenvectors of the covariance matrix of the original data, sorts the eigenvalues from largest to smallest, and then applies the linear transformation defined by the corresponding eigenvectors to obtain new vectors (which are mutually orthogonal). By setting a threshold on the cumulative contribution rate, a new lower-dimensional vector can approximately represent the original high-dimensional vector (when the covariance matrix is non-singular); if the covariance matrix is singular and has many zero eigenvalues, the lower-dimensional vector can even represent the high-dimensional original vector exactly.
