Principal Component Analysis (PCA) - Maximum Variance Interpretation

Reprint Address: http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html

1. The Problem

Real-world training data always comes with problems of one kind or another:

1. For example, suppose we collect a sample of cars that contains both a maximum-speed feature measured in kilometers per hour and a maximum-speed feature measured in miles per hour. These two features are obviously redundant.

2. Suppose we have the final exam results of undergraduates in a mathematics department, with three columns: one for the degree of interest in mathematics, one for review time, and one for the exam score. We know that learning math well requires strong interest, so the second column is strongly correlated with the first, and the third is strongly correlated with the second. Can the first and second columns be merged?

3. Suppose we have a sample set with a great many features but very few samples, so that fitting a regression directly is difficult and prone to overfitting. For example, consider housing prices in Beijing: suppose the features of a house are (size, location, orientation, whether it is in a school district, year of construction, whether it is second-hand, number of floors, which floor it is on), yet we have fewer than 10 sample houses. Fitting house prices with so many features will lead to overfitting.

4. This is somewhat similar to the second example. Suppose we build a document-term matrix in information retrieval that contains the two terms "learn" and "study". In the traditional vector space model the two are treated as independent, but semantically they are similar and their frequencies are similar as well. Can they be merged into a single feature?

5. In signal transmission, because the channel is not ideal, the signal received at the other end of the channel is disturbed by noise. How can this noise be filtered out?

Recall our earlier introduction to model selection and regularization, which touched on the problem of feature selection. The features removed there, however, were mainly those unrelated to the class label: for example, a student's name has nothing to do with his grade, and such features can be dropped using methods like mutual information.

Here, by contrast, many features are related to the class label but contain noise or redundancy. In this situation we need a dimensionality reduction method to reduce the number of features, reduce noise and redundancy, and lower the likelihood of overfitting.

Below we discuss a method called Principal Component Analysis (PCA) to address some of the problems above. The idea of PCA is to map the n-dimensional features onto k dimensions (k < n), which form a new set of orthogonal features. These k dimensions are called the principal components; they are newly constructed k-dimensional features, not simply the n-dimensional features with n-k of the dimensions removed.
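At a high level, the whole procedure can be performed with an off-the-shelf library. Below is a minimal sketch using scikit-learn's PCA on made-up data (the array contents and the choice of k = 2 are purely illustrative assumptions); the rest of the article walks through the same computation by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 10 samples with n = 5 features each (illustrative only).
X = np.random.RandomState(0).randn(10, 5)

pca = PCA(n_components=2)            # keep k = 2 principal components
X_reduced = pca.fit_transform(X)     # centered and projected data, shape (10, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_) # fraction of variance captured by each component
```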

2. PCA Calculation Process

First, we introduce the calculation process of PCA:

Let's say we get 2-dimensional data like this:

Each row is a sample and each column a feature; there are 10 samples, each with two features. You can think of these as 10 documents, where x is the TF-IDF of "learn" in each document and y is the TF-IDF of "study"; or as 10 cars, where x is the speed in km/h and y is the speed in miles per hour; and so on.

The first step is to compute the mean of x and the mean of y, and then subtract the corresponding mean from every example. Here the mean of x is 1.81 and the mean of y is 1.91, so the first sample minus the means becomes (0.69, 0.49), and similarly for the rest.
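A sketch of this step in NumPy is given below. The 10 (x, y) points are an assumption: they are the dataset from Lindsay Smith's classic PCA tutorial, which is consistent with the means 1.81 and 1.91 quoted above. The later sketches continue from this one.

```python
import numpy as np

# Assumed dataset (Lindsay Smith's PCA tutorial); rows are samples, columns are (x, y).
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

mean = data.mean(axis=0)        # -> [1.81, 1.91]
data_adjust = data - mean       # first row becomes (0.69, 0.49)

print(mean)
print(data_adjust[0])
```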

The second step is to compute the covariance matrix of the features. If the data were 3-dimensional, the covariance matrix would be

$$C = \begin{pmatrix} \operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\ \operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\ \operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z) \end{pmatrix}.$$

Here there are only x and y, so the covariance matrix is 2×2.

The diagonal entries are the variances of x and y, and the off-diagonal entries are their covariance. A covariance greater than 0 means that when one of x and y increases the other tends to increase as well; less than 0 means that when one increases the other tends to decrease; and a covariance of 0 means the two are uncorrelated. The larger the absolute value of the covariance, the stronger the influence of one variable on the other; the smaller, the weaker.
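Continuing the sketch above, the covariance matrix can be computed as follows (np.cov divides by m - 1, matching the convention mentioned later in this article; the expected values shown hold for the assumed dataset).

```python
# Step 2: covariance matrix of the two features (columns of data_adjust).
cov = np.cov(data_adjust, rowvar=False)
print(cov)
# Approximately, for the assumed dataset:
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]
```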

In the third step , the eigenvalues and eigenvectors of the covariance matrix are computed.

This yields two eigenvalues, each with a corresponding eigenvector; the smaller eigenvalue here is 0.0490833989. The eigenvectors are normalized to unit length.
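A sketch of this step, continuing from the previous ones (np.linalg.eigh is appropriate because the covariance matrix is symmetric; the sign of each eigenvector is arbitrary):

```python
# Step 3: eigendecomposition of the symmetric covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalue order
print(eigenvalues)    # ~ [0.0490834, 1.28402771] for the assumed dataset
print(eigenvectors)   # column j is the unit eigenvector for eigenvalues[j]
```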

In the fourth step , the eigenvalues are sorted from largest to smallest, the largest k are selected, and the corresponding k eigenvectors are used as column vectors to form an eigenvector matrix.

Here there are only two eigenvalues, so we choose the larger one, 1.28402771, and take its corresponding eigenvector.

In the fifth step , the sample points are projected onto the selected eigenvectors. Suppose the number of samples is m and the number of features is n; the mean-subtracted sample matrix is DataAdjust (m×n), the covariance matrix is n×n, and the matrix whose columns are the selected k eigenvectors is EigenVectors (n×k). Then the projected data FinalData is

FinalData(m×k) = DataAdjust(m×n) × EigenVectors(n×k)

In this example,

FinalData(10×1) = DataAdjust(10×2) × EigenVectors(2×1)

The result is a single column of 10 projected values.

In this way, the original n-dimensional features of the samples are reduced to k dimensions, which are simply the projections of the original features onto those k directions.

In the example above, this amounts to fusing the "learn" and "study" features into a single new feature, call it the LS feature, which largely represents both of the original characteristics.
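Continuing the earlier sketches, steps four and five look like this in code:

```python
# Steps 4-5: keep the k eigenvectors with the largest eigenvalues and project.
order = np.argsort(eigenvalues)[::-1]          # indices from largest to smallest
k = 1
eigen_matrix = eigenvectors[:, order[:k]]      # EigenVectors, shape (2, 1)

final_data = data_adjust @ eigen_matrix        # FinalData, shape (10, 1)
print(final_data)                              # the 1-D "LS feature" of the 10 samples
```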

The procedure above is illustrated by the following figure:

The plus signs are the preprocessed sample points, the two diagonal lines are the orthogonal eigenvectors (the covariance matrix is symmetric, so its eigenvectors are orthogonal), and the matrix multiplication in the last step is simply the projection of the original sample points onto the axes defined by the eigenvectors.

If k = 2 is taken, the result is:

This is the sample data after PCA processing; the horizontal axis (the LS feature in the example above) can basically represent all of the sample points. The whole process looks like a rotation of the coordinate system. Of course, two dimensions can be drawn on paper, while higher dimensions cannot. If k = 1, only the horizontal axis here is kept, and the points on it are the projections of all the sample points onto that axis.

This essentially completes the PCA procedure. Note that after subtracting the mean in the first step there should be one more step that normalizes the variance of each feature. For example, one feature might be a car's top speed (0 to 100) and another the number of seats (2 to 6); the scale of the second is obviously much smaller than that of the first. If the sample features differ in scale like this, then after the first step we compute the standard deviation of each feature and divide each sample's value of that feature by it.

To summarize, using the notation we are familiar with, the steps before computing the covariance matrix are:

1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.
3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \bigl(x_j^{(i)}\bigr)^2$.
4. Replace each $x_j^{(i)}$ with $x_j^{(i)}/\sigma_j$.

Here $x^{(i)}$ is a sample (there are m in total), each with n features, i.e. an n-dimensional vector; $x_j^{(i)}$ is the j-th feature of the i-th sample; $\mu$ is the sample mean; and $\sigma_j$ is the standard deviation of the j-th feature.
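A brief sketch of this preprocessing in code, continuing from the data array defined earlier (np.std divides by m, matching step 3 above):

```python
# Center each feature, then scale it to unit variance.
def preprocess(X):
    X = X - X.mean(axis=0)      # steps 1-2: subtract the per-feature mean
    sigma = X.std(axis=0)       # step 3: per-feature standard deviation (1/m convention)
    return X / sigma            # step 4: divide each feature by its standard deviation

X_norm = preprocess(data)
print(X_norm.mean(axis=0))      # ~ [0, 0]
print(X_norm.var(axis=0))       # ~ [1, 1]
```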

The whole PCA process seems simple: find the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But isn't it a little surprising that the eigenvectors of the covariance matrix are exactly the ideal k-dimensional directions? What is the hidden meaning behind this? What is the significance of PCA as a whole?

3. Theoretical Basis of PCA

To explain why the eigenvectors of the covariance matrix are the ideal k-dimensional features, I have seen three lines of argument: the maximum variance theory, the minimum error theory, and the coordinate-axis correlation theory. The first two are briefly discussed here; the last is summarized briefly when discussing the significance of PCA.

3.1 Maximum Variance Theory

In signal processing it is assumed that the signal has large variance and the noise has small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is, the better. In the figure above, the projections onto the horizontal axis have large variance while the projections onto the vertical axis have small variance, so the variation along the vertical axis can be attributed to noise.

Therefore, we take the best k-dimensional features to be those obtained by mapping the n-dimensional sample points onto k dimensions such that the sample variance along each of those dimensions is as large as possible.

For example, consider the following 5 sample points (already preprocessed, so the mean is 0 and each feature has unit variance):

We project the samples onto a one-dimensional subspace, represented by a line through the origin (the preprocessing step essentially moves the origin to the center of the sample points).

Suppose we choose two different lines for the projection; of the two shown on the left and right, which is better? According to the variance-maximization criterion above, the one on the left is better, because the variance of the sample points after projection is larger there.

Here, let's explain the concept of projection:

The red points are the samples $x^{(i)}$, and the blue points are their projections onto $u$; $u$ gives the direction of the line and is a unit vector. Each blue point is the projection of $x^{(i)}$ onto $u$, and its (signed) distance from the origin is $\langle x^{(i)}, u\rangle$ (that is, $x^{(i)\mathsf T}u$ or $u^{\mathsf T}x^{(i)}$). Since every feature of these sample points has mean 0, the mean of the projections onto $u$ (each of which is just a single distance to the origin) is still 0.

Returning to the left panel of the figure above, we seek the best $u$ such that the variance of the sample points after projection is maximized.

Since the mean of the projections is 0, the variance of the projections is

$$\frac{1}{m}\sum_{i=1}^{m}\bigl(x^{(i)\mathsf T}u\bigr)^{2}
= \frac{1}{m}\sum_{i=1}^{m}u^{\mathsf T}x^{(i)}x^{(i)\mathsf T}u
= u^{\mathsf T}\Bigl(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)\mathsf T}\Bigr)u.$$

The part in parentheses should look familiar: it is exactly the covariance matrix of the sample features (the means are 0; the usual sample covariance divides by m - 1, whereas here we divide by m).

Denoting the projected variance by $\lambda$ and the covariance matrix by $\Sigma$, the expression above can be written as

$$\lambda = u^{\mathsf T}\Sigma u.$$

Since $u$ is a unit vector, $u^{\mathsf T}u = 1$. Left-multiplying both sides of the equation above by $u$ gives

$$\lambda u = u\,u^{\mathsf T}\Sigma u,$$

that is,

$$\Sigma u = \lambda u.$$

So $\lambda$ is an eigenvalue of $\Sigma$ and $u$ is the corresponding eigenvector. The best projection line is the eigenvector corresponding to the largest eigenvalue, followed by the eigenvector corresponding to the second-largest eigenvalue, and so on.
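For completeness, the same conclusion follows from the standard constrained-optimization argument, sketched here in the same notation: maximize the projected variance subject to $u$ being a unit vector,

$$\max_{u}\; u^{\mathsf T}\Sigma u \quad \text{subject to} \quad u^{\mathsf T}u = 1.$$

The Lagrangian is

$$\mathcal{L}(u,\lambda) = u^{\mathsf T}\Sigma u - \lambda\bigl(u^{\mathsf T}u - 1\bigr),\qquad
\nabla_{u}\mathcal{L} = 2\Sigma u - 2\lambda u = 0 \;\Longrightarrow\; \Sigma u = \lambda u,$$

and at such a stationary point the objective equals $u^{\mathsf T}\Sigma u = \lambda\,u^{\mathsf T}u = \lambda$, so the variance is maximized by choosing the eigenvector with the largest eigenvalue.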

Therefore, we only need to perform an eigendecomposition of the covariance matrix; the eigenvectors corresponding to the k largest eigenvalues are the desired k-dimensional new features, and they are mutually orthogonal. Once the first k vectors $u_1,\dots,u_k$ are obtained, a new sample point is produced by the following transformation:

$$y^{(i)} = \begin{pmatrix} u_1^{\mathsf T}x^{(i)} \\ u_2^{\mathsf T}x^{(i)} \\ \vdots \\ u_k^{\mathsf T}x^{(i)} \end{pmatrix} \in \mathbb{R}^{k}.$$

The j-th dimension of $y^{(i)}$ is the projection of $x^{(i)}$ onto $u_j$.

By selecting the $u$'s with the largest eigenvalues, the directions with small variance (such as noise) are discarded.
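As a numeric sanity check of the maximum-variance view, continuing from the earlier sketches, the variance of the projections onto the top eigenvector should be at least as large as the variance of the projections onto any other unit direction:

```python
# Projected variance onto the top eigenvector vs. random unit directions.
u_best = eigenvectors[:, np.argmax(eigenvalues)]

def projected_variance(u):
    return np.var(data_adjust @ u)   # projections already have mean 0

rng = np.random.RandomState(1)
best = projected_variance(u_best)    # ~ 1.1556 (= 1.28402771 * 9/10, since np.var divides by m)
print(best)
for _ in range(3):
    u = rng.randn(2)
    u /= np.linalg.norm(u)           # random unit direction
    print(projected_variance(u), "<=", best)
```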

This is one explanation of PCA; a second explanation, based on minimizing reconstruction error, is given at http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html.
