PCA - Principal Component Analysis


Problem

1. Suppose we have a sample of cars, each with a maximum-speed feature measured in kilometers per hour and another maximum-speed feature measured in miles per hour. These two features are obviously redundant.

2. Suppose we have the final exam results of a math department's undergraduates, with three columns: interest in mathematics, time spent reviewing, and exam score. We know that learning math well requires strong interest, so the second column is strongly correlated with the first, and the third is strongly correlated with the second. Can the first and second columns be merged into one?

3. Suppose we have samples with a great many features but very few samples, so fitting a regression directly is difficult and prone to overfitting. For example, housing prices in Beijing: suppose the features of a house are (size, location, orientation, whether it is in a school district, construction year, whether it is second-hand, number of floors, which floor it is on), but we have fewer than 10 sample houses. Fitting a mapping from so many house features to price will overfit.

4. This is similar to the second example. Suppose that in information retrieval we have built a document-term matrix containing the two terms "learn" and "study". In the traditional vector space model the two are treated as independent, but semantically they are similar and their frequencies are similar as well. Can they be merged into a single feature?

5. During signal transmission, because the channel is not ideal, the signal received at the other end of the channel is disturbed by noise. How can the noise be filtered out?

These problems all call for reducing the number of features. Two cases arise:

    • Removing features unrelated to the label: for example, a student's name is irrelevant to his score; this can be done with the mutual information method (a short sketch follows this list).
    • Removing features that are related to the class label but are noisy or redundant. In this case a dimensionality-reduction method is needed to reduce the number of features, reduce noise and redundancy, and reduce the likelihood of overfitting.
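
As a minimal sketch of the mutual-information idea in the first point, scikit-learn's mutual_info_classif scores each feature against the class label; features whose score is near zero can be dropped. The data below is made up for illustration, loosely following the exam example.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    # Made-up data: interest in math, review time (correlated with interest),
    # and a student-ID column that is unrelated to the label.
    rng = np.random.default_rng(0)
    interest = rng.integers(1, 6, size=200).astype(float)
    review_time = 2.0 * interest + rng.normal(0.0, 1.0, size=200)
    student_id = rng.permutation(200).astype(float)
    X = np.column_stack([interest, review_time, student_id])
    y = (interest + rng.normal(0.0, 0.5, size=200) > 3.0).astype(int)  # pass/fail label

    scores = mutual_info_classif(X, y, random_state=0)
    print(dict(zip(["interest", "review_time", "student_id"], scores.round(3))))
    # Features with mutual information close to 0 (here student_id) can be removed.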
The idea of PCA

PCA maps the n-dimensional features to k dimensions (k < n), which are completely new orthogonal features. These k dimensions are called the principal components; they are k newly constructed features, not simply the original n-dimensional features with n - k of them removed.

PCA can be justified from several angles: maximum variance theory, least squared error theory, and coordinate-axis correlation theory.
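
For a quick illustration of this n-to-k mapping, here is a minimal sketch using scikit-learn's PCA; the data is made up (5 correlated features reduced to k = 2 principal components).

    import numpy as np
    from sklearn.decomposition import PCA

    # Made-up data: 100 samples with n = 5 correlated features
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(100, 2))
    X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

    pca = PCA(n_components=2)        # map n = 5 features to k = 2 principal components
    Z = pca.fit_transform(X)         # projected data, shape (100, 2)
    print(Z.shape, pca.explained_variance_ratio_.round(3))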

PCA Calculation Process

Suppose we have 2-dimensional data like this:

Each row represents a sample and each column a feature; there are 10 samples, each with two features, x and y.

The first step is to compute the mean of x and the mean of y, and subtract the corresponding mean from every example.

Here the mean of x is 1.81 and the mean of y is 1.91; subtracting them gives the mean-centered data.

If the features differ greatly in variance, the features also need to be scaled to unit variance (details omitted here): compute the standard deviation σ of each feature and divide every sample's value of that feature by σ.
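
A minimal numpy sketch of this first step. The 10 × 2 data array below is assumed to be the standard example dataset the text refers to; its column means are exactly the 1.81 and 1.91 quoted above.

    import numpy as np

    # Assumed example data: 10 samples, 2 features (x, y); column means are 1.81 and 1.91
    data = np.array([
        [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
    ])

    mean = data.mean(axis=0)           # array([1.81, 1.91])
    data_adjust = data - mean          # subtract the mean from every sample

    # Optional, if feature variances differ a lot: also divide by the standard deviation
    # data_adjust = data_adjust / data.std(axis=0, ddof=1)
    print(mean, data_adjust[:3])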

The second step is to compute the covariance matrix of the features.

If the data is 3-dimensional, the covariance matrix is

C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}

Here there are only x and y, so

C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) \end{pmatrix}

Note: the entries on the diagonal are the variances of x and y respectively, and the off-diagonal entries are the covariance of x and y.

• When the covariance is > 0, x and y tend to increase together;

• When the covariance is < 0, one increases while the other decreases;

• When the covariance is = 0, the two are independent (uncorrelated);

• The larger the absolute value of the covariance, the more the two influence each other; the smaller it is, the less they do.

The covariance matrix is then computed from the mean-centered data.
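
A numpy sketch of this step, recreating the mean-centered DataAdjust from the previous sketch (np.cov with rowvar=False treats columns as features and uses the unbiased 1/(m - 1) estimate):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = data - data.mean(axis=0)

    # n x n covariance matrix of the features
    cov = np.cov(data_adjust, rowvar=False)
    # Equivalent: data_adjust.T @ data_adjust / (len(data_adjust) - 1)
    print(cov)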

  

In the third step, the eigenvalues and eigenvectors of the covariance matrix are computed.

The eigenvectors here are normalized to unit length.

In the fourth step, the eigenvalues are sorted from large to small, the largest k are selected, and the corresponding k eigenvectors are used as column vectors to form an eigenvector matrix.
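
A numpy sketch of steps 3 and 4 (np.linalg.eigh is suitable for the symmetric covariance matrix and returns unit-length eigenvectors):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = data - data.mean(axis=0)
    cov = np.cov(data_adjust, rowvar=False)

    # Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: sort eigenvalues from large to small and keep the top k eigenvectors as columns
    order = np.argsort(eigvals)[::-1]
    k = 1
    eigen_vectors = eigvecs[:, order[:k]]        # n x k matrix
    print(eigvals[order].round(4), eigen_vectors.round(4))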

In the fifth step, the sample points are projected onto the selected eigenvectors.

Suppose the number of samples is m and the number of features is n. The mean-centered sample matrix is DataAdjust (m × n), the covariance matrix is n × n, and the matrix of the selected k eigenvectors is EigenVectors (n × k).

Then the projected data is

FinalData (m × k) = DataAdjust (m × n) × EigenVectors (n × k)

In this way, the original n-dimensional features of the samples are reduced to k dimensions, which are the projections of the original features onto the k selected directions.

In this example we take k = 1 and obtain the projected one-dimensional data.
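
A numpy sketch of the projection step with k = 1, putting the previous pieces together:

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = data - data.mean(axis=0)                     # m x n
    eigvals, eigvecs = np.linalg.eigh(np.cov(data_adjust, rowvar=False))
    order = np.argsort(eigvals)[::-1]

    k = 1
    eigen_vectors = eigvecs[:, order[:k]]                      # n x k

    # Step 5: FinalData (m x k) = DataAdjust (m x n) x EigenVectors (n x k)
    final_data = data_adjust @ eigen_vectors
    print(final_data.ravel().round(3))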

The theoretical basis of PCA

Theory of maximum variance

In signal processing it is generally assumed that the signal has a large variance and the noise has a small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is, the better.

As the figure shows, the projections of the samples onto the horizontal axis have a large variance, while the projections onto the vertical axis have a small variance, so the variation along the vertical axis is regarded as noise. We therefore consider the best k-dimensional representation to be the one that converts the n-dimensional sample points into k dimensions such that the sample variance along each of the k dimensions is as large as possible.

For the 5 sample points, suppose we choose between two different lines for the projection. By the maximum variance theory, the line on the left is better, because the variance of the sample points after projecting onto it is larger.

Projection

1) The red dots are the sample points x^{(i)}.

2) The blue points are their projections onto u; the distance of a projection from the origin is {x^{(i)}}^T u (that is, the inner product \langle x^{(i)}, u \rangle).

3) u gives the direction of the line (it is the line's direction vector) and is a unit vector.

4) Since the mean of every feature of the sample points is 0 (after mean-centering), the mean of the sample points projected onto u is also 0.

The best projection vector u is the one that maximizes the variance of the sample points after projection.

Since the mean is known to be 0, the variance after projection is

\frac{1}{m} \sum_{i=1}^{m} \left( {x^{(i)}}^T u \right)^2 = u^T \left( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} {x^{(i)}}^T \right) u = u^T \Sigma u,

where \Sigma is the covariance matrix of the mean-centered samples. Maximizing u^T \Sigma u subject to u^T u = 1 (for example with a Lagrange multiplier \lambda) gives \Sigma u = \lambda u, and the maximized variance equals \lambda. Therefore \lambda is an eigenvalue of \Sigma and u is the corresponding eigenvector. The best projection line is the eigenvector corresponding to the largest eigenvalue \lambda. We only need to perform an eigenvalue decomposition of the covariance matrix; the eigenvectors corresponding to the k largest eigenvalues are the best k new features, and these k directions are orthogonal.

The new sample obtained is y^{(i)} = \left( u_1^T x^{(i)}, u_2^T x^{(i)}, \ldots, u_k^T x^{(i)} \right)^T, whose j-th component is the projection of x^{(i)} onto u_j.

By selecting the u's corresponding to the k largest eigenvalues, the directions with small variance (such as noise) are discarded.
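
A small numerical check of this result on made-up 2-D data: the projected variance u^T \Sigma u over a grid of unit directions peaks at the top eigenvector of \Sigma, and the maximum equals the largest eigenvalue.

    import numpy as np

    # Made-up mean-centered 2-D data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    X = X - X.mean(axis=0)

    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    u_best = eigvecs[:, np.argmax(eigvals)]

    # Projected variance u^T Sigma u for many unit directions u
    angles = np.linspace(0.0, np.pi, 720)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    variances = np.einsum("ij,jk,ik->i", dirs, cov, dirs)

    # The grid maximum agrees (up to the angular grid) with the top eigenvalue
    print(variances.max().round(4), (u_best @ cov @ u_best).round(4), eigvals.max().round(4))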

The theory of least square error
