Principal Components Analysis: Maximum Variance Interpretation


The topic preceding this one is factor analysis. Its theory is involved enough that I plan to finish the whole course before writing it up. Before writing this article I read up on PCA, SVD, and LDA. These models are closely related, yet each has its own character. This article introduces PCA first; the relationships among them can only be appreciated by studying and understanding each in turn. PCA is also known as principal factor analysis.

1. Problem

Real-world training data always comes with problems of various kinds:

1. For example, in a sample of cars there may be both a maximum-speed feature measured in kilometers/hour and a maximum-speed feature measured in miles/hour. Obviously, these two features are redundant.

2. Suppose we obtain the final-exam transcript of undergraduates in a mathematics department. The transcript has three columns: degree of interest in mathematics, review time, and exam score. We know that learning mathematics well requires strong interest, so the second column is strongly correlated with the first, and the third is strongly correlated with the second. Can the first and second columns be merged?

3. Sometimes we obtain samples with many features but very few examples. Fitting such data directly with regression is difficult and prone to over-fitting. Take house prices in Beijing: suppose the features of a house are (size, location, orientation, whether it is in a school district, construction year, whether it is second-hand, number of floors, which floor it is on), yet we have fewer than 10 houses. Fitting a mapping from that many house features to the price will lead to over-fitting.

4. This is somewhat similar to the second case. Suppose the document-term matrix built in IR contains two terms, "learn" and "study". In the traditional vector space model the two are treated as independent. Semantically, however, they are similar, and their frequencies of occurrence are similar as well. Can they be combined into a single feature?

5. During signal transmission, because the channel is not ideal, the signal received at the other end is disturbed by noise. How can the noise be filtered out?

Recall the feature selection introduced earlier in the article on model selection and regularization. The features removed there, however, were mainly those irrelevant to the class label. For example, a student's name has nothing to do with his score, and such features can be identified with the mutual-information method.

Many of the features here are related to the class label, but they contain noise or redundancy. In such cases we need a feature dimensionality-reduction method that reduces the number of features, reduces noise and redundancy, and lowers the risk of over-fitting.

The following introduces principal component analysis (PCA) as a way to address some of the problems above. The idea of PCA is to map the n-dimensional features onto k dimensions (k < n), which form a new set of orthogonal features. These k dimensions are called the principal components; they are k newly constructed features, not simply the original n-dimensional features with the remaining n-k dimensions dropped.

2. PCA calculation process

First, we will introduce the calculation process of PCA:

Assume that the obtained 2D data is as follows:

Rows represent samples and columns represent features. There are 10 samples, each with two features. You can imagine 10 documents, with x the TF-IDF of "learn" in each document and y the TF-IDF of "study"; or 10 cars, with x the speed in kilometers/hour and y the speed in miles/hour, and so on.

Step 1: Compute the means of x and y, and subtract the corresponding mean from every sample. Here the mean of x is 1.81 and the mean of y is 1.91; after mean subtraction, the first sample becomes (0.69, 0.49).
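As a concrete sketch of this step in NumPy: the original data table is not reproduced above, so the values below are an assumption chosen to match the quoted means (1.81, 1.91) and the quoted first mean-subtracted sample (0.69, 0.49).

    import numpy as np

    # Assumed example data; the columns are the features x and y.
    # The column means are 1.81 and 1.91, as quoted in the text.
    X = np.array([
        [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
    ])

    mean = X.mean(axis=0)       # [1.81, 1.91]
    data_adjust = X - mean      # first row becomes [0.69, 0.49]
    print(mean, data_adjust[0])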

Step 2: Compute the covariance matrix of the features. If the data were 3-dimensional, the covariance matrix would be

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$

Here we have only x and y, so the covariance matrix is 2 × 2.

The diagonal entries are the variances of x and y, and the off-diagonal entries are the covariances. A covariance greater than 0 means that when one of x and y increases, the other tends to increase as well; a covariance less than 0 means that when one increases the other tends to decrease; a covariance of 0 means the two are uncorrelated. The larger the absolute value of the covariance, the stronger the influence of one variable on the other, and the smaller it is, the weaker the influence.

Step 3: Compute the eigenvalues and eigenvectors of the covariance matrix.

This yields two eigenvalues and, for each, a corresponding eigenvector; for instance, the eigenvalue 0.0490833989 has its own eigenvector. The eigenvectors are normalized to unit length.
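A minimal NumPy sketch of steps 2 and 3, still assuming the dataset above. Note that np.cov divides by m - 1, which is what reproduces the eigenvalues quoted in the text.

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = X - X.mean(axis=0)

    # Step 2: covariance matrix of the two features (rowvar=False: columns are features).
    cov = np.cov(data_adjust, rowvar=False)

    # Step 3: eigenvalues and unit eigenvectors of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    print(eigvals)   # approximately [0.0490834, 1.2840277]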

Step 4: Sort the eigenvalues from largest to smallest, select the k largest, and use the corresponding k eigenvectors as column vectors to form the eigenvector matrix.

Here there are only two eigenvalues. We select the larger one, 1.28402771, together with its corresponding unit eigenvector.

Step 5: Project the sample points onto the selected eigenvectors. Suppose the number of samples is m and the number of features is n. After mean subtraction the sample matrix DataAdjust is m × n, the covariance matrix is n × n, and the matrix whose columns are the selected k eigenvectors is EigenVectors (n × k). The projected data is

$$\text{FinalData}(m \times k) = \text{DataAdjust}(m \times n) \times \text{EigenVectors}(n \times k)$$

In this example:

FinalData (10 × 1) = DataAdjust (10 × 2) × EigenVectors (2 × 1)

The result is a 10 × 1 column of projected values.

In this way, the original n-dimensional features of the samples are reduced to k dimensions, namely the projections of the original features onto the k chosen directions.
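Putting steps 1 through 5 together as one runnable sketch (same assumed dataset as before; with k = 1 the result is the single 10 × 1 column described above):

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    data_adjust = X - X.mean(axis=0)           # step 1: subtract the mean
    cov = np.cov(data_adjust, rowvar=False)    # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 3: eigen-decomposition

    k = 1
    order = np.argsort(eigvals)[::-1]          # step 4: sort eigenvalues, largest first
    eigen_vectors = eigvecs[:, order[:k]]      # n x k matrix of the top-k eigenvectors

    final_data = data_adjust @ eigen_vectors   # step 5: (10 x 2) @ (2 x 1) -> (10 x 1)
    print(final_data.ravel())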

The data above can be viewed as combining the learn and study features into a single new feature, call it the LS feature, which essentially represents both of the original features.

The preceding process is illustrated in the following figure:

The plus signs are the mean-subtracted sample points, and the two slanted lines are the orthogonal eigenvectors (because the covariance matrix is symmetric, its eigenvectors are orthogonal). The matrix multiplication in the last step projects the original sample points onto the axes defined by the eigenvectors.

If k is 2, the result is:

This is the sample data after PCA. The horizontal axis (the LS feature above) essentially represents all the sample points, and the whole process looks like a rotation of the coordinate system. Of course, two-dimensional data can be drawn like this; high-dimensional data cannot. If k = 1, only the horizontal axis remains, and every point is projected onto it.
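If scikit-learn happens to be available, the same reduction can be cross-checked against its PCA implementation. It centers the data internally and computes the components via SVD, so on the assumed dataset its one-dimensional projection should agree with the hand-rolled result up to an overall sign.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    # n_components=1 keeps only the direction of largest variance.
    final_data = PCA(n_components=1).fit_transform(X)
    print(final_data.ravel())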

With this, the PCA procedure is essentially complete. Actually, after the first step of subtracting the mean there should be one more step: normalizing the variance of each feature. For example, one feature might be a car's speed (0 to 100) and another the number of seats (2 to 6); clearly the variance of the second is much smaller than that of the first. When the sample features differ in scale like this, compute the standard deviation of each feature after the first step and divide each sample's value of that feature by it.
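A small sketch of that extra normalization step, using hypothetical car data (column 0: top speed, column 1: number of seats):

    import numpy as np

    # Hypothetical samples: top speed (large scale) and seat count (small scale).
    cars = np.array([[80.0, 4.0], [95.0, 2.0], [60.0, 5.0], [72.0, 4.0], [88.0, 6.0]])

    centered = cars - cars.mean(axis=0)
    scaled = centered / centered.std(axis=0)   # divide each feature by its standard deviation
    print(scaled.std(axis=0))                  # both features now have unit variance: [1. 1.]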

To sum up, in notation we are already familiar with, the preprocessing before the covariance computation is as follows:

There are m samples in total, each with n features, so each sample $x^{(i)}$ is an n-dimensional vector; $x_j^{(i)}$ is the j-th feature of the i-th sample, $\mu$ is the mean of the samples, and $\sigma_j$ is the standard deviation of the j-th feature.
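Written out explicitly (the original equations were lost with the figures, so the following is the standard formulation consistent with the description above):

$$\begin{aligned}
\mu &= \tfrac{1}{m}\textstyle\sum_{i=1}^{m} x^{(i)} && \text{mean of the samples}\\
x^{(i)} &\leftarrow x^{(i)} - \mu && \text{subtract the mean}\\
\sigma_j^2 &= \tfrac{1}{m}\textstyle\sum_{i=1}^{m} \bigl(x_j^{(i)}\bigr)^2 && \text{variance of feature } j \text{ after centering}\\
x_j^{(i)} &\leftarrow x_j^{(i)} / \sigma_j && \text{normalize each feature to unit variance}
\end{aligned}$$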

The whole PCA procedure seems simple: find the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But isn't it surprising that the eigenvectors of the covariance matrix form the most ideal k-dimensional basis? What is the meaning hidden behind this? What is the significance of PCA as a whole?

3. Theoretical Basis of PCA

To explain why the eigenvectors of the covariance matrix form an ideal k-dimensional basis, I have seen three explanations: maximum variance theory, minimum error theory, and coordinate-axis correlation theory. Here we briefly discuss the first two; the last will be touched on when discussing the significance of PCA.

3.1 Maximum Variance Theory

In signal processing it is assumed that the signal has large variance while the noise has small variance; the signal-to-noise ratio is the ratio of signal variance to noise variance, and the larger it is, the better. In the earlier figure, for example, the variance of the samples projected onto the horizontal axis is large, while the variance of the projections onto the vertical axis is small, so the vertical-axis projections are regarded as being caused by noise.

We therefore consider the best k-dimensional features to be those for which, after the n-dimensional sample points are transformed to k dimensions, the sample variance along each dimension is as large as possible.

For example, suppose there are five sample points (already preprocessed: the mean is 0 and the feature variances are normalized):

We want to project the samples onto one dimension, represented by a straight line through the origin (the preprocessing essentially moves the origin to the center of the sample points).

Suppose we consider two different lines for the projection. Which of the two is better? By the variance-maximization idea above, the line on the left is better, because the variance of the projected sample points is larger; a quick numeric check is sketched below.
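The directions below are hypothetical stand-ins for the two lines in the figure, and the data is the earlier assumed 10-sample example rather than the five points mentioned above; the point is simply that the better direction is the one with the larger projected variance.

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = X - X.mean(axis=0)

    # Two hypothetical unit directions to project onto.
    u1 = np.array([1.0, 1.0]) / np.sqrt(2)    # roughly along the data's long axis
    u2 = np.array([1.0, -1.0]) / np.sqrt(2)   # perpendicular to it

    # Variance of the projected (signed) distances; the larger one wins.
    print((data_adjust @ u1).var(), (data_adjust @ u2).var())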

Here we will first explain the concept of projection:

The red points are the sample points $x^{(i)}$ and the blue points are their projections onto $u$, where $u$ gives the slope of the line: it is the direction vector of the line, and it is a unit vector. A blue point is the projection of a sample onto $u$, and its signed distance from the origin is $x^{(i)T}u$ (equivalently $u^T x^{(i)}$). Because every one-dimensional feature of these sample points has mean 0, the mean of the samples projected onto $u$ (each reduced to a single distance from the origin) is still 0.

Returning to the left-hand plot of the pair above: we want the optimal $u$ that maximizes the variance of the projected sample points.

Because the mean of the projections is 0, the variance is:

$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^2 = \frac{1}{m}\sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u = u^T\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}\right)u$$

The part in the middle (between $u^T$ and $u$) should look familiar: it is exactly the covariance matrix of the sample features (the mean is 0; strictly, the covariance matrix is usually divided by m - 1, but m is used here).

Writing $\lambda$ for the variance and $\Sigma$ for the covariance matrix $\frac{1}{m}\sum_{i=1}^{m} x^{(i)}x^{(i)T}$, the expression above becomes

$$\lambda = u^T \Sigma u$$

 

Because $u$ is a unit vector, $u^T u = 1$. Maximizing $\lambda = u^T \Sigma u$ subject to this constraint (for example, with a Lagrange multiplier $\lambda$) gives the condition

$$\Sigma u = \lambda u$$

We have arrived at it: $\lambda$ is an eigenvalue of $\Sigma$, and $u$ is the corresponding eigenvector. The best projection line is the eigenvector corresponding to the largest eigenvalue; the next best corresponds to the second-largest eigenvalue, and so on.

Therefore, we only need to perform an eigendecomposition of the covariance matrix: the eigenvectors corresponding to the k largest eigenvalues are the best k-dimensional features, and these k directions are mutually orthogonal. Having obtained the first k eigenvectors $u_1, \dots, u_k$, a new sample is obtained by the transformation

$$y^{(i)} = \begin{pmatrix} u_1^T x^{(i)} \\ u_2^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{pmatrix}$$

The j-th dimension of $y^{(i)}$ is the projection of $x^{(i)}$ onto $u_j$.

By keeping only the k directions $u$ with the largest eigenvalues, the feature directions with small variance (such as noise) are discarded.
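A short numeric check of this conclusion on the running example (assumed data as before; the covariance here uses the 1/m divisor from the derivation):

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = X - X.mean(axis=0)

    sigma = data_adjust.T @ data_adjust / len(X)   # covariance matrix with the 1/m divisor
    eigvals, eigvecs = np.linalg.eigh(sigma)
    lam, u = eigvals[-1], eigvecs[:, -1]           # largest eigenvalue and its unit eigenvector

    proj = data_adjust @ u                         # the new 1-D feature u^T x for each sample
    print(np.allclose(proj.var(), lam))            # variance of the projection equals lambda
    print(np.allclose(sigma @ u, lam * u))         # Sigma u = lambda u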

This is one explanation of PCA; the second, minimum squared error, will be introduced in the next article.
