Principal Components Analysis: Least-Square-Error Interpretation

3.2 Least-Square-Error Theory

Suppose we have two-dimensional sample points (the red points in the original figure). Looking back, we previously found the straight line that maximizes the variance of the sample points projected onto it. The essence is to find a good straight line, but maximizing variance is not the only way to measure how good a line is. Recall the linear regression we learned first: its goal is a linear function whose line best fits the sample points. Could we then say that the best line is simply the regression line? No. In regression, the least-squares method measures the distance from a sample point to the line along a coordinate axis: in this problem the feature is x and the label is y, and regression measures the vertical distance d. If we used the regression criterion to pick the best line, we would simply be regressing on the original samples, which has nothing to do with feature selection.

Therefore, we evaluate the quality of a straight line in another way: by the perpendicular distance d' from each point to the line.
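To make the two distance measures concrete, here is a minimal sketch (the line y = 2x + 1 and the point (3, 4) are made-up values) comparing the vertical distance d used in regression with the perpendicular distance d' used here:

```python
import numpy as np

# Hypothetical line y = a*x + b and a sample point (x0, y0).
a, b = 2.0, 1.0
x0, y0 = 3.0, 4.0

# Regression (least squares) measures the vertical distance d:
d_vertical = abs(y0 - (a * x0 + b))          # |4 - 7| = 3.0

# The PCA criterion measures the perpendicular distance d' from the
# point to the line written as a*x - y + b = 0:
d_perp = abs(a * x0 - y0 + b) / np.sqrt(a**2 + 1)   # 3 / sqrt(5)

print(d_vertical, d_perp)
```

The perpendicular distance is never larger than the vertical distance, which is why the two criteria generally select different lines.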

There are now n sample points x_1, …, x_n, each d-dimensional (the symbols used in this section are inconsistent with those above, and their meanings need to be re-read in context). We want the line that minimizes the total squared distance between the sample points and their projections onto the line:

J = \sum_{k=1}^{n} \|\tilde{x}_k - x_k\|^2,

where \tilde{x}_k is the projection of x_k onto the line. This formula is called the least squared error.

To determine a straight line, you only need to determine a point and a direction.

Step 1: determine the point.

Suppose we want a single point x_0 in the space to represent all n sample points. The word "represent" is not yet quantified, so we quantify it: find the d-dimensional point x_0 that minimizes

J_0(x_0) = \sum_{k=1}^{n} \|x_0 - x_k\|^2.

J_0 is the squared-error criterion function. Let m be the mean of the n sample points:

m = \frac{1}{n} \sum_{k=1}^{n} x_k.

The squared error can then be written

J_0(x_0) = \sum_{k=1}^{n} \|(x_0 - m) - (x_k - m)\|^2 = n\|x_0 - m\|^2 + \sum_{k=1}^{n} \|x_k - m\|^2,

since the cross term 2(x_0 - m)^T \sum_k (x_k - m) vanishes by the definition of m. The second term is independent of x_0 and can be regarded as a constant. Therefore J_0 is minimized exactly when

x_0 = m,

the mean of the sample points.
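This claim can be checked numerically; a small sketch with randomly generated data (the sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # n = 50 samples, each 3-dimensional
m = X.mean(axis=0)                    # sample mean

def J0(x0):
    """Squared-error criterion J0(x0) = sum_k ||x0 - x_k||^2."""
    return np.sum((X - x0) ** 2)

# J0 at the mean is no larger than at any randomly chosen candidate.
candidates = rng.normal(size=(100, 3))
assert all(J0(m) <= J0(c) for c in candidates)
print(J0(m))
```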

Step 2: determine the direction.

Draw the required straight line through the point m, and let its direction be the unit vector e. Any point x on the line can then be expressed in terms of m and e as

x = m + a e,

where the scalar a is the signed distance from the point x to m along the line.

Approximating each sample as x_k ≈ m + a_k e, we redefine the least square error:

J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2.

Here k plays the same indexing role as i did above. J_1 is the least-square-error function, and the unknown parameters are the coefficients a_k and the direction e.

We want the values that actually minimize J_1. First, expand the formula:

J_1 = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k e^T (x_k - m) + \sum_{k=1}^{n} \|x_k - m\|^2.

First fix e, treating it as a constant, and set the partial derivative with respect to each a_k to zero (using \|e\| = 1) to obtain

a_k = e^T (x_k - m).

This result means that once e is known, taking the inner product of (x_k - m) with e gives the projection length of x_k onto the line through m in direction e, with no further computation required.
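That a_k = e^T(x_k - m) minimizes the error for a fixed direction can be verified numerically; this sketch uses an arbitrary unit direction and random data (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
m = X.mean(axis=0)
e = np.array([1.0, 1.0]) / np.sqrt(2)   # an arbitrary unit direction

# Optimal coefficients from the derivation: a_k = e^T (x_k - m).
a = (X - m) @ e

def J1(coeffs):
    """J1 = sum_k || (m + a_k e) - x_k ||^2 for the fixed direction e."""
    recon = m + np.outer(coeffs, e)
    return np.sum((recon - X) ** 2)

# Perturbing the coefficients can only increase the error.
a_perturbed = a + rng.normal(scale=0.1, size=a.shape)
assert J1(a) <= J1(a_perturbed)
```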

Next fix the a_k and take the partial derivative with respect to e. Substituting formula (8), a_k = e^T(x_k - m), back in gives

J_1(e) = -\sum_{k=1}^{n} (e^T (x_k - m))^2 + \sum_{k=1}^{n} \|x_k - m\|^2 = -e^T S e + \sum_{k=1}^{n} \|x_k - m\|^2,

where

S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T.

S is similar to the covariance matrix, only the denominator n − 1 is missing; it is called the scatter matrix.

Minimizing J_1(e) is therefore equivalent to maximizing e^T S e. Since e must also satisfy the constraint e^T e = 1, we introduce a Lagrange multiplier λ and seek the maximum (minimum) of

u(e) = e^T S e - \lambda (e^T e - 1).

Taking the partial derivative with respect to e uses a standard technique for differentiating with respect to a vector; the method is not detailed here, but any reference on matrix calculus gives \partial(e^T S e)/\partial e = 2 S e (S is symmetric) and \partial(e^T e)/\partial e = 2 e. Hence

\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e.

Setting the derivative equal to 0:

S e = \lambda e.

Dividing both sides by n − 1 turns S into the covariance matrix, so e is an eigenvector of the covariance matrix; and since e^T S e = λ at any solution, the least square error is minimized by the eigenvector with the largest eigenvalue.
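Putting the derivation together, a numpy sketch (the correlated random data and the mixing matrix are arbitrary) that forms the scatter matrix, checks Se = λe, and confirms that the top eigenvector yields the smaller least-square error:

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D data so that one direction clearly dominates.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
m = X.mean(axis=0)
Xc = X - m

S = Xc.T @ Xc                          # scatter matrix = covariance * (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
e = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

# The eigenvalue equation S e = lambda e holds.
assert np.allclose(S @ e, eigvals[-1] * e)

# J1(e) = sum ||x_k - m||^2 - e^T S e, so the top eigenvector gives the
# smallest least-square error among the eigen-directions.
def J1(direction):
    a = Xc @ direction
    return np.sum((np.outer(a, direction) - Xc) ** 2)

assert J1(e) <= J1(eigvecs[:, 0])
```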

Starting from a different idea, we arrive at the same result: compute the eigenvectors of the covariance matrix, and those eigenvectors become the new coordinate axes. At this point the sample points gather tightly around the new axis, because the least square error we minimized is precisely the scatter of the points around that axis.

4. Theoretical Significance of PCA

PCA reduces n features to k, which compresses the data. If a 100-dimensional vector can be represented in 10 dimensions, the compression rate is 90%. In image processing, the KL transform likewise uses PCA for image compression. PCA, however, must minimize the loss of information caused by the dimensionality reduction. Consider the effect of PCA: after processing, two-dimensional data projected onto one dimension can look like either of the following cases:
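As a rough numerical illustration of compression (the data here are synthetic: 100-dimensional samples constructed so that almost all variance lies in 10 directions), keeping k = 10 of 100 components retains nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(3)
# 100-dimensional samples whose variance lives almost entirely in
# 10 directions, plus a little isotropic noise.
basis = rng.normal(size=(10, 100))
X = rng.normal(size=(500, 10)) @ basis + 0.01 * rng.normal(size=(500, 100))
Xc = X - X.mean(axis=0)

cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

k = 10
retained = eigvals[:k].sum() / eigvals.sum()
print(f"variance retained by {k} of 100 dimensions: {retained:.4f}")
assert retained > 0.99
```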

The left projection is the good one: on the one hand the variance after projection is the largest, on the other hand the sum of squared perpendicular distances from the points to the line is the smallest, and the line passes through the center (mean) of the sample points. Why is the projection on the right relatively poor? Intuitively, because the two coordinate axes there are correlated, removing one axis leaves points that cannot be determined by the single remaining axis.

The k coordinate axes obtained by PCA are in fact k eigenvectors. Because the covariance matrix is symmetric, these k eigenvectors are orthogonal. Consider the following calculation.

Suppose we still use X to represent the samples, with m samples and n features. Let e_i denote the i-th eigenvector and e_{ij} the j-th component of e_i. The eigenvalue equation for the original samples can then be written out term by term; the product of its first two matrices is the covariance matrix (divided by m), the second of them being the original sample matrix A of size m × n.

The preceding equation can be abbreviated as

\frac{1}{m} A^T A \, e_i = \lambda_i e_i.

The projection result is then Y = A E, where E is the n × k matrix whose columns are the k chosen eigenvectors:

The new sample matrix Y gives the projections of the m samples onto the k eigenvectors, each row being a linear combination of those k eigenvectors; E is orthogonal (E^T E = I). From the matrix multiplication we can see that the PCA transformation projects the original n-dimensional sample points onto k orthogonal coordinate axes and discards the information in the remaining dimensions. For example, suppose the universe is n-dimensional (Hawking said 11-dimensional) and we have the coordinates of every star in the galaxy (n-dimensional vectors relative to the galactic center), but we want to approximate these sample points with two-dimensional coordinates. Assume the eigenvectors of the computed covariance matrix turn out to be the horizontal and vertical directions in the figure; that is, we choose x and y axes through the galactic center. Projecting all the stars onto x and y gives the picture below, but we have discarded information such as each star's distance from us.
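The projection Y = AE described above can be sketched as follows (the sizes m = 100, n = 5, k = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))          # m = 100 samples, n = 5 features
Xc = X - X.mean(axis=0)                # center the samples

cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)

k = 2
E = eigvecs[:, ::-1][:, :k]            # n x k matrix of the top-k eigenvectors
Y = Xc @ E                             # new m x k sample matrix

# The k axes are orthonormal, guaranteed by the symmetric covariance matrix.
assert np.allclose(E.T @ E, np.eye(k))
print(Y.shape)
```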

5. Summary and discussion

This part comes from http://www.cad.zju.edu.cn/home/chenlu/pca.htm

One of the major advantages of PCA is dimensionality reduction. We can rank the newly obtained principal components by importance and keep only the most important ones as needed, reducing the dimensionality to simplify the model or compress the data while preserving the original information to the greatest extent possible.

Another major advantage of PCA is that it is entirely parameter-free. The computation requires no manually set parameters and no intervention based on any empirical model; the final result depends only on the data and is independent of the user.
However, this can also be seen as a disadvantage. If you have prior knowledge about the observed objects and already understand some features of the data, but cannot inject that knowledge into the processing through parameters or other means, you may not get the expected results, and efficiency suffers.

Figure 4: the black dots are sample data, arranged in a ring.
It is easy to imagine that the natural principal component of this data is the rotation angle θ.

In the example of figure 4, the principal components identified by PCA are still linear directions in the (x, y) plane. But these are obviously not the best, most parsimonious principal components, since the coordinates are related non-linearly. With prior knowledge we can see that the rotation angle θ is the optimal principal component (much as in polar coordinates). In such cases plain PCA fails. However, if we first use prior knowledge to apply a non-linear transformation that maps the data into a space where it is linear, PCA works again. This class of non-linear transformations based on prior knowledge is called kernel PCA; it extends the range of problems PCA can handle, can be combined with prior constraints, and is a popular method.
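A minimal kernel-PCA sketch in plain numpy, using ring-shaped data like figure 4; the RBF kernel and the value of gamma are illustrative assumptions, not from the original:

```python
import numpy as np

rng = np.random.default_rng(5)
# Ring-shaped data: the natural principal component is the angle theta.
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))

# RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
gamma = 2.0
sq = np.sum(X**2, axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# Center the kernel matrix in feature space.
n = len(X)
one = np.full((n, n), 1.0 / n)
Kc = K - one @ K - K @ one + one @ K @ one

eigvals, eigvecs = np.linalg.eigh(Kc)
# Projections onto the top-2 kernel principal components.
Y = eigvecs[:, -2:] * np.sqrt(np.maximum(eigvals[-2:], 0))
print(Y.shape)
```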

Sometimes the data distribution is not Gaussian. As figure 5 shows, for non-Gaussian data the principal components found by PCA may not be optimal: variance can no longer serve as the measure of importance when searching for components. Instead, one chooses, according to the data distribution, a statistic that describes the full distribution, and then uses the probability distribution to compute the dependence between the data along two candidate axes. Equivalently, while keeping the orthogonality assumption between components, the components sought must also be statistically independent. This class of methods is called independent component analysis (ICA).

Figure 5: the data distribution is not Gaussian, but forms a cross shape.
In this case the direction of maximum variance is not the direction of the optimal principal component.

In addition, PCA can also be used to predict the missing elements in a matrix.

6. Other References

A Tutorial on Principal Components Analysis, Lindsay I. Smith, 2002

A Tutorial on Principal Component Analysis, J. Shlens

http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/PcaMissingData.pdf

Http://www.cad.zju.edu.cn/home/chenlu/pca.htm
