"Reprint" principal component Analysis (Principal)-Minimum squared error interpretation

Source: Internet
Author: User

Principal Component Analysis (PCA) - Minimum Squared Error Interpretation (continued from the previous article)

3.2 Minimum squared error theory

Suppose we have two-dimensional sample points (the red dots in the original figure) and we are looking for a line such that the variance of the sample points projected onto it is maximized. The essence of the problem is to find the best straight line, and maximizing the projected variance is not the only way to measure how good a line is. Recall that linear regression also looks for a line that fits the sample points best, so could the best line here simply be the regression line? Not quite. In least-squares regression one variable is the feature x and the other is the label y, and the criterion measures the distance from each sample point to the line along the y-axis, i.e. the vertical residual d. If we used the regression criterion to pick the best line, we would simply be regressing on the original samples, which has nothing to do with feature selection.

Therefore, we evaluate the line with a different criterion: the perpendicular distance d' from each point to the line.
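To make the difference concrete, here is a small numerical sketch (my own illustration with made-up data, not from the original article): for a fitted line y = w*x + b, the regression criterion sums the squared vertical residuals d, while the criterion used below sums the squared perpendicular distances d'.

    import numpy as np

    # Hypothetical data: y roughly linear in x, plus noise.
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 0.8 * x + rng.normal(scale=0.3, size=50)

    w, b = np.polyfit(x, y, deg=1)                  # regression line y = w*x + b

    d = y - (w * x + b)                             # vertical residuals (regression)
    d_perp = (w * x - y + b) / np.sqrt(w**2 + 1)    # perpendicular distances to the line

    print("sum of d^2 :", np.sum(d**2))
    print("sum of d'^2:", np.sum(d_perp**2))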

There are now n sample points x_1, \dots, x_n, each of which is d-dimensional (the notation in this section is not entirely consistent with the previous one, so re-read the symbol definitions). Denote the projection of x_k onto the line by x_k'. We then want to minimize

J = \sum_{k=1}^{n} \| x_k' - x_k \|^2

This criterion is called the minimum squared error (least squared error).

To determine a straight line, it suffices to determine one point on it and its direction.

The first step is to determine the point:

Suppose we look for a single point x_0 in space to "represent" these n sample points. "Represent" is not yet quantified, so we quantify it: we look for the d-dimensional point x_0 that minimizes

J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2

This is the squared-error criterion function. Let m denote the mean of the n sample points:

m = \frac{1}{n} \sum_{k=1}^{n} x_k

Then the squared error can be rewritten as

J_0(x_0) = \sum_{k=1}^{n} \| (x_0 - m) - (x_k - m) \|^2
         = n \| x_0 - m \|^2 - 2 (x_0 - m)^T \sum_{k=1}^{n} (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2
         = n \| x_0 - m \|^2 + \sum_{k=1}^{n} \| x_k - m \|^2

The cross term vanishes because \sum_{k=1}^{n} (x_k - m) = 0, and the last term does not depend on x_0, so it is a constant. Therefore J_0(x_0) is minimized exactly when

x_0 = m

that is, when x_0 is the sample mean.
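A minimal numerical check of this step (my own sketch with assumed random data): evaluating J_0 at the sample mean and at nearby random points shows that the mean gives the smallest value.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))            # n = 100 samples, 3-dimensional

    def J0(x0):
        # Sum of squared distances from candidate point x0 to all samples.
        return np.sum(np.linalg.norm(X - x0, axis=1) ** 2)

    m = X.mean(axis=0)
    print("J0 at the mean      :", J0(m))
    for _ in range(5):                       # any other point does worse
        x0 = m + rng.normal(scale=0.5, size=3)
        print("J0 at a random point:", J0(x0))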

The second step is to determine the direction:

Now draw the line through the point just found (the line is required to pass through the point m), and let the direction of the line be the unit vector e. Any point x on the line can then be expressed in terms of m and e:

x = m + a e

where a is the (signed) distance from x to the point m.

We now restate the minimum squared error. Writing the projection of x_k onto the line as m + a_k e, the criterion becomes

J_1(a_1, \dots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2

The index k here plays the same role as the index i used earlier. J_1 is the minimum squared error function, and its unknown parameters are the coefficients a_k and the direction e.

This J_1 is exactly the minimum we are looking for. First, expand the expression:

J_1 = \sum_{k=1}^{n} \| a_k e - (x_k - m) \|^2
    = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k \, e^T (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2

We first fix e and treat it as a constant; taking the derivative of J_1 with respect to a_k and setting it to zero (using \|e\| = 1) gives

a_k = e^T (x_k - m)

This result means that once e is known, taking the inner product of x_k - m with e gives the distance from m of the projection of x_k onto e. However, e itself is still unknown.
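As a quick check (a sketch with assumed random data, not the article's own code), the closed-form coefficient a_k = e^T (x_k - m) matches what a brute-force one-dimensional search finds for a fixed direction e:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 4))
    m = X.mean(axis=0)
    e = rng.normal(size=4)
    e /= np.linalg.norm(e)                   # fixed unit direction

    a = (X - m) @ e                          # closed-form a_k = e^T (x_k - m)

    # Brute-force check for one sample: scan many candidate coefficients.
    k = 0
    cands = np.linspace(a[k] - 2, a[k] + 2, 4001)
    errs = np.linalg.norm(m + np.outer(cands, e) - X[k], axis=1) ** 2
    print("closed-form a_k :", a[k])
    print("brute-force a_k :", cands[np.argmin(errs)])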

Next we fix the a_k and take the partial derivative with respect to e. Substituting the expression for a_k just obtained back into J_1 gives

J_1(e) = -\sum_{k=1}^{n} \left( e^T (x_k - m) \right)^2 + \sum_{k=1}^{n} \| x_k - m \|^2 = -\, e^T S e + \sum_{k=1}^{n} \| x_k - m \|^2

where

S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T

S is the same as the covariance matrix except that it lacks the denominator n-1; it is called the scatter matrix.

Minimizing J_1(e) is therefore equivalent to maximizing e^T S e, but e must first satisfy \|e\| = e^T e = 1. Introducing a Lagrange multiplier \lambda, we maximize (equivalently, minimize J_1)

u(e) = e^T S e - \lambda (e^T e - 1)

Taking the partial derivative with respect to e,

\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e

Vector and matrix differentiation techniques are used here and are not introduced in detail; see any reference on matrix calculus, where the derivatives can be read off as \partial (e^T S e) / \partial e = 2 S e and \partial (e^T e) / \partial e = 2 e.

Setting the derivative equal to zero gives

S e = \lambda e

Dividing both sides by n-1 shows that e is an eigenvector of the covariance matrix (with eigenvalue \lambda / (n-1)). Moreover e^T S e = \lambda e^T e = \lambda, so to maximize it we choose the eigenvector with the largest eigenvalue.
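Here is a small numerical verification of the whole second step (my own sketch with assumed 2-D Gaussian data): build the scatter matrix S, compare it with the covariance matrix, and check that the eigenvector of S with the largest eigenvalue gives a smaller J_1 than arbitrary unit directions.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=200)
    m = X.mean(axis=0)
    Xc = X - m

    S = Xc.T @ Xc                            # scatter matrix (no 1/(n-1) factor)
    print("S / (n-1) equals np.cov:", np.allclose(S / (len(X) - 1), np.cov(X.T)))

    def J1(e):
        a = Xc @ e                           # a_k = e^T (x_k - m)
        return np.sum(np.linalg.norm(np.outer(a, e) - Xc, axis=1) ** 2)

    vals, vecs = np.linalg.eigh(S)           # symmetric matrix, eigenvalues ascending
    e_best = vecs[:, -1]                     # eigenvector with the largest eigenvalue
    print("J1 at the top eigenvector:", J1(e_best))
    for _ in range(3):                       # random unit directions do worse
        e = rng.normal(size=2); e /= np.linalg.norm(e)
        print("J1 at a random direction :", J1(e))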

Starting from a different idea, we end up with the same result: compute the eigenvectors of the covariance matrix and use them as the new coordinate axes, then re-express the samples in the new coordinates. The sample points gather closely around the new axis, which is exactly the meaning of the minimum squared error criterion we used.

4. The significance of PCA theory

PCA reduces n features to k and can therefore be used for data compression: if a 100-dimensional vector can in the end be represented with 10 dimensions, the compression rate is 90%. The KL transform in image processing likewise uses PCA for image compression. However, PCA should ensure that as little information as possible is lost after the dimensionality reduction. Look back at the effect of PCA: after PCA, projecting two-dimensional data down to one dimension can give either of the following two situations:

We consider the left one better: on the one hand the variance after projection is the largest, on the other hand the sum of squared point-to-line distances is the smallest, and the line passes through the center of the sample points. Why is the projection on the right worse? Intuitively, because the axes there are correlated, so once one axis is removed, a point's position can no longer be recovered from the single remaining coordinate.
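The following sketch (with assumed correlated Gaussian test data) makes the comparison numerical: projecting onto the largest-eigenvalue direction simultaneously gives the largest projected variance and the smallest sum of squared distances, while the other direction gives the opposite.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=300)
    Xc = X - X.mean(axis=0)

    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
    for name, e in [("good (largest-eigenvalue direction) ", eigvecs[:, -1]),
                    ("bad  (smallest-eigenvalue direction)", eigvecs[:, 0])]:
        a = Xc @ e                            # 1-D coordinates after projection
        recon = np.outer(a, e)                # projected points placed back in 2-D
        print(name,
              "| projected variance:", round(float(a.var(ddof=1)), 3),
              "| squared error:", round(float(np.sum((recon - Xc) ** 2)), 3))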

The k axes obtained by PCA are in fact k eigenvectors, and because the covariance matrix is symmetric, these k eigenvectors are orthogonal to each other. Look at the following calculation process.

Suppose we still use x to denote the samples: there are M samples and n features. The eigenvectors are denoted e, with e_{ij} the j-th component of the i-th eigenvector. The eigenvalue equations for the original samples can then be written together as

\frac{1}{M} A^T A \,[\, e_1 \ e_2 \ \cdots \ e_k \,] = [\, e_1 \ e_2 \ \cdots \ e_k \,] \, \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)

The product of the first two matrices, once divided by M, is the covariance matrix; the second of them is the original sample matrix A of size M×n (A is assumed to be centered).

The above formula can be abbreviated as

C E = E \Lambda

where C = \frac{1}{M} A^T A is the covariance matrix, E = [\, e_1 \ \cdots \ e_k \,] is the n×k matrix of eigenvectors, and \Lambda is the diagonal matrix of the corresponding eigenvalues.

The result of our final projection is A E, where E is the matrix formed by the k eigenvectors; written out,

A E = \begin{pmatrix} x_1^T e_1 & x_1^T e_2 & \cdots & x_1^T e_k \\ \vdots & & & \vdots \\ x_M^T e_1 & x_M^T e_2 & \cdots & x_M^T e_k \end{pmatrix}

The new sample matrix obtained in this way is the projection of the M samples onto the k eigenvectors; each new sample is given by its coordinates along these k eigenvectors, and the e_i are orthogonal to each other. As the matrix multiplication shows, the PCA transformation drops the original sample points (n-dimensional) into a k-dimensional orthogonal coordinate system and discards the information carried by the other dimensions. For example, suppose the universe is n-dimensional (Hawking says 11-dimensional) and we have the coordinates of every star in the Milky Way (n-dimensional vectors relative to the galactic center), but we want to approximate the sample points with two-dimensional coordinates. If the eigenvectors with the two largest eigenvalues of the covariance matrix correspond to the horizontal and vertical directions in the figure, then we take the galactic center as the origin, draw the x and y axes along those directions, and project all the stars onto them to obtain a flat star map. In doing so, however, we discard information such as how far each star is from us.
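The matrix view above can be sketched directly (assumed random data, with made-up shapes M = 500, n = 11, k = 2): the columns of E are mutually orthogonal eigenvectors, and A E is the M×k matrix of new coordinates.

    import numpy as np

    rng = np.random.default_rng(5)
    M, n, k = 500, 11, 2                      # e.g. an 11-dimensional "universe", keep 2 axes
    A = rng.normal(size=(M, n))               # sample matrix: M samples, n features
    A = A - A.mean(axis=0)                    # center the samples

    cov = (A.T @ A) / M                       # covariance matrix (divided by M, as in the text)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix: real, orthogonal eigenvectors
    E = eigvecs[:, ::-1][:, :k]               # n x k matrix of the top-k eigenvectors

    print("E^T E is the identity:", np.allclose(E.T @ E, np.eye(k)))   # mutually orthogonal
    Y = A @ E                                 # projected samples, M x k
    print("new sample matrix shape:", Y.shape)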

5. Summary and Discussion

This part comes from http://www.cad.zju.edu.cn/home/chenlu/pca.htm

One of the great benefits of PCA is dimensionality reduction of the data. We can sort the newly obtained "principal" vectors by importance, keep only the most important leading ones, and omit the later dimensions, thereby simplifying the model or compressing the data while retaining as much of the information in the original data as possible.
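A minimal sketch of this procedure (assumed synthetic data: 100 observed features driven by 10 latent factors): sort the eigenvalues of the covariance matrix, keep the leading k = 10 directions, and report the compression rate and the fraction of variance retained.

    import numpy as np

    rng = np.random.default_rng(6)
    Z = rng.normal(size=(2000, 10))                  # 10 underlying factors
    W = rng.normal(size=(10, 100))                   # mixed into 100 observed features
    X = Z @ W + 0.1 * rng.normal(size=(2000, 100))   # data that is "mostly" 10-dimensional

    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1] # eigenvalues sorted largest first

    k = 10
    print("compression rate :", 1 - k / 100)                      # 0.9, i.e. 90%
    print("variance retained:", eigvals[:k].sum() / eigvals.sum())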

Another great advantage of PCA is that it is completely free of parameter constraints. The PCA computation needs no manually set parameters and no empirical model; the final result depends only on the data and is independent of the user.
However, this can also be seen as a disadvantage. If the user has some prior knowledge of the observed objects and has already grasped some characteristics of the data, there is no way to inject that knowledge into the computation through parameters or other means, so the result may not be what is desired and the efficiency may be low.

Figure 4: The black dots represent the sampled data, arranged in the shape of a turntable.
It is easy to see that the principal element of this data is the rotation angle.

In the example of Figure 4, the principal elements found by PCA will be combinations of the Cartesian coordinates x and y. But these are obviously not the optimal, most parsimonious principal elements: x and y are related nonlinearly, and according to prior knowledge the rotation angle is the optimal principal element (think of polar coordinates). In such cases PCA fails. However, if prior knowledge is used to apply a suitable nonlinear transformation to the data first, the data can be mapped into a space where the relationship is linear. This kind of prior nonlinear transformation of the data is called kernel PCA; it expands the range of problems PCA can handle and can incorporate some prior constraints, and it is a popular method.
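A sketch of the turntable example (my own synthetic ring data; scikit-learn's KernelPCA with an RBF kernel is just one common choice of nonlinear transformation, not necessarily the one the source has in mind):

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA

    rng = np.random.default_rng(7)
    theta = rng.uniform(0, 2 * np.pi, size=400)       # the "rotation angle" principal element
    r = 1.0 + 0.05 * rng.normal(size=400)
    X = np.c_[r * np.cos(theta), r * np.sin(theta)]   # points arranged on a ring

    lin = PCA(n_components=2).fit(X)
    print("linear PCA explained variance ratios:", lin.explained_variance_ratio_)
    # Both ratios are close to 0.5: no single linear axis summarizes the ring.

    Z = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)
    print("kernel PCA coordinates shape:", Z.shape)   # nonlinear features of the same points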

Sometimes the distribution of the data does not satisfy a Gaussian distribution. As shown in Figure 5, in the non-Gaussian case the principal elements found by PCA may not be optimal, and variance can no longer be used as the measure of importance when searching for principal elements. Instead, one chooses variables that describe the full distribution according to the data, and uses the joint probability distribution P(x, y) to measure the correlation of the data along two directions. Equivalently, while keeping the orthogonality assumption between the principal elements, the elements sought are additionally required to be statistically independent, i.e. P(x, y) = P(x)P(y). This class of methods is called independent component analysis (ICA).

Figure 5: The distribution of the data does not satisfy a Gaussian distribution, showing a distinct cross-star shape.
In this case, the direction of maximum variance is not the optimal principal element direction.
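For completeness, a sketch of the ICA idea on assumed non-Gaussian signals (using scikit-learn's FastICA as one available implementation; the signals and mixing matrix are made up for illustration):

    import numpy as np
    from sklearn.decomposition import FastICA, PCA

    rng = np.random.default_rng(8)
    t = np.linspace(0, 8, 2000)
    s1 = np.sign(np.sin(3 * t))               # square wave: strongly non-Gaussian
    s2 = rng.laplace(size=t.size)             # heavy-tailed noise: non-Gaussian
    S = np.c_[s1, s2]
    A = np.array([[1.0, 0.5], [0.4, 1.0]])    # made-up mixing matrix
    X = S @ A.T                               # observed mixtures

    S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # independence criterion
    S_pca = PCA(n_components=2).fit_transform(X)                       # variance criterion
    print("ICA components:", S_ica.shape, "| PCA components:", S_pca.shape)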

In addition, PCA can also be used to predict missing elements in matrices.

6. Other References

A Tutorial on Principal Components Analysis, L. I. Smith, 2002

A Tutorial on Principal Component Analysis, J. Shlens

http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/PCAMissingData.pdf

http://www.cad.zju.edu.cn/home/chenlu/pca.htm

"Reprint" principal component Analysis (Principal)-Minimum squared error interpretation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.