PCA and Singular Value Decomposition


Singular value decomposition (SVD) and principal component analysis (PCA)

[Reprint] Original Source: http://blog.sina.com.cn/s/blog_4b16455701016ada.html

 

The PCA problem is actually a change of basis that makes the transformed data have the largest possible variance. Variance describes how much information a variable carries. When we talk about the stability of a thing, we often say we want to reduce its variance: a model with large variance is an unstable model. For the data used in machine learning (mainly training data), however, we want the variance to be large. Otherwise, if all the input data were the same point, the variance would be 0, and many input points would be equivalent to a single data point. The following figure shows an example:

Suppose a camera records the motion of an object, and the points in the figure indicate the object's positions. If we want to fit these points with a straight line, what direction should we choose? Naturally, the line labeled "signal" in the figure. If we simply project these points onto the x or y axis, the variances along the two axes are similar (because the points trend at roughly 45 degrees, projecting onto x or y gives about the same spread), so looking at the points in the original x-y coordinate system does not easily reveal their true direction. However, if we change the coordinate system so that the horizontal axis points in the signal direction and the vertical axis points in the noise direction, it becomes easy to see which direction has large variance and which has small variance.

 

Generally, the direction of large variance is the signal direction, and the direction of small variance is the noise direction. In data mining and digital signal processing we often want to increase the ratio of signal to noise, that is, the signal-to-noise ratio. For example, if we keep only the data in the signal direction, we can still obtain a good approximation of the original data.
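To make this concrete, here is a small sketch in Python/NumPy (the data and variable names are invented for illustration, not taken from the original post): it generates noisy points along a roughly 45-degree line, finds the direction of largest variance, and compares the variance along the signal and noise directions.

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.normal(size=200)                          # spread along the 45-degree "signal" line
    noise = 0.1 * rng.normal(size=200)                # small spread perpendicular to it
    points = np.column_stack([t + noise, t - noise])  # 200 x 2 array of (x, y) positions

    centered = points - points.mean(axis=0)

    # Projected onto the original x and y axes, the variances are similar.
    print("variance along x, y:", centered.var(axis=0))

    # Eigenvectors of the covariance give the signal and noise directions.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    noise_dir, signal_dir = eigvecs[:, 0], eigvecs[:, 1]
    print("variance along signal:", (centered @ signal_dir).var())
    print("variance along noise: ", (centered @ noise_dir).var())

    # Keeping only the signal component still approximates the points well.
    approx = np.outer(centered @ signal_dir, signal_dir) + points.mean(axis=0)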

 

All that PCA does is to find, one by one, a set of orthogonal coordinate axes in the original space. The first axis is the direction of maximum variance; the second axis is the direction of maximum variance within the subspace orthogonal to the first axis; the third axis is the direction of maximum variance within the subspace orthogonal to the first two axes; and so on. In an n-dimensional space we can find n such axes. If we take the first r of them to approximate the space, we compress the n-dimensional space down to an r-dimensional one, and the r axes we selected are the ones that minimize the data loss caused by this compression. A minimal sketch of this procedure is shown below.
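The sketch assumes the usual covariance-eigenvector formulation of PCA; the function and variable names are illustrative. It sorts the eigenvectors of the covariance matrix by decreasing eigenvalue, and the first r of them are the r axes described above.

    import numpy as np

    def pca_axes(X, r):
        # Top-r orthogonal axes of maximum variance for data X (one sample per row).
        Xc = X - X.mean(axis=0)                  # center each feature
        cov = np.cov(Xc, rowvar=False)           # n x n covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]        # re-sort by decreasing variance
        return eigvecs[:, order[:r]]             # n x r matrix, one axis per column

    # Example: approximate 5-dimensional samples in a 2-dimensional space.
    X = np.random.default_rng(1).normal(size=(100, 5))
    P = pca_axes(X, r=2)
    X_compressed = (X - X.mean(axis=0)) @ P      # 100 x 2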

 

Assume that each row of the matrix represents a sample and each column represents a feature. In matrix language, changing the coordinate axes of an m × n matrix A means multiplying it by a transformation matrix P that maps one n-dimensional space to another n-dimensional space; in that space the transformation amounts to something like a rotation and a stretch.
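In symbols, with A the m × n data matrix and P an n × n transformation matrix, this change of basis is:

$$ A_{m \times n} \, P_{n \times n} = \tilde{A}_{m \times n} $$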

If instead P has only r columns, the m × n matrix A is transformed into an m × r matrix: the n original features are turned into r features (r < n). These r features are in effect a refinement of the n features, and we call this a compression of the features. In mathematical language:
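With P now an n × r matrix whose columns are the r chosen axes:

$$ A_{m \times n} \, P_{n \times r} = \tilde{A}_{m \times r} $$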

But how does this relate to SVD? The singular vectors obtained earlier from SVD are arranged in descending order of the singular values. From the PCA point of view, the coordinate axis with the largest variance corresponds to the first singular vector, the axis with the second-largest variance corresponds to the second singular vector, and so on. Let's take a look at the SVD formula:
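Here, keeping only the top r singular values, A is m × n, U is m × r with orthonormal columns, Σ is the r × r diagonal matrix of singular values, and V is n × r with orthonormal columns:

$$ A_{m \times n} \approx U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n} $$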

Multiply both sides of this equation on the right by V. Because V has orthonormal columns, V transposed times V gives the identity matrix I, so the expression can be converted into the following form:
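$$ A_{m \times n} \, V_{n \times r} \approx U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n} \, V_{n \times r} = U_{m \times r} \, \Sigma_{r \times r} $$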

Compare this with the earlier formula that turned the m × n matrix A, multiplied by P, into an m × r matrix: here V plays exactly the role of P, the transformation matrix. So we have compressed an m × n matrix into an m × r matrix, that is, we have compressed the columns. What if we want to compress rows instead? (From PCA's point of view, compressing rows can be understood as merging similar samples or removing samples of little value.) Similarly, we can write a generic formula for row compression:
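With P now an r × m transformation matrix acting on the left (the analogue of the column case above):

$$ P_{r \times m} \, A_{m \times n} = \tilde{A}_{r \times n} $$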

In this way we compress a matrix of m rows into a matrix of r rows. The same can be done with SVD: we multiply both sides of the SVD decomposition on the left by the transpose of U, written U^T (or U'):
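Because U^T U = I:

$$ U^{T}_{r \times m} \, A_{m \times n} \approx U^{T}_{r \times m} \, U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n} = \Sigma_{r \times r} \, V^{T}_{r \times n} $$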

In this way we get the row-compression formula. It can be seen that PCA is almost a wrapper around SVD: if SVD is implemented, PCA is implemented. What is even better is that with SVD we get PCA in both directions, whereas if we only compute the eigendecomposition of A^T A, we get PCA in one direction only.
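A small sketch of this two-way use of SVD in Python/NumPy (the data, names, and the choice of r are illustrative): a single SVD of the centered data gives both the column compression A V and the row compression U^T A, while the eigendecomposition of A^T A recovers only the column-direction result.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(100, 20))                      # 100 samples (rows), 20 features (columns)
    Ac = A - A.mean(axis=0)                             # center the features

    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)   # singular values come out in descending order
    r = 5
    Ur, Sr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

    # Column (feature) compression: A V  ->  100 x r
    features_compressed = Ac @ Vr                       # equals Ur @ Sr
    # Row (sample) compression:   U^T A  ->  r x 20
    samples_compressed = Ur.T @ Ac                      # equals Sr @ Vr.T

    # The eigendecomposition of A^T A recovers only V, i.e. PCA in the feature direction.
    eigvals, eigvecs = np.linalg.eigh(Ac.T @ Ac)
    V_from_eig = eigvecs[:, np.argsort(eigvals)[::-1][:r]]   # matches Vr up to column signs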

 

 

 
