Matrix Methods in Machine Learning 04: SVD Decomposition


Earlier we discussed QR decomposition, which has some nice properties, but QR decomposition operates on the matrix from one side only (left-multiplication by a unitary matrix), from which the column space can be obtained. The SVD of this section treats rows and columns alike: the matrix is left-multiplied by one unitary matrix and right-multiplied by another, from which more interesting information can be derived. Singular value decomposition (SVD) is used to compute the pseudoinverse of a matrix, to find the optimal solution of least squares problems, for matrix approximation, and to determine a matrix's column space, its rank, and the solution (null) space of a linear system.

1. The form of SVD

For an arbitrary m x n matrix A, the SVD decomposes the matrix into the product of three special matrices (someone abroad even made a video, the "SVD song"):

A = U S V^T

where U (m x m) and V (n x n) are unitary matrices and S (m x n) is a diagonal matrix. Note that a unitary matrix only changes the coordinate system, so the shape of the data distribution itself does not change, while the diagonal matrix stretches or compresses the data. Because m >= n, the decomposition can also be written in the following thin (economy) SVD form, where U1 consists of the first n columns of U and S1 is the top n x n block of S:

A = U1 S1 V^T
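As an aside (my own sketch, not from the original post), both forms can be checked directly with NumPy's np.linalg.svd; the 5 x 3 matrix below is an arbitrary example:

import numpy as np

m, n = 5, 3                                      # an arbitrary tall matrix, m >= n
A = np.random.randn(m, n)

# full SVD: U is m x m, s holds the n singular values, Vt is n x n
U, s, Vt = np.linalg.svd(A, full_matrices=True)
S = np.zeros((m, n))
S[:n, :n] = np.diag(s)
print(np.allclose(U @ S @ Vt, A))                # True

# thin (economy) SVD: U1 is m x n, S1 is n x n
U1, s1, Vt1 = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U1 @ np.diag(s1) @ Vt1, A))    # True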

2. The geometrical interpretation of SVD

Consider the simple case where A is a 2x2 matrix. We know that left-multiplying a geometric shape by a matrix A rotates, reflects, stretches, or shears the shape; in the shear case, A is called a shear matrix. For example, left-multiplying a circle by A can produce a rotated ellipse.

For example, suppose A = [[2, 1], [1, 2]] (the transformation matrix used in the code below) transforms a circle in the plane:

If we look at matrix A only, we can hardly see how the circle is transformed. The circle before the transformation is:

The colors c, m, y, k (cyan, magenta, yellow, black) represent the first, second, third, and fourth quadrants respectively. Left-multiplying by the matrix A gives:

Our questions are: by how many degrees is the circle rotated? Along which directions is it scaled? And what is the maximum scaling ratio?

Use this online tool to perform SVD decomposition of A:

Clearly, the A-transform first rotates clockwise by 45° (which can also be seen as rotating the coordinate axes counter-clockwise by 45°, the principal component direction) and then applies a reflection that negates the first coordinate (the first row is multiplied by -1); this is the left-multiplication by V^T:

Then it stretches by a factor of 3 in the x direction (left-multiplication by S):

Finally, it rotates clockwise by 45° and applies the same reflection again (the first row is multiplied by -1; the two reflection operations cancel each other out); this is the left-multiplication by U:
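As a quick numerical check (my own sketch; the original post used an online tool), the snippet below computes the SVD of A = [[2, 1], [1, 2]] and verifies that "rotate clockwise by 45°, negate the first coordinate, stretch by (3, 1), then rotate and negate again" really reproduces A:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)
print(s)                                        # [3. 1.]: the stretch factors

theta = -np.pi / 4                              # clockwise 45-degree rotation
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
flip = np.diag([-1.0, 1.0])                     # negate the first coordinate
Vt_geo = flip @ rot                             # rotate, then reflect
U_geo = flip @ rot                              # A is symmetric, so U reads the same way
S_geo = np.diag([3.0, 1.0])
print(np.allclose(U_geo @ S_geo @ Vt_geo, A))   # True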

The Python code for the above drawing process is as follows:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-

from pylab import *
import numpy as np


def plotcircle(before, M=matrix([[1, 0], [0, 1]])):
    '''before: matrix before the transformation
       M: transformation matrix, identity matrix by default
       returns the matrix after the transformation'''
    eclMat = M * before                      # apply the transformation M
    eclX = array(eclMat[0]).reshape(-1)
    eclY = array(eclMat[1]).reshape(-1)
    axis('equal')
    axis([-3, 3, -3, 3])
    grid(True)
    plot(eclX[:25],    eclY[:25],    'c', linewidth=3)
    plot(eclX[25:50],  eclY[25:50],  'm', linewidth=3)
    plot(eclX[50:75],  eclY[50:75],  'y', linewidth=3)
    plot(eclX[75:100], eclY[75:100], 'k', linewidth=3)
    show()
    return eclMat

ang = linspace(0, 2*pi, 100)
x = cos(ang)
y = sin(ang)
cirMat = matrix([x, y])                      # 2x100 circle matrix

# draw the original figure: the circle
plotcircle(cirMat)

# draw the ellipse after the transformation
M = matrix([[2, 1], [1, 2]])                 # 2x2 transformation matrix
eclMat = plotcircle(cirMat, M)

# SVD decomposition of the matrix M
U, s, V = np.linalg.svd(M, full_matrices=True)
S = np.diag(s)

# apply the SVD factors to the circle one step at a time
plotcircle(cirMat)
tran1 = plotcircle(cirMat, V)
tran2 = plotcircle(tran1, S)
tran3 = plotcircle(tran2, U)

3. SVD vector space

Assume the rank of matrix A is r; then the rank of the diagonal matrix S is also r (multiplying by a unitary matrix does not change the rank of a matrix). Write the singular values as s1 >= s2 >= ... >= sr > 0 (the remaining diagonal entries of S are zero), and let u1, ..., um and v1, ..., vn denote the columns of U and V:

    1. What is the null space of A, that is, the solution space of Ax = 0? The answer is:

       N(A) = span{v(r+1), ..., vn}

      The proof is very simple: substitute the three SVD factors for A. For x = vi with i > r, the orthonormality of the columns of V gives V^T vi = ei, and the only nonzero entry of ei lands exactly on a zero diagonal entry of S, so Ax = 0. If A is full rank, clearly only the zero vector satisfies the condition. In the same way one obtains the null space of A' (spanned by u(r+1), ..., um).

    2. What is the column space (range) of matrix A? Suppose A is a very tall matrix; its column vectors are in general not orthogonal. For any coordinate vector x, the range of matrix A is defined as:

      R(A) = {y | y = Ax, x is any coordinate vector}
      Then u1, u2, ..., ur form an orthonormal basis of R(A).
      This is also intuitive: whatever the vector (or coordinates) x, after the V^T and diagonal-matrix transformations it is still just a coordinate vector, and that coordinate vector holds the coefficients of a linear combination of the columns of U; since only its first r entries can be nonzero, the result always lies in the span of u1, ..., ur (see the sketch below).
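A minimal NumPy sketch of both facts, using an arbitrarily chosen 4x3 matrix of rank 2 (my example, not the post's):

import numpy as np

# a 4x3 matrix of rank 2: the third column is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A)
tol = 1e-10
r = np.sum(s > tol)                      # numerical rank, here 2
print(r)

# null space of A: the last n - r rows of Vt (i.e. columns of V)
null_basis = Vt[r:].T                    # shape (3, n - r)
print(np.allclose(A @ null_basis, 0))    # True

# column space (range) of A: the first r columns of U
range_basis = U[:, :r]                   # shape (4, r)
# every column of A is a combination of these r orthonormal vectors
print(np.allclose(range_basis @ (range_basis.T @ A), A))   # True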

4. Calculation of SVD

This amounts to diagonalizing the positive semi-definite matrix A'A. The calculation process is as follows (a small numerical sketch follows the list):

    1. Compute A' (the transpose of A) and A'A.
    2. Compute the eigenvalues of A'A, arrange them in descending order, and take their square roots to obtain the singular values of A.
    3. Construct the diagonal matrix S from the singular values, and also compute S inverse for the later step.
    4. For the eigenvalues in the above order, find the corresponding orthonormal eigenvectors; they form the matrix V, and transposing gives V'.
    5. Compute U = A V S-inverse, and the decomposition is complete.
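Here is a minimal numerical sketch of the recipe above, assuming A has full column rank so that S is invertible; the 2x2 example matrix is the one from the geometry section:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# 1. A' and A'A
AtA = A.T @ A

# 2. eigenvalues of A'A in descending order; singular values are their square roots
eigvals, eigvecs = np.linalg.eigh(AtA)        # eigh returns them in ascending order
order = np.argsort(eigvals)[::-1]
sing_vals = np.sqrt(eigvals[order])

# 3. diagonal matrix S and its inverse
S = np.diag(sing_vals)
S_inv = np.diag(1.0 / sing_vals)

# 4. eigenvectors in the same order form V
V = eigvecs[:, order]

# 5. U = A V S^-1
U = A @ V @ S_inv

print(np.allclose(U @ S @ V.T, A))            # True: A is recovered
print(sing_vals)                               # compare with np.linalg.svd(A)[1]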

The Lanczos algorithm is an iterative method for this computation, which I have not had time to study closely.

Numerical computation of the SVD feels like deep water; I will study it when I actually need it and will not go deeper for now.

5. PCA

Principal component analysis (PCA) is often used to reduce the dimensionality of a dataset while keeping the features that contribute most to the variation in the data. Suppose a very large number of points are distributed evenly on the ellipse above (two-dimensional space, two coordinate values): how can we separate these points as much as possible in a one-dimensional space (a single coordinate value)? This is where PCA is used:

As you can see, when all the points are projected onto the +45° line, the two-dimensional space is mapped to a one-dimensional space in which the points are separated as much as possible; that is, the sample variance after projection is maximal, and this is the direction of the vector v1. Suppose the 100x2 matrix X holds the coordinates of the sample points in the plane; projecting the points onto v1 retains the largest sample variance:

X v1 (a 100x1 vector of one-dimensional coordinates)

In this way, 100 sample points in 2-D space are compressed into 100 sample points in 1-D space; this column compression actually compresses the features rather than simply discarding some of them. Is PCA only able to compress when the number of sample points exceeds the number of features? Not necessarily: for example, suppose there are 3 samples, each with 100-dimensional features, represented by a 100x3 matrix; decompose it by SVD into the product of three matrices and make the following transformation:

This also retains the principal components to the greatest extent.

To sum up: for a matrix A, if you want to compress the rows while retaining the principal components, left-multiply by U1'; if instead you want to compress the columns while retaining the principal components (which reminds me of sparse representation), right-multiply by v1.
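A small sketch of both directions (my own illustration with made-up data; note that in standard PCA the data are mean-centered first, a step the post glosses over):

import numpy as np

rng = np.random.default_rng(0)

# 100 sample points in the plane, stretched along the +45 degree direction
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [1.0, 2.0]])
Xc = X - X.mean(axis=0)                   # center the data (usual PCA preprocessing)

# column compression: project each 2-D sample onto the first right singular vector v1
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]                                # direction of largest sample variance
X_1d = Xc @ v1                            # 100 one-dimensional coordinates
print(X_1d.shape)                         # (100,)

# row compression: 3 samples with 100-dimensional features, stored as a 100x3 matrix
A = rng.normal(size=(100, 3))
Ua, sa, Vta = np.linalg.svd(A, full_matrices=False)
k = 2                                     # keep the first k principal directions
U1 = Ua[:, :k]
A_compressed = U1.T @ A                   # k x 3: each sample is now k-dimensional
print(A_compressed.shape)                 # (2, 3)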

Of course, the above simply retains the first principal component. In full, PCA works by finding a set of orthogonal axes in the original space: the first axis is chosen to maximize the variance, the second axis maximizes the variance among directions orthogonal to the first, the third axis maximizes the variance among directions orthogonal to the first two, and so on. In an n-dimensional space we can find n such axes; we take the first r of them to approximate the space, thereby compressing an n-dimensional space to an r-dimensional one, and this choice of r axes makes the loss of information caused by the compression as small as possible.

6. Least squares problem

The idea is similar to the QR-decomposition approach and mainly uses the fact that multiplication by a unitary matrix does not change lengths:

||Ax - b|| = ||U S V^T x - b|| = ||S V^T x - U^T b||

The optimal solution is:

x = V S^+ U^T b = A^+ b

where S^+ inverts the nonzero singular values (and leaves the zero ones as zero), and A^+ is the pseudoinverse mentioned at the beginning.
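As a sanity check (my sketch, with an arbitrary overdetermined system), the SVD-based solution agrees with NumPy's built-in least squares solver and with the pseudoinverse:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))               # overdetermined system: 6 equations, 3 unknowns
b = rng.normal(size=6)

# least squares via the SVD: x = V S^+ U' b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1.0 / s) @ (U.T @ b)

# reference solutions
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(x_svd, x_lstsq))        # True
print(np.allclose(x_svd, x_pinv))         # True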
