# Mathematics in Machine Learning (5): Powerful Matrix Singular Value Decomposition (SVD) and Its Applications

Source: Internet
Author: User

Preface:

Last time I wrote about PCA and LDA. PCA is generally implemented in one of two ways: through eigenvalue decomposition or through singular value decomposition. The previous article explained it in terms of eigenvalue decomposition. For most people, eigenvalues and singular values remain stuck at the level of pure mathematical calculation, and linear algebra or matrix theory courses seldom mention any application background for them. Yet singular value decomposition is a method with a very clear physical meaning: it expresses a fairly complex matrix as the product of a few smaller, simpler matrices, and those small matrices describe the important characteristics of the original. It is like describing a person: tell someone that the person has bushy eyebrows, a square face, a full beard, and black-framed glasses, and with just those few features the listener already has a fairly clear picture. Human faces vary in infinitely many ways, and we can describe them this way only because people are born with a very good ability to extract important features. SVD is an important method for teaching machines to extract important features too.

Many applications in machine learning are related to singular values, such as PCA for feature reduction, data compression algorithms (image compression being the typical example), and LSI (Latent Semantic Indexing) for semantic-level retrieval in search engines.

A side complaint: when I used to search Baidu for SVD, the results were all about a Russian sniper rifle (from the same era as the AK-47), because the game CrossFire has a sniper rifle called the SVD, whereas a Google search returns singular value decomposition (mostly English material). If you want to play a war game, isn't Call of Duty good enough? What is the point of playing a knock-off of CS? Discussion on Chinese web pages is likewise dominated by posts with little substance. I sincerely hope the atmosphere here can become more serious: that people who make games really love making games, and people who do data mining really love mining data, rather than doing it only to make a living. Only then does talk of surpassing others mean anything. There are too few Chinese articles that discuss technology properly; changing that starts with me.

With that said, this article focuses on some properties of singular values and only briefly touches on how to compute them; I do not intend to go into the computation in depth. It also relies on some linear algebra, none of it very deep, but if you have completely forgotten linear algebra, this article may be hard to follow.

I. Basics of eigenvalues and singular values:

Eigenvalue decomposition and singular value decomposition are both methods seen everywhere in machine learning. The two are closely related, and as I will explain below, they serve the same purpose: extracting the most important features of a matrix. Let's start with eigenvalue decomposition:

1) Eigenvalues:

If a vector v is an eigenvector of a square matrix A, it can be written in the following form:

$$A v = \lambda v$$

Here λ is called the eigenvalue corresponding to the eigenvector v. For a symmetric matrix, the eigenvectors can be chosen to form a set of orthogonal vectors. Eigenvalue decomposition factors a matrix into the following form:

$$A = Q \Sigma Q^{-1}$$

where Q is the matrix whose columns are the eigenvectors of A, and Σ is a diagonal matrix whose diagonal elements are the eigenvalues. I have borrowed some material from the references to illustrate this. First of all, to be clear, a matrix is really a linear transformation, because multiplying a vector by a matrix yields another vector, which amounts to applying a linear transformation to the original vector. For example, consider the following matrix:

It actually corresponds to a linear transformation in the following form:

This is because multiplying the matrix M by a vector (x, y) gives:

The matrix above is symmetric (in fact diagonal), so the transformation is a stretch along the x and y axes: each diagonal element stretches one dimension, elongating it when the value is greater than 1 and shortening it when the value is less than 1. When the matrix is not symmetric, for example if it looks like this:

The transformation it describes looks like this:

This is a stretch of the plane along one axis (shown by the blue arrow). In the figure, the blue arrow marks the most important direction of change (there may be more than one direction of change). If we want to describe a transformation well, we just need to describe its main directions of change. Looking back at the eigenvalue decomposition above, Σ is a diagonal matrix whose eigenvalues are arranged from largest to smallest, and the corresponding eigenvectors describe the directions of change of the matrix, ordered from the most important change to the least.
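To make the idea of a "main direction of change" concrete, here is a minimal sketch. The two matrices are hypothetical examples of my own (a diagonal stretch and a non-symmetric shear), not necessarily the ones in the original figures, and the brute-force search over directions is only for illustration, not an algorithm from this article.

```python
import numpy as np

def max_stretch(M, samples=3600):
    """Brute force: return (largest stretch factor, unit direction achieving it)."""
    angles = np.linspace(0.0, np.pi, samples, endpoint=False)
    directions = np.stack([np.cos(angles), np.sin(angles)])   # shape 2 x samples
    lengths = np.linalg.norm(M @ directions, axis=0)          # length of each image
    best = np.argmax(lengths)
    return lengths[best], directions[:, best]

# Hypothetical example matrices, chosen only for illustration.
M_diag = np.array([[3.0, 0.0], [0.0, 1.0]])    # stretches x by 3, leaves y alone
M_shear = np.array([[1.0, 1.0], [0.0, 1.0]])   # non-symmetric shear

# A matrix times a vector is a linear transformation: (x, y) -> (3x, y).
print(M_diag @ np.array([1.0, 1.0]))

stretch_diag, dir_diag = max_stretch(M_diag)
stretch_shear, dir_shear = max_stretch(M_shear)
print("diagonal:", stretch_diag, dir_diag)
print("shear:   ", stretch_shear, dir_shear)
```

For the diagonal matrix the strongest stretch lies along the x-axis, as expected; for the shear it lies along a tilted direction, which is the kind of direction the blue arrow describes.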

When the matrix is high-dimensional, it is a linear transformation in a high-dimensional space. Such a transformation can no longer be drawn, but we can imagine that it too has many directions of change. The first N eigenvectors obtained from eigenvalue decomposition correspond to the N most important directions of change of the matrix, and we can approximate the matrix (the transformation) using just those N directions. This is what I meant earlier by extracting the most important features of the matrix. To sum up, eigenvalue decomposition yields eigenvalues and eigenvectors: an eigenvalue says how important a feature is, and the eigenvector says what the feature is. Each eigenvector can be understood as a linear subspace, and we can do many things with these subspaces. However, eigenvalue decomposition also has many limitations; for instance, the matrix being decomposed must be square.
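The summary above can be tried out numerically. The following is a minimal sketch using NumPy, with a small symmetric matrix that I made up for illustration; it checks the eigenvalue equation, rebuilds the matrix from its decomposition, and then approximates it using only the leading directions.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, Q = np.linalg.eigh(A)
order = np.argsort(np.abs(eigvals))[::-1]      # largest magnitude first
eigvals, Q = eigvals[order], Q[:, order]

# Check A v = lambda v for the leading eigenpair.
v, lam = Q[:, 0], eigvals[0]
residual = np.linalg.norm(A @ v - lam * v)

# Rebuild A from Q Sigma Q^{-1}; for a symmetric A, Q^{-1} equals Q^T.
A_full = Q @ np.diag(eigvals) @ Q.T

def approximate(k):
    """Approximate A using only its k leading eigen-directions."""
    return Q[:, :k] @ np.diag(eigvals[:k]) @ Q[:, :k].T

errors = [np.linalg.norm(A - approximate(k)) for k in (1, 2, 3)]
print("residual:", residual)
print("approximation errors for k = 1, 2, 3:", errors)
```

The error shrinks as more directions are kept and vanishes (up to rounding) once all of them are used.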

(Having said so much about eigenvalue decomposition, I am not sure whether it is clear; comments are welcome.)

2) Singular values:

Now for singular value decomposition. Eigenvalue decomposition is a very good way to extract the features of a matrix, but it applies only to square matrices. In the real world, most matrices we encounter are not square: for example, with n students who each have scores in m subjects, the resulting n × m matrix is generally not square. How can we describe the important features of such an ordinary matrix? Singular value decomposition does exactly this; it is a decomposition that applies to any matrix:

$$A = U \Sigma V^T$$

Suppose A is an n × m matrix. Then U is an n × n square matrix (its column vectors are orthogonal and are called the left singular vectors), Σ is an n × m matrix (all entries off the diagonal are 0, and the diagonal entries are called the singular values), and $V^T$ (the transpose of V, also written V') is an m × m matrix whose vectors are likewise orthogonal; the vectors in V are called the right singular vectors. The sizes of the matrices in this product can be pictured as blocks being multiplied together.
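The shapes are easy to check in code. Below is a minimal sketch using `numpy.linalg.svd` on a random matrix; the sizes (5 students, 3 subjects) are made up. Note that NumPy returns the singular values as a 1-D array, so the n × m matrix Σ has to be assembled by hand, and that the third factor comes out m × m.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                        # made-up sizes: 5 students, 3 subjects
A = rng.standard_normal((n, m))

# full_matrices=True returns the square U (n x n) and V^T (m x m).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# s holds only the singular values, sorted from largest to smallest;
# place them on the diagonal of an n x m matrix to obtain Sigma.
k = min(n, m)
Sigma = np.zeros((n, m))
Sigma[:k, :k] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)
print("reconstruction matches A:", np.allclose(U @ Sigma @ Vt, A))
print("columns of U are orthonormal:", np.allclose(U.T @ U, np.eye(n)))
```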

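So how do singular values correspond to eigenvalues? First, multiply the transpose of A by A to get a square matrix $A^T A$. Computing the eigenvalues of this square matrix gives:

$$(A^T A) v_i = \lambda_i v_i$$

The $v_i$ obtained here are the right singular vectors mentioned above. In addition we can obtain:

$$\sigma_i = \sqrt{\lambda_i}, \qquad u_i = \frac{1}{\sigma_i} A v_i$$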

Here σ is the singular value mentioned above and u is the left singular vector. As with eigenvalues, the singular values in Σ are arranged from largest to smallest, and they often decay very quickly: in many cases the largest 10% or even 1% of the singular values account for more than 99% of the sum of all singular values. In other words, we can approximate the matrix using only the r largest singular values, which defines the partial (truncated) singular value decomposition:

$$A_{n \times m} \approx U_{n \times r} \Sigma_{r \times r} V^T_{r \times m}$$

Here r is a number much smaller than n and m, so the matrix multiplication looks like this:

The product of the three matrices on the right is a matrix close to A, and the larger r is (up to the rank of A), the closer the product is to A. The combined area of these three matrices (from a storage point of view, the smaller the matrices, the less storage they need) is far smaller than that of the original matrix A, so if we want to represent A in compressed form, it is enough to store the three matrices U, Σ, and V.
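Here is a small sketch of the truncated decomposition and the storage saving it brings. The data are synthetic and the sizes made up: I build a nearly low-rank matrix on purpose so that the singular values decay quickly; how fast they decay on real data depends entirely on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, true_rank = 200, 100, 5      # made-up sizes for illustration

# Synthetic data: a rank-5 matrix plus a little noise.
A = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))
A += 0.01 * rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(r):
    """Rebuild A from its r largest singular values only."""
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def relative_error(r):
    return np.linalg.norm(A - truncate(r)) / np.linalg.norm(A)

r = 5
share = s[:r].sum() / s.sum()            # fraction of the singular value sum kept
stored_full = n * m                      # numbers needed to store A itself
stored_truncated = n * r + r + r * m     # U (n x r), r singular values, V^T (r x m)

print("share of singular value sum in the top", r, ":", share)
print("relative error for r = 2 and r = 5:", relative_error(2), relative_error(5))
print("numbers stored:", stored_full, "vs", stored_truncated)
```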

II. Computing singular values:

Computing singular values is a hard problem; the algorithm is O(N^3). On a single machine this is no problem at small scale: MATLAB can compute all the singular values of a 1000 × 1000 matrix in about a second. But the cost grows cubically with the size of the matrix, so larger problems call for parallel computation. Mr. Wu of Google mentioned in his Beauty of Mathematics series, when discussing SVD, that Google had implemented a parallel SVD algorithm and called it a contribution to mankind, but he gave neither the scale of the computation nor much other useful detail.

In fact, SVD can be implemented in parallel. Large-scale matrices are generally solved with iterative methods, and when the matrix is very large (for example, hundreds of billions), the number of iterations may also reach the hundreds of billions. If we solve this with the Map-Reduce framework, every Map-Reduce pass involves writing and reading files. My personal guess is that, besides Map-Reduce, Google's cloud computing system also has a computation model similar to MPI, in which nodes stay in communication and data stays resident in memory. For problems with many iterations, such a model would be much faster than Map-Reduce.

Lanczos iteration is a method for computing some of the eigenvalues of a symmetric square matrix (as mentioned earlier, the eigenvectors of the symmetric matrix A'A are the right singular vectors of A). It transforms the symmetric matrix into a tridiagonal matrix and then solves that. According to some material online, this appears to be how Google does its singular value decomposition. See the papers cited on Wikipedia; if you understand those papers, you can "almost" build an SVD yourself.
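To give a feel for the idea, here is a minimal textbook-style sketch of Lanczos iteration applied to A'A. It is my own illustration, not Google's implementation or any library's: the function name `lanczos` and the test matrix are made up, and it uses full reorthogonalization, which is affordable only at toy sizes. The square roots of the eigenvalues of the small tridiagonal matrix approximate the singular values of A.

```python
import numpy as np

def lanczos(matvec, dim, steps, rng):
    """Run Lanczos on a symmetric operator and return the tridiagonal matrix T."""
    Q = np.zeros((dim, steps))           # Lanczos vectors, one per column
    alpha = np.zeros(steps)              # diagonal of T
    beta = np.zeros(steps)               # off-diagonal of T
    q = rng.standard_normal(dim)
    Q[:, 0] = q / np.linalg.norm(q)
    used = steps
    for j in range(steps):
        w = matvec(Q[:, j])
        alpha[j] = Q[:, j] @ w
        # Full reorthogonalization against all Lanczos vectors so far
        # (costly, but keeps this small sketch numerically stable).
        w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        beta[j] = np.linalg.norm(w)
        if j + 1 == steps:
            break
        if beta[j] < 1e-10:              # invariant subspace found, stop early
            used = j + 1
            break
        Q[:, j + 1] = w / beta[j]
    off = beta[:used - 1]
    return np.diag(alpha[:used]) + np.diag(off, 1) + np.diag(off, -1)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))         # made-up data

# Apply A'A as two matrix-vector products, without ever forming A'A.
T = lanczos(lambda x: A.T @ (A @ x), dim=8, steps=8, rng=rng)

lanczos_sigma = np.sqrt(np.sort(np.linalg.eigvalsh(T))[::-1])
reference_sigma = np.linalg.svd(A, compute_uv=False)
print(lanczos_sigma[:3])
print(reference_sigma[:3])
```

In this toy run the number of steps equals the dimension, so the agreement is essentially exact. On large problems one runs far fewer steps than the dimension, and only the extreme eigenvalues are approximated well, which is why the method is suited to computing a partial decomposition.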

Because computing singular values is a rather dull, purely mathematical process, and earlier research (in the papers) has already given practically the whole procedure as a flowchart, further details on the computation are left to the references below. I will not go deeper here; my focus remains on the applications of singular values.

III. Singular values and principal component analysis (PCA):

Principal component analysis was also covered in the previous article; here I discuss how to use SVD to solve the PCA problem. PCA is really a change of basis that makes the transformed data have the largest possible variance. The size of the variance describes how much information a variable carries. When we talk about the stability of something we often say we want to reduce variance, and a model with large variance is unstable. But for the data used in machine learning (mainly training data), variance is what makes the data meaningful: if all the input data were the same point, the variance would be 0, and many inputs would be equivalent to a single one. Take the following picture as an example:

Suppose a camera records the motion of an object, and the points in the picture show the positions of the object. If we want to fit these points with a straight line, which direction would we choose? The line marked signal in the figure, of course. If we simply project the points onto the x-axis or the y-axis, the variances along the two axes are similar (the points trend at roughly 45 degrees, so their projections onto either axis look alike), and looking at the points in the original x-y coordinate system makes it easy to miss their true direction. But if we change the coordinate system so that the horizontal axis is the signal direction and the vertical axis is the noise direction, it becomes easy to see in which direction the variance is large and in which it is small.

In general, the direction of large variance is the signal and the direction of small variance is the noise. In data mining and digital signal processing we often want to improve the signal-to-noise ratio (SNR). For the picture above, if we keep only the data along the signal direction, we still get a good approximation of the original data.
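The camera example can be imitated with synthetic data. In the sketch below the points are generated by me (motion along a 45-degree line plus small perpendicular noise; the spreads 3.0 and 0.3 are arbitrary), so the numbers only illustrate the argument and do not reproduce the original figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: motion along a 45-degree line with small perpendicular noise.
signal = rng.normal(0.0, 3.0, size=500)
noise = rng.normal(0.0, 0.3, size=500)
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)        # signal axis
perpendicular = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # noise axis
points = np.outer(signal, direction) + np.outer(noise, perpendicular)

# In the original x-y coordinates the two variances look alike.
var_x, var_y = points.var(axis=0)

# In the rotated coordinates the variance splits into signal and noise.
var_signal = (points @ direction).var()
var_noise = (points @ perpendicular).var()
snr = var_signal / var_noise

print("variance along x and y:", var_x, var_y)
print("variance along signal and noise:", var_signal, var_noise)
print("signal-to-noise ratio:", snr)
```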

All PCA does, simply put, is find a set of mutually orthogonal axes in the original space. The first axis is the direction of greatest variance; the second axis is the direction of greatest variance within the plane orthogonal to the first axis; the third axis is the direction of greatest variance within the subspace orthogonal to the first and second axes; and so on. In an N-dimensional space we can find N such axes. Taking the first r of them to approximate the space compresses an N-dimensional space into an r-dimensional one, and the r axes are chosen so that the compression loses as little of the data as possible.
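Stated as a formula, this search for axes can be sketched as follows. This is the standard variational form of PCA rather than something taken from the original article, and the notation is my own: $A$ is the centered data matrix with one sample per row ($m$ rows), and $w_k$ is the k-th axis, a unit vector.

$$w_1 = \arg\max_{\|w\| = 1} \frac{1}{m} \|A w\|^2, \qquad w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} \frac{1}{m} \|A w\|^2$$

Since $\frac{1}{m}\|A w\|^2 = w^T \left(\frac{1}{m} A^T A\right) w$ is the variance of the data projected onto $w$, the maximizers are the eigenvectors of $A^T A$ taken in order of decreasing eigenvalue.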

Suppose each row of our matrix represents a sample and each column represents a feature. In matrix language, we apply a change of axes to an m × n matrix A, where P is the matrix that transforms one n-dimensional space into another n-dimensional space, applying changes such as rotation and stretching:

$$A_{m \times n} P_{n \times n} = \tilde{A}_{m \times n}$$

If instead we transform the m × n matrix A into an m × r matrix, the original n features become r features (r < n). These r features are a distillation of the n original ones, and we call this feature compression. In mathematical language:

$$A_{m \times n} P_{n \times r} = \tilde{A}_{m \times r}$$

But how does this relate to SVD? As mentioned before, the singular vectors produced by SVD are also ordered by singular value from largest to smallest. From the PCA point of view, the axis of largest variance is the first singular vector, the axis of second-largest variance is the second singular vector, and so on. Recall the SVD equation we obtained earlier, now written for an m × n matrix A:

$$A_{m \times n} \approx U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$$

Multiply both sides on the right by the matrix V. Because the columns of V are orthonormal, $V^T$ times V gives the identity matrix I, so the equation becomes:

$$A_{m \times n} V_{n \times r} \approx U_{m \times r} \Sigma_{r \times r} V^T_{r \times n} V_{n \times r} = U_{m \times r} \Sigma_{r \times r}$$

Compare this with the earlier equation A × P, which transformed the m × n matrix into an m × r matrix: V here is in fact P, the matrix of transformation vectors. This compresses an m × n matrix into an m × r matrix, that is, it compresses the columns. What if we want to compress the rows (from the PCA point of view, row compression can be understood as merging similar samples or removing samples of little value)? We can write down a general row-compression example in the same way:

$$P_{r \times m} A_{m \times n} = \tilde{A}_{r \times n}$$

This compresses a matrix of m rows into a matrix of r rows. The same works with SVD: multiply both sides of the SVD equation on the left by U', the transpose of U:

$$U^T_{r \times m} A_{m \times n} \approx \Sigma_{r \times r} V^T_{r \times n}$$

This gives the formula for compressing rows. As you can see, PCA is almost just a wrapper around SVD: once we have implemented SVD we have also implemented PCA. Better still, with SVD we get PCA in both directions, whereas an eigenvalue decomposition of A'A gives PCA in only one direction.
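Both directions of compression can be checked with a few lines of NumPy. This is a minimal sketch on synthetic data (the sizes and the per-feature scales are made up); it assumes the columns of A have been centered, as PCA requires, and it also compares the axes from SVD with those from the eigenvalue decomposition of A'A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 4, 2                # samples, features, kept dimensions (made up)

# Synthetic data with very different spreads per feature, then centered.
A = rng.standard_normal((m, n)) @ np.diag([5.0, 2.0, 0.5, 0.1])
A = A - A.mean(axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

# Column compression: A V_r = U_r Sigma_r      (m x n  ->  m x r)
columns_compressed = A @ V_r
# Row compression:    U_r^T A = Sigma_r V_r^T  (m x n  ->  r x n)
rows_compressed = U_r.T @ A

# The eigenvalue route gives the same axes (up to sign): eigenvectors of A'A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
top = eigvecs[:, np.argsort(eigvals)[::-1][:r]]

variances = columns_compressed.var(axis=0)
print(columns_compressed.shape, rows_compressed.shape)
print("variance along the kept axes:", variances)
print("axes agree up to sign:", np.allclose(np.abs(top.T @ V_r), np.eye(r), atol=1e-6))
```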

IV. Singular values and latent semantic indexing (LSI):

Latent semantic indexing (LSI) is not like PCA, where the result of SVD can be used more or less directly, but it is still an algorithm that depends heavily on SVD. Mr. Wu mentioned it earlier in his article on matrix computation and classification in text processing:

"The three matrices have a very clear physical meaning. Each row of the first matrix X represents a class of words related in meaning; each nonzero element in it represents the importance (or relevance) of a word within that class, and the larger the value, the more relevant. Each column of the last matrix Y represents a class of articles on the same topic, and each element represents the relevance of an article within that class. The middle matrix represents the correlation between the classes of words and the classes of articles. Therefore, we only need to perform one singular value decomposition of the correlation matrix A to complete the classification of synonyms and the classification of articles at the same time (and also obtain the relevance between each class of articles and each class of words)."

The passage above may not be easy to understand, but it is the essence of LSI. I will illustrate with an example below, taken from the LSA tutorial whose URL is given in the references at the end:

This is again a matrix, but a slightly different one: here each row shows which titles a word appears in (a row is what we earlier called a feature), and each column shows which words a title contains. (This matrix is the transpose of the one-sample-per-row form used earlier, which swaps the meanings of the left and right singular vectors but does not affect the computation.) For example, title T1 contains the four words guide, investing, market, and stock, each appearing once. Applying SVD to this matrix gives the following matrices:

The left singular vectors represent characteristics of the words, the right singular vectors represent characteristics of the documents, and the singular value matrix in the middle gives the importance of each pairing of a left singular vector with a right singular vector: the larger the number, the more important.

Looking further at these matrices reveals some interesting things. First, the first column of the left singular vectors reflects how frequently each word occurs. The relationship is not linear, but it serves as a rough description: for example, book has 0.15 and appears 2 times in the documents, investing has 0.74 and appears 9 times, and rich has 0.36 and appears 3 times.

Second, the first row of the right singular vectors reflects approximately how many words each document contains: for example, T6 has 0.49 and contains 5 words, while T2 has 0.22 and contains 2 words.

Going back, we can take the last 2 dimensions of the left and right singular vectors (the matrices above have 3 dimensions) and project them onto a plane, which gives:

In the graph, each red dot represents a word and each blue dot represents a document, so we can cluster the words and documents. For example, stock and market can be put in one class because they always appear together, and real and estate can be put in another; words such as dads and guide look somewhat isolated, so we do not merge them. Clustering in this way extracts the sets of synonyms in the document collection, so that when a user searches, retrieval happens at the semantic level (sets of synonyms) rather than at the level of individual words as before. This has two benefits. First, it reduces our retrieval and storage costs, because compressing the document collection this way is similar to PCA. Second, it improves the user experience: when the user enters a word, we can also search within that word's set of synonyms, which a traditional index cannot do.
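The whole procedure fits in a few lines. The sketch below uses a tiny word-by-title count matrix that I made up for illustration (it is not the tutorial's data): stock and market always occur together, and so do real and estate. It keeps three dimensions, drops the first one, and measures distances between words on the resulting plane.

```python
import numpy as np

# Hypothetical toy corpus (not the tutorial's data): rows are words, columns are titles.
words = ["stock", "market", "real", "estate", "investing", "guide"]
titles = ["T1", "T2", "T3", "T4", "T5"]
counts = np.array([
    [1, 1, 0, 0, 1],   # stock
    [1, 1, 0, 0, 1],   # market (always co-occurs with stock)
    [0, 0, 1, 1, 0],   # real
    [0, 0, 1, 1, 0],   # estate (always co-occurs with real)
    [1, 0, 1, 0, 1],   # investing
    [0, 1, 0, 1, 0],   # guide
], dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# The first dimension roughly tracks how often each word occurs.
first_vs_frequency = list(zip(words, np.abs(U[:, 0]).round(2), counts.sum(axis=1)))

# Keep 3 dimensions, then drop the first: each word and title becomes a 2-D point.
word_xy = {w: U[i, 1:3] for i, w in enumerate(words)}
title_xy = {t: Vt[1:3, j] for j, t in enumerate(titles)}

def distance(a, b):
    """Euclidean distance between two words on the 2-D plane."""
    return float(np.linalg.norm(word_xy[a] - word_xy[b]))

print(first_vs_frequency)
print("stock - market:", distance("stock", "market"))
print("real - estate: ", distance("real", "estate"))
print("stock - real:  ", distance("stock", "real"))
```

Words that always appear together land on the same point, while unrelated words end up apart, which is what makes the clustering into synonym sets possible. A real system would work with far larger matrices and usually with weighted counts rather than raw ones.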

With this description in mind, does SVD seem clearer when you go back and read Mr. Wu's article? :-D

References:

1) A Tutorial on Principal Component Analysis, Jonathon Shlens
This is my main reference for using SVD to do PCA.
2) http://www.ams.org/samplings/feature-column/fcarc-svd
A good conceptual introduction to SVD; my first few pictures are from here.
3) http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
Another good introduction to SVD.
4) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html
A good article on SVD and LSI; the LSI example above is taken from it.
5) http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
Another article on SVD and LSI; also good, deeper, and longer.
6) Singular Value Decomposition and Principal Component Analysis, Rasmus Elsborg Madsen, Lars Kai Hansen and Ole Winther, 2004
Similar to the article in 1).
