Copyright:
This article was written by LeftNotEasy and published at http://leftnoteasy.cnblogs.com. It may be reproduced in full or used in part, but please credit the source. If there is any problem, please contact wheeleast@gmail.com.
Preface:
Last time I wrote about PCA and LDA. There are two ways to implement PCA: one uses eigenvalue decomposition and the other uses singular value decomposition (SVD). The previous article explained PCA in terms of eigenvalue decomposition. Most people only meet eigenvalues and singular values in pure mathematical exercises, and courses on linear algebra or matrix theory rarely mention any application background for them. Yet singular value decomposition is a method with an obvious physical meaning: it represents a complicated matrix as the product of smaller, simpler matrices, and these small matrices describe the important properties of the original matrix. It is like describing a person: I tell you that this person has striking features, a square face, a full beard, and black-framed glasses. With just these few characteristics, you form a clear picture in your mind, even though a human face has countless features. The reason we can describe people this way is that humans are very good at extracting important features; SVD is an important method for letting machines learn to extract important features.
In the field of machine learning, a considerable number of applications are connected to singular values: for example, PCA for feature reduction, data compression algorithms (image compression being a representative example), and LSI (Latent Semantic Indexing) for semantic-level retrieval in search engines.
Let me also complain here: when I previously searched for SVD on Baidu, the results were all about a Russian sniper rifle (from the same era as the AK-47), because there is a sniper rifle named SVD in the game CrossFire; searching on Google, however, returns Singular Value Decomposition (mostly in English). If you want to play war games, isn't COD good enough? Why play a knockoff CS? The voice of the Chinese web is dominated by posts with little substance. I sincerely hope the atmosphere here becomes more serious: that game developers genuinely love making games and data mining people genuinely love digging into data, not just doing it for a living; only then does it make sense to talk more than other people. There are too few Chinese articles that discuss technology in a down-to-earth way. To change that, let it start with me.
As mentioned above, this article mainly focuses on some properties of singular values. I will also touch on how singular values are computed, but I do not plan to expand much on that. The linear algebra used here is not very deep, but if you have completely forgotten linear algebra, this article may be difficult to read.
I. Basics of singular values and eigenvalues:
Eigenvalue decomposition and singular value decomposition are both methods you see everywhere in machine learning, and the two are closely related. As I will explain below, eigenvalue decomposition and singular value decomposition share the same goal: extracting the most important features of a matrix. Let's start with eigenvalue decomposition:
1) Eigenvalues:
If a vector v is an eigenvector of a square matrix A, it can be written in the following form:

A v = λ v
Here λ is called the eigenvalue corresponding to the eigenvector v. For a symmetric matrix, the eigenvectors form a set of orthogonal vectors. Eigenvalue decomposition means factoring a matrix into the following form:

A = Q Σ Q^(-1)
Here Q is a matrix whose columns are the eigenvectors of A, and Σ is a diagonal matrix whose diagonal entries are the eigenvalues. I have drawn on a few references to illustrate what this means. First, be clear that a matrix is really a linear transformation, because multiplying a matrix by a vector is equivalent to applying a linear transformation to that vector. For example, take the following matrix:

M = [ 3  0
      0  1 ]
This corresponds to a linear transformation of the plane, because multiplying the matrix M by a vector (x, y) gives:

M (x, y)' = (3x, y)'
The matrix above is symmetric, so this transformation is a stretch along the x and y axes (each element on the diagonal stretches one dimension: a value greater than 1 makes it longer, a value less than 1 makes it shorter). When the matrix is not symmetric, say a matrix like the following:

M = [ 1  1
      0  1 ]
The transformation it describes stretches the plane along one direction (as shown by the blue arrow in the figure). The blue arrow is the most important direction of change (there may be more than one direction of change). If we want to describe a transformation, we can just describe its main directions of change. Now look back at the eigenvalue decomposition formula: the matrix Σ we obtain is a diagonal matrix with the eigenvalues arranged from large to small, and the eigenvectors corresponding to those eigenvalues describe the directions of change of the matrix (from the most important change to the less important ones).
When a matrix is high-dimensional, it is a linear transformation of a high-dimensional space. Such a transformation cannot be drawn, but you can imagine that it still has many directions of change. The first N eigenvectors obtained from eigenvalue decomposition correspond to the N most important directions of change of the matrix, and we can approximate the matrix (the transformation) using those N directions. That is exactly what was said earlier: extract the most important features of the matrix. To sum up, eigenvalue decomposition gives eigenvalues and eigenvectors; the eigenvalue says how important a feature is, and the eigenvector says what that feature is. Each eigenvector can be understood as a linear subspace, and we can do a lot of things with these subspaces. However, eigenvalue decomposition also has many limitations; for example, the matrix being transformed must be a square matrix.
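To make this concrete, here is a minimal numpy sketch (my own toy example, not code from this post): it decomposes a small symmetric matrix and rebuilds an approximation from only its largest eigenvalues.

```python
import numpy as np

# A symmetric square matrix (made up for illustration).
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])

# Eigendecomposition: A = Q * diag(w) * Q^(-1); for a symmetric matrix Q is
# orthogonal, so Q^(-1) is simply Q'.
w, Q = np.linalg.eigh(A)          # eigenvalues come back in ascending order
order = np.argsort(w)[::-1]       # re-sort from the largest eigenvalue down
w, Q = w[order], Q[:, order]
assert np.allclose(A, Q @ np.diag(w) @ Q.T)

# Keep only the k most important directions of change.
k = 2
A_approx = Q[:, :k] @ np.diag(w[:k]) @ Q[:, :k].T
print(np.linalg.norm(A - A_approx))   # small: the top-k directions dominate
```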
(After talking about eigenvalue decomposition for so long, I wonder whether I've made it clear. Please leave comments and suggestions.)
2) Singular values:
Now let's talk about singular value decomposition. Eigenvalue decomposition is a good way to extract the features of a matrix, but it only works for square matrices. In the real world, most of the matrices we see are not square. For example, if there are N students and each student has M subject scores, the resulting N*M matrix cannot be square. How do we describe the important features of such an ordinary matrix? Singular value decomposition can do exactly that; it is a decomposition method that applies to any matrix:

A = U Σ V'
Assume A is an N*M matrix. Then U is an N*N matrix (its columns are orthonormal; the vectors in U are called the left singular vectors), Σ is an N*M matrix (all entries are zero except those on the diagonal, which are called the singular values), and V' (V transposed) is an M*M matrix (its rows are also orthonormal; the vectors in V are called the right singular vectors). Written with the dimensions attached, the product is A(N*M) = U(N*N) · Σ(N*M) · V'(M*M), which makes the relative sizes of the factors easy to picture.
So how do the singular values correspond to the eigenvalues? First, multiply the transpose of A by A to obtain the square matrix A'A, and compute its eigenvalues:

(A'A) v_i = λ_i v_i

The v obtained here are the right singular vectors mentioned above. In addition, we get:

σ_i = sqrt(λ_i),    u_i = (1/σ_i) · A v_i
Here σ is the singular value and u is the left singular vector mentioned above. Like the eigenvalues, the singular values in Σ are arranged from large to small, and they decrease especially fast: in many cases the sum of the first 10% or even 1% of the singular values accounts for more than 99% of the total. In other words, we can use the largest r singular values to approximately describe the matrix. This is the partial (truncated) singular value decomposition:
A(N*M) ≈ U(N*r) · Σ(r*r) · V'(r*M)

Here r is a number much smaller than N and M. The product of the three matrices on the right is a matrix close to A, and the closer r is to N, the closer the product is to A. The combined area of these three matrices (in storage terms, the smaller the matrix area, the less storage space needed) is much smaller than that of the original matrix A, so if we want to store the original matrix A in compressed form, we only need to keep the three matrices U, Σ, and V.
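As a quick check of the relationship above and of how well a few singular values approximate a matrix, here is a small numpy sketch (my own example; the matrix is random data, not anything from this post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60)) @ rng.standard_normal((60, 60))

# Full (thin) SVD: A = U * diag(s) * V'
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The singular values are the square roots of the eigenvalues of A'A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(np.clip(eigvals, 0.0, None)))

# Keep only the largest r singular values: A ≈ U_r * diag(s_r) * V_r'
r = 10
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
stored = U[:, :r].size + r + Vt[:r, :].size        # numbers we actually keep
print(stored, A.size)                              # far fewer than the full matrix
print(np.linalg.norm(A - A_r) / np.linalg.norm(A)) # relative approximation error
```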
II. Singular value computation:
Computing the singular values is a hard problem: it is an O(N^3) algorithm. On a single machine this is fine; Matlab can compute all the singular values of a 1000*1000 matrix in about a second. But as the matrix grows, the cost grows cubically and parallel computation becomes necessary. Google's Wu Jun, in his series "The Beauty of Mathematics", mentioned when discussing SVD that Google has implemented a parallel SVD algorithm, calling it a contribution to humanity; but he did not give the scale of the computation, nor much other valuable information.
In fact, SVD can still be parallelized. For large matrices one generally uses iterative methods, and when the matrix is very large (say hundreds of millions of rows), the number of iterations can also reach into the hundreds of millions. If Map-Reduce is used to solve it, every Map-Reduce round involves writing and reading files. My personal guess is that Google's computing infrastructure includes, besides Map-Reduce, a computing model similar to MPI, in which the nodes keep communicating and the data stays resident in memory; such a model is much faster than Map-Reduce when there are many iterations.
The Lanczos iteration is a method for computing part of the eigenvalues of a symmetric square matrix (as mentioned earlier, solving the eigenvalues of the symmetric matrix A'A yields the right singular vectors of A). It transforms a symmetric matrix into a tridiagonal matrix and then solves that. According to some documents on the web, Google probably uses this approach for its singular value decomposition; see some of the papers cited on Wikipedia. If you understand those papers, you can implement an SVD yourself.
Computing singular values is a dry, purely mathematical process, and earlier research results (in the papers) give essentially the whole program flow. For more on computing the SVD, see the references at the end; I will not go deeper here and will focus instead on the applications of singular values.
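For completeness, here is a small sketch of how a partial SVD is typically computed in practice with an iterative Lanczos/ARPACK-style solver. This is scipy on a single machine and my own example; it is not Google's parallel implementation.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A large, very sparse random matrix standing in for real data.
A = sparse_random(100_000, 5_000, density=1e-4, format="csr", random_state=0)

# Compute only the k leading singular triplets with an iterative solver
# instead of a full O(n^3) decomposition.
k = 20
U, s, Vt = svds(A, k=k)

# svds returns singular values in ascending order; flip to largest-first.
order = np.argsort(s)[::-1]
s, U, Vt = s[order], U[:, order], Vt[order, :]
print(s[:5])
```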
III. Singular values and principal component analysis (PCA):
My previous article on principal component analysis also mentioned using SVD to solve PCA. The PCA problem is really a change of basis that makes the transformed data have the largest possible variance. Variance measures how much information a variable carries. When we talk about the stability of something, we often want to reduce its variance: a model with large variance is an unstable model. For the data we use in machine learning (mainly training data), however, large variance is meaningful; otherwise, if all the input data were the same point, the variance would be 0 and many input points would be equivalent to a single one. The figure below gives an example:
Suppose a camera records the motion of an object; the points in the figure are the positions of the object. If we want to fit these points with a line, which direction should we choose? Naturally, the line marked "signal" in the figure. If we simply project the points onto the x axis or the y axis, the variances along x and y are similar (because the points trend at roughly 45 degrees, so projecting onto either axis gives about the same spread), and looking at the points in the original x-y coordinate system does not easily reveal their true direction. But if we change coordinates, making the horizontal axis the signal direction and the vertical axis the noise direction, it becomes easy to see which direction has large variance and which has small variance.
In general, the direction with large variance is the signal direction and the direction with small variance is the noise direction. In data mining or digital signal processing we often need to increase the ratio of signal to noise, i.e., the signal-to-noise ratio. For the example above, if we keep only the data along the signal direction, we can still approximate the original data quite well.
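A tiny numpy sketch (with made-up data, in the spirit of the camera example above) makes the point about variance directions concrete:

```python
import numpy as np

# Points spread along a 45-degree "signal" direction with small
# perpendicular "noise" (invented data for illustration).
rng = np.random.default_rng(2)
signal = rng.standard_normal(1000) * 3.0
noise = rng.standard_normal(1000) * 0.3
pts = np.column_stack([signal + noise, signal - noise]) / np.sqrt(2)

# Projected onto the original x and y axes the variances look similar...
print(pts[:, 0].var(), pts[:, 1].var())

# ...but along the signal/noise axes the difference is dramatic.
u = np.array([1.0, 1.0]) / np.sqrt(2)    # signal direction
v = np.array([1.0, -1.0]) / np.sqrt(2)   # noise direction
print((pts @ u).var(), (pts @ v).var())
```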
All PCA does is find a sequence of mutually orthogonal axes in the original space: the first axis maximizes the variance; the second axis maximizes the variance among all directions orthogonal to the first; the third axis maximizes the variance among directions orthogonal to the first two; and so on. In an N-dimensional space we can find N such axes. If we take the first r of them to approximate the space, we compress the N-dimensional space down to an r-dimensional one, and the r axes we chose minimize the data loss caused by this compression.
Let's say it in matrix language. Assume each row of the matrix represents a sample and each column represents a feature. Changing the coordinate axes of an m*n matrix A means multiplying it by an n*n transformation matrix P:

A(m*n) · P(n*n) = Ã(m*n)

P maps one n-dimensional space to another n-dimensional space, which within that space amounts to rotations, stretches, and so on.
If instead we transform the m*n matrix A into an m*r matrix, the n features are turned into r features (r < n); these r features are a refined compression of the original n features, and we call this feature compression. In mathematical language:

A(m*n) · P(n*r) = Ã(m*r)
But how does this relate to SVD? The singular vectors obtained from SVD are arranged from the largest singular value to the smallest. From the PCA point of view, the axis with the largest variance is the first singular vector, the axis with the second-largest variance is the second singular vector, and so on. Recall the SVD formula we obtained earlier:

A = U Σ V'
Multiply both sides on the right by the matrix V. Because V is an orthogonal matrix, V' multiplied by V gives the identity matrix I, so the formula becomes:

A V = U Σ
Compare this with A(m*n) · P(n*r) = Ã(m*r), which compressed the m*n matrix into an m*r matrix: if we keep only the first r columns of V, then

A(m*n) · V(n*r) = U(m*r) · Σ(r*r)

so V is exactly the transformation matrix P. This compresses an m*n matrix to an m*r matrix, that is, it compresses the columns. What if we want to compress the rows instead (in PCA terms, compressing rows can be understood as merging similar samples or discarding samples of little value)? In the same way, we can write the general form of row compression:

P(r*m) · A(m*n) = Ã(r*n)
This compresses a matrix with m rows down to r rows. We can do the same with SVD: multiply both sides of the SVD decomposition on the left by U' (the transpose of U, keeping only its first r columns):

U'(r*m) · A(m*n) = Σ(r*r) · V'(r*n)
This gives the row-compression formula. You can see that PCA is almost just a wrapper around SVD: if you have implemented SVD, you have implemented PCA as well. Better still, with SVD we get PCA in both directions, whereas decomposing the eigenvalues of A'A only gives PCA in one direction.
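Here is a minimal numpy sketch of PCA done through SVD, covering both directions of compression described above (my own toy data, not code from this post):

```python
import numpy as np

# Data matrix: rows are samples, columns are features (random toy data).
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 20))
A = A - A.mean(axis=0)                    # center each feature before PCA

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
# Column compression: A(m*n) * V(n*r) = U_r * Sigma_r, i.e. n features -> r features.
scores = A @ Vt[:r, :].T
assert np.allclose(scores, U[:, :r] * s[:r])

# Row compression: U_r'(r*m) * A(m*n) = Sigma_r * V_r', i.e. m samples -> r rows.
rows = U[:, :r].T @ A
assert np.allclose(rows, s[:r, None] * Vt[:r, :])

explained = (s[:r] ** 2).sum() / (s ** 2).sum()
print(f"the top {r} components keep {explained:.1%} of the variance")
```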
IV. Singular values and Latent Semantic Indexing (LSI):
Latent Semantic Indexing is not quite like PCA: at the very least, having an SVD implementation does not by itself give you LSI. Still, LSI is an algorithm that relies heavily on SVD. Mr. Wu described it in "Matrix Computation and Classification Problems in Text Processing":
"The three matrices have very clear physical meanings. Each row in the first matrix X represents a type of words related to the meaning, and each non-zero element represents the importance (or correlation) of each word in this type of words. The greater the value, the more relevant. Each column in The Last matrix Y represents an article of the same topic, and each element represents the relevance of each article in this article. The matrix in the middle represents the correlation between the class words and the article thunder. Therefore, we only need to perform A Singular Value Decomposition on correlated matrix A, and w can complete both the synonym classification and the document classification. (Get the correlation between each type of article and each type of Word )."
The paragraph above may not be easy to follow, but it is the essence of LSI. Let me give an example. The example below comes from the LSA tutorial; the URL is given in the references at the end:
This is also a matrix, but with a different convention: here a row represents a word (one dimension of the features described earlier), and a column represents a title (document). (This matrix is actually the transpose of the samples-as-rows form we used before; that swaps the meaning of our left and right singular vectors, but it does not affect the computation.) For example, the title T1 contains the four words guide, investing, market, and stock, each appearing once. Applying SVD to this matrix, we obtain the following matrices:
The left singular vectors describe some properties of the words, the right singular vectors describe some properties of the documents, and the singular value matrix in the middle indicates how important a row of the left singular vectors paired with a column of the right singular vectors is: the larger the number, the more important.
Looking at these matrices we can also find some interesting things. First, the first column of the left singular vectors roughly tracks how often each word appears in the documents. The relationship is not linear, but it can be taken as a rough description: for example, book corresponds to 0.15 and appears in 2 documents, investing corresponds to 0.74 and appears in 9 documents, and rich corresponds to 0.36 and appears in 3 documents.
Second, the first row of the right singular vectors roughly tracks how many words appear in each document: for example, T6 corresponds to 0.49 and contains 5 words, while T2 corresponds to 0.22 and contains 2 words.
Then we take the last two dimensions of the left singular vectors and of the right singular vectors (out of the rank-3 matrices above) and project them onto a plane, which gives:
In the plot, each red point represents a word and each blue point represents a document. We can then cluster the words and the documents: for example, stock and market can be put into one class because they always appear together; real and estate can be put into one class; dads and guide look somewhat isolated, so we leave them unmerged. From such clustering we can extract the synonyms in the document collection, so that when users search, retrieval happens at the semantic level (sets of synonyms) rather than at the level of individual words as before. First, this reduces retrieval and storage cost, because the document collection is compressed, just as with PCA. Second, it improves the user experience: when a user enters a word, we can find the whole synonym set of that word, something a traditional index cannot do.
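To show the mechanics end to end, here is a small numpy sketch of the LSI projection. The word-by-title counts below are invented for illustration; they are not the actual numbers from the LSA tutorial.

```python
import numpy as np

# A made-up word-by-title count matrix: rows are words, columns are titles.
words  = ["book", "investing", "market", "stock", "real", "estate", "dads", "guide"]
titles = ["T1", "T2", "T3", "T4", "T5", "T6"]
X = np.array([[0, 1, 0, 1, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [1, 0, 1, 0, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0],
              [1, 0, 0, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep a rank-3 approximation and use its last two dimensions (indices 1 and 2)
# as plane coordinates for words and titles.
word_xy  = U[:, 1:3]        # each word as a point in the "concept" plane
title_xy = Vt[1:3, :].T     # each title as a point in the same plane

for w, (x, y) in zip(words, word_xy):
    print(f"{w:9s} {x:+.2f} {y:+.2f}")    # nearby words behave like one topic
for t, (x, y) in zip(titles, title_xy):
    print(f"{t:9s} {x:+.2f} {y:+.2f}")
```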
I don't know whether, described this way and read alongside Mr. Wu's article, SVD has become clearer to you. :-D
References:
1) A Tutorial on Principal Component Analysis, Jonathon Shlens
This is my main reference for using SVD for PCA.
2) http://www.ams.org/samplings/feature-column/fcarc-svd
A good introduction to the concepts behind SVD.
3) http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
Another article on SVD.
4) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html
An example of SVD and LSI.
5) http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
Another article on SVD and LSI; also good, deeper and longer.
6) Singular Value Decomposition and Principal Component Analysis, Rasmus Elsborg Madsen, Lars Kai Hansen and Ole Winther, 2004
Similar in content to reference 1).