SVD Application: Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) is an algorithm that relies heavily on SVD. This article is adapted from Wu Jun's "The Beauty of Mathematics" and the reference document "Mathematics in Machine Learning".

------------

In natural language processing, the two most common classification problems are classifying texts by topic (for example, grouping all news about the Asian Games under sports) and classifying the words of a vocabulary by meaning (for example, grouping the names of various sports into one class). Both classification problems can be solved satisfactorily, and simultaneously, through matrix operations. To illustrate how matrix tools solve these two problems, let us first review the cosine-theorem approach to news classification.

The key to classification is computing relevance. We first build, for each of the two texts, a vector of its content words (the real words), and then measure the angle between the two vectors. When the angle is near zero, the two news articles are related; when the vectors are perpendicular (orthogonal), they are unrelated. The cosine of the angle is computed from the inner product of the vectors. In theory this algorithm is very good; in practice, the computation takes a particularly long time. There are usually a large number of articles to process, at least one million, and the vocabulary is very long, say 500,000 words (including person names, place names, and product names). To find all articles on the same topic by comparing 1,000,000 articles in pairs, about 500 billion comparisons are needed. A computer today can compare at most about one thousand pairs of articles per second, so comparing the relevance of these 1,000,000 articles would take about 15 years. And note that the above computation must be repeated several times to truly complete the classification.
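As a minimal sketch of one such pairwise comparison (Python with NumPy; the two TF-IDF vectors and their values are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Hypothetical TF-IDF vectors of two articles over a 5-word vocabulary.
doc1 = np.array([0.0, 1.2, 0.0, 0.8, 0.5])
doc2 = np.array([0.0, 0.9, 0.1, 1.0, 0.0])

print(cosine_similarity(doc1, doc2))  # near 1.0: related; near 0.0: unrelated
```

Running this once is cheap; running it for every pair of a million articles is what produces the 500-billion-comparison cost above.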

In text classification, another approach is to use Singular Value Decomposition (SVD) from matrix algebra. Let us look at how Singular Value Decomposition works. First, we use a large matrix A to describe the associations between these 1,000,000 articles and 500,000 words. In this matrix, each row corresponds to an article and each column corresponds to a word.

[Figure: the M × N article-word matrix A]
In the preceding figure, M = 1,000,000 and N = 500,000. The element in row i, column j is the weighted frequency (for example, the TF-IDF value) of the j-th word of the dictionary in the i-th article. Readers may have noticed that this matrix is very large: 1,000,000 times 500,000, that is, 500 billion elements.
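A sketch of how such a matrix could be built on a small corpus, assuming scikit-learn's TfidfVectorizer as the weighting (the three toy articles are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented articles standing in for the million real ones.
corpus = [
    "asian games sports news swimming athletics",
    "stock market rises while real estate cools",
    "fans follow the asian games medal table",
]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(corpus)       # sparse matrix: rows = articles, columns = words

print(A.shape)                             # (3, vocabulary size)
print(vectorizer.get_feature_names_out())  # the word behind each column
```

At full scale the matrix is stored sparse, since any one article contains only a tiny fraction of the 500,000 words.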

Singular Value Decomposition splits the preceding large matrix into the product of three small matrices, as shown in the figure below. For example, the matrix in the example above is decomposed into a matrix X of 1,000,000 by 100, a matrix B of 100 by 100, and a matrix Y of 100 by 500,000. The total number of elements in these three matrices is only about 150 million, roughly 1/3000 of the original; the corresponding storage and computation shrink by more than three orders of magnitude.

[Figure: A decomposed into the product of three small matrices X, B, and Y]
The three matrices have very clear physical meanings. Each column of the first matrix X represents one class of topics; each non-zero element gives the correlation between an article (a row) and that topic, and the larger the value, the stronger the correlation. Each row of the last matrix Y represents one of 100 semantic classes of words; the elements of that row give the correlation between the class and each of the 500,000 words. The matrix B in the middle expresses the correlation between the article topics and the word classes. Therefore, we only need to perform one Singular Value Decomposition of the association matrix A, and we complete both the near-synonym classification and the document classification at the same time (and also obtain the correlation between each class of articles and each class of words).
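A runnable sketch of the decomposition, using SciPy's truncated SVD (svds) on a small random stand-in for A (the shapes mirror the article's X, B, and Y):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((1000, 500))       # stand-in for the 1,000,000 x 500,000 matrix

k = 100                           # number of retained topics/concepts
U, s, Vt = svds(A, k=k)           # A ~= U @ np.diag(s) @ Vt

X, B, Y = U, np.diag(s), Vt       # article-topic, topic weights, class-word
print(X.shape, B.shape, Y.shape)  # (1000, 100) (100, 100) (100, 500)

# Reading off the interpretation described above:
article = 0
top_topics = np.argsort(-np.abs(X[article]))[:3]  # strongest topics of article 0
topic = int(top_topics[0])
top_words = np.argsort(-np.abs(Y[topic]))[:5]     # words most tied to that topic
print(top_topics, top_words)
```

On a real corpus the indices in top_words would map back to dictionary entries, giving the word classes directly.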

For example, if we reduce to two dimensions (rank = 2), the document-term relationship can be displayed in the following two-dimensional graph:

[Figure: words (red points) and documents (blue points) plotted in the rank-2 latent space]

In the graph, each red point represents a word and each blue point represents a document, so we can cluster the words and the documents. For example, stock and market can be put in one class, because they always appear together; real and estate can be put in one class; dads and guide look somewhat isolated, so we do not merge them. From this clustering we can extract the synonym sets in the document collection, so that when users retrieve documents, retrieval works at the semantic level (synonym sets) rather than at the level of individual words as before. First, this reduces retrieval and storage cost, because the document collection is compressed in the same spirit as PCA. Second, it improves the user experience: when a user enters one word, we can match against the whole synonym set containing that word, which is not possible with a traditional index.
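A sketch of the rank-2 projection and the clustering step (again on a random stand-in matrix; on a real corpus this is where stock/market style groupings would emerge):

```python
import numpy as np
from scipy.sparse.linalg import svds
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
A = rng.random((50, 30))        # stand-in document-term matrix

U, s, Vt = svds(A, k=2)         # rank-2 LSI
docs_2d = U * s                 # blue points: one row per document
words_2d = (Vt * s[:, None]).T  # red points: one row per word

# Words whose latent coordinates nearly coincide form candidate synonym sets.
centroids, labels = kmeans2(words_2d, 5, seed=2)
print(labels)                   # words sharing a label go into one set
```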


The only question left is how to perform Singular Value Decomposition on a computer. Here many concepts from linear algebra, such as matrix eigenvalues, and various numerical analysis algorithms come into play. For a long time, Singular Value Decomposition could not be parallelized. (Although Google already had MapReduce and other parallel computing tools, Singular Value Decomposition is difficult to split into independent sub-operations, so for a long time even Google could not use the advantages of parallel computing to decompose the matrix.) Recently, Dr. Zhang Zhiwei of Google China, together with several Chinese engineers and interns, implemented a parallel Singular Value Decomposition algorithm. I think this is one of Google China's contributions to the world.


Finally, let me add my personal view. We can think of a latent-semantics layer inserted between document and term (word): the X and Y matrices then express, respectively, the correlation of each document with each latent semantic class, and the correlation of each latent semantic class with each term. The sizes of X and Y are m × r and r × n respectively, where r is the rank of the matrix A. Finally, B is the r × r diagonal matrix formed by the r singular values of A, whose squares are the r non-zero eigenvalues of AᵀA in its spectral decomposition.
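Restating that dimension bookkeeping as a formula (standard SVD notation; r is the rank of A):

```latex
A_{m \times n} = X_{m \times r}\, B_{r \times r}\, Y_{r \times n},
\qquad
B = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),
\qquad
\sigma_i^2 = \lambda_i\!\left(A^{\top} A\right).
```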



For more discussions and exchanges on NLP, linear algebra, and ML, stay tuned to this blog and to Rachel Zhang on Sina Weibo.



