Early Topic Model Methods: LSA (Latent Semantic Analysis) and SVD


LSA and SVD

The purpose of LSA (Latent Semantic Analysis) is to discover the hidden semantic dimensions in text, the "topics" or "concepts". In the vector space model (VSM), a document is represented as a multidimensional vector whose components are the occurrence weights of its feature terms (e.g. tf-idf values). The advantages of this representation are that a query and a document become vectors in the same space, so their similarity can be computed directly, and that different terms can be assigned different weights. It has been widely used in text retrieval, classification, and clustering; for instance, the JAVA Implementation of newsgroup18828 text classifier based on Bayesian algorithm and KNN algorithm and the JAVA Implementation of newsgroup18828 text clustering tool based on Kmeans algorithm, MBSAS algorithm and DBSCAN algorithm both use the vector space model. However, the VSM cannot handle synonymy and polysemy. Synonyms are represented as independent dimensions, so when the cosine similarity between vectors is computed, the similarity the user expects is underestimated; conversely, a polysemous term always maps to the same dimension regardless of its sense, so the computed similarity overestimates what the user expects.
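As a toy illustration of the synonymy problem, here is a minimal Python sketch with a made-up three-word vocabulary and raw term counts standing in for tf-idf weights:

import numpy as np

# Vocabulary: [car, automobile, engine] -- "car" and "automobile" are
# synonyms but occupy independent dimensions in the VSM.
doc1 = np.array([1.0, 0.0, 1.0])  # "car engine"
doc2 = np.array([0.0, 1.0, 1.0])  # "automobile engine"

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Prints 0.5: lower than a user would expect for two documents
# that mean essentially the same thing.
print(cosine(doc1, doc2))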

 

The introduction of the LSA method can alleviate these problems. Based on SVD decomposition, we construct a low-rank approximation of the original vector matrix. The specific approach is to apply an SVD decomposition to the term-document matrix.
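The equation image is lost here; what it showed is the standard SVD of the t x d term-document matrix C, reconstructed below:

C = U \Sigma V^{T}, \qquad U \in \mathbb{R}^{t \times r},\ \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),\ V \in \mathbb{R}^{d \times r}

where r is the rank of C, the columns of U and V are orthonormal, and \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 are the singular values.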


The matrix uses terms as rows and documents as columns, giving t rows and d columns in total; its elements are the tf-idf values of the terms. After the SVD decomposition, the k largest of the r singular values are retained and the smallest r - k singular values are set to 0. Recomputing the product then yields an approximate, rank-k decomposition of the matrix.
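Reconstructing the truncated form that the missing equation image showed:

C \approx C_k = U_k \Sigma_k V_k^{T}, \qquad \Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)

where U_k and V_k keep only the first k columns of U and V.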


In the least-squares sense, C_k is the best approximation of C among matrices of rank at most k: it contains at most k non-zero singular values, so its rank cannot exceed k. Through this truncated SVD, the original vectors are mapped into a low-dimensional latent semantic space, which acts as feature dimensionality reduction. Each singular value corresponds to the weight of one "semantic" dimension; the less important weights are set to 0, only the most important dimensions are retained, and some of the "noise" in the data is removed, so we obtain a better representation of the documents. For a JAVA implementation that applies SVD decomposition and dimensionality reduction to document clustering, see this article.
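A minimal NumPy sketch of this pipeline (toy counts stand in for tf-idf values; a production system would use a sparse, partial SVD):

import numpy as np

# Toy term-document matrix (t=4 terms, d=3 documents); in practice the
# entries would be tf-idf weights, as described above.
C = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Full SVD: C = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation C_k).
k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents in the k-dimensional latent semantic space: each column of
# diag(s_k) @ Vt_k is one document's coordinates on the latent "topics".
docs_latent = np.diag(s[:k]) @ Vt[:k, :]   # shape (k, d)
print(docs_latent.T)                        # each row = one document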

 

An illustration of SVD-based compression:

For example, suppose Data is a 32*32 image matrix. After SVD decomposition, only the two largest singular values are kept.

U is then a 32*2 matrix and VT a 2*32 matrix, plus the two singular values, so the total number of stored values is 64 + 64 + 2 = 130.

Compared with the original 1024 = 32*32 values, this is a compression ratio of roughly 8 (1024 / 130 ≈ 7.9).

[Peter Harrington, Machine Learning in Action]
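A short NumPy sketch of the book's storage arithmetic (a random matrix stands in here for the 32*32 handwritten-digit image used in the book, so the rank-2 reconstruction itself is poor; the point is the storage count):

import numpy as np

np.random.seed(0)
img = (np.random.rand(32, 32) > 0.5).astype(float)  # stand-in image

U, s, Vt = np.linalg.svd(img)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 reconstruction

# Storage: U (32x2) + 2 singular values + VT (2x32) = 64 + 2 + 64 = 130
stored = U[:, :k].size + k + Vt[:k, :].size
print(stored, "values instead of", img.size)  # 130 instead of 1024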

 

An example of SVD dimensionality reduction given in IIR (Introduction to Information Retrieval) is as follows (figure not reproduced here):

On the left is the SVD decomposition of the original matrix; on the right, only the largest singular values are retained, reducing the matrix to two dimensions.

PS:
Although SVD-based LSA has achieved some success, it lacks a rigorous mathematical-statistical foundation, and computing the SVD is very time-consuming.

At SIGIR '99, Hofmann proposed the PLSA model, which is grounded in probability and statistics and whose parameters are learned with the EM algorithm.
