Early Topic Model Methods: LSA (Latent Semantic Analysis) and SVD


LSA and SVD

The purpose of LSA (Latent Semantic Analysis) is to discover the hidden semantic dimensions in text, the "topics" or "concepts". In the vector space model (VSM), a document is represented as a multidimensional vector whose components are the occurrence weights of its feature terms (e.g. tf-idf values). The advantages of this representation are that a query and a document become vectors in the same space, so their similarity can be computed directly, and that different terms can be assigned different weights. It has been widely used in text retrieval, classification, and clustering; for instance, the JAVA Implementation of newsgroup18828 text classifier based on Bayesian algorithm and KNN algorithm and the JAVA Implementation of newsgroup18828 text clustering tool based on Kmeans algorithm, MBSAS algorithm and DBSCAN algorithm both use the vector space model. However, the VSM cannot handle synonymy and polysemy. Synonyms are represented as independent dimensions, so when the cosine similarity between vectors is computed, the similarity the user expects is underestimated; conversely, a polysemous term always maps to the same dimension regardless of its sense, so the computed similarity overestimates what the user expects.
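As a toy illustration of the synonymy problem, here is a minimal Python sketch with a made-up three-word vocabulary and raw term counts standing in for tf-idf weights:

import numpy as np

# Vocabulary: [car, automobile, engine] -- "car" and "automobile" are
# synonyms but occupy independent dimensions in the VSM.
doc1 = np.array([1.0, 0.0, 1.0])  # "car engine"
doc2 = np.array([0.0, 1.0, 1.0])  # "automobile engine"

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Prints 0.5: lower than a user would expect for two documents
# that mean essentially the same thing.
print(cosine(doc1, doc2))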

 

The introduction of the LSA method can alleviate these problems. Based on SVD decomposition, we construct a low-rank approximation of the original vector matrix. The specific approach is to apply an SVD decomposition to the term-document matrix.
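The equation image is lost here; what it showed is the standard SVD of the t x d term-document matrix C, reconstructed below:

C = U \Sigma V^{T}, \qquad U \in \mathbb{R}^{t \times r},\ \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),\ V \in \mathbb{R}^{d \times r}

where r is the rank of C, the columns of U and V are orthonormal, and \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 are the singular values.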


The matrix uses terms as rows and documents as columns, giving t rows and d columns in total; its elements are the tf-idf values of the terms. After the SVD decomposition, the k largest of the r singular values are retained and the smallest r - k singular values are set to 0. Recomputing the product then yields an approximate, rank-k decomposition of the matrix.
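Reconstructing the truncated form that the missing equation image showed:

C \approx C_k = U_k \Sigma_k V_k^{T}, \qquad \Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)

where U_k and V_k keep only the first k columns of U and V.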


In the least-squares sense, C_k is the best approximation of C among matrices of rank at most k: it contains at most k non-zero singular values, so its rank cannot exceed k. Through this truncated SVD, the original vectors are mapped into a low-dimensional latent semantic space, which acts as feature dimensionality reduction. Each singular value corresponds to the weight of one "semantic" dimension; the less important weights are set to 0, only the most important dimensions are retained, and some of the "noise" in the data is removed, so we obtain a better representation of the documents. For a JAVA implementation that applies SVD decomposition and dimensionality reduction to document clustering, see this article.
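A minimal NumPy sketch of this pipeline (toy counts stand in for tf-idf values; a production system would use a sparse, partial SVD):

import numpy as np

# Toy term-document matrix (t=4 terms, d=3 documents); in practice the
# entries would be tf-idf weights, as described above.
C = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Full SVD: C = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation C_k).
k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents in the k-dimensional latent semantic space: each column of
# diag(s_k) @ Vt_k is one document's coordinates on the latent "topics".
docs_latent = np.diag(s[:k]) @ Vt[:k, :]   # shape (k, d)
print(docs_latent.T)                        # each row = one document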

 

An illustration of SVD-based compression:

For example, suppose Data is a 32*32 image matrix. After SVD decomposition, only the two largest singular values are kept.

U is then a 32*2 matrix and VT a 2*32 matrix, plus the two singular values, so the total number of stored values is 64 + 64 + 2 = 130.

Compared with the original 1024 = 32*32 values, this is a compression ratio of roughly 8 (1024 / 130 ≈ 7.9).

[Peter Harrington, Machine Learning in Action]
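A short NumPy sketch of the book's storage arithmetic (a random matrix stands in here for the 32*32 handwritten-digit image used in the book, so the rank-2 reconstruction itself is poor; the point is the storage count):

import numpy as np

np.random.seed(0)
img = (np.random.rand(32, 32) > 0.5).astype(float)  # stand-in image

U, s, Vt = np.linalg.svd(img)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 reconstruction

# Storage: U (32x2) + 2 singular values + VT (2x32) = 64 + 2 + 64 = 130
stored = U[:, :k].size + k + Vt[:k, :].size
print(stored, "values instead of", img.size)  # 130 instead of 1024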

 

An example of SVD dimensionality reduction given in IIR (Introduction to Information Retrieval) is as follows (figure not reproduced here):

On the left is the SVD decomposition of the original matrix; on the right, only the largest singular values are retained, reducing the matrix to two dimensions.

PS:
Although SVD-based LSA has achieved some success, it lacks a rigorous mathematical-statistical foundation, and computing the SVD is very time-consuming.

At SIGIR '99, Hofmann proposed the PLSA model, which is grounded in probability and statistics and whose parameters are learned with the EM algorithm.
