Topic models: latent semantic analysis (Latent Semantic Analysis)

Source: Internet
Author: User
Tags: idf

A topic model is a machine learning model that attempts to discover the latent topic structure in a large collection of documents: by analyzing the words in the text, it uncovers the topics in each document, the relationships between topics, and how topics evolve. Topic models let us organize and summarize massive collections of electronic documents that cannot be annotated by hand. Earlier topic models include the mixture of unigrams model (Mixture of Unigrams), Latent Semantic Indexing (LSI), and Probabilistic Latent Semantic Indexing (PLSI).
In a topic model, each document is composed of topics, and each topic is a probability distribution over words. Building on the PLSI and LSI models, David Blei proposed the Latent Dirichlet Allocation (LDA) model, which introduces a Dirichlet prior as the prior distribution of the multinomial distribution, thereby simplifying the probabilistic derivation. It also solves the scaling problem of the PLSI model, whose parameter count grows with the number of documents.

Latent semantic analysis (Latent Semantic Analysis)

The vector space model cannot handle polysemy or synonymy, because human cognition is based on meaning rather than surface words. The word "flash" may appear in two articles, but one may be about the Flash software in computing while the other is about lightning research. Analyzing the word by itself could classify the two articles incorrectly. Moreover, the dimensionality of the word space is very large; finding a few key dimensions that compress the information is itself a hard problem, especially in today's era of information explosion.

Latent semantic analysis is a method for automatic indexing and information retrieval that maps documents and words into a lower-dimensional latent semantic space (latent semantic space) by an unsupervised method; the dimensions of this space can be interpreted as topics or semantic dimensions.

Latent semantic analysis applies singular value decomposition (SVD) to the term-document matrix. In general, similarities between documents, or between documents and queries, are more reliable in the reduced latent semantic space. Because the singular value decomposition orders the document features by importance, limiting the number of retained singular values reduces both the noise and the dimensionality of the data. The method was proposed by Dumais and colleagues in 1988 to address the mismatch, in keyword retrieval, between surface word forms and the meanings humans intend, which leads to missed and spurious matches.


LSA uses the vector space model to map the documents into a term-document matrix X, which is then factorized with SVD:

X = U Σ V^T

where U and V are orthogonal matrices, and Σ is a diagonal matrix containing the singular values of the document matrix.

Because the size of a singular value reflects how much the matrix varies along the corresponding dimension, and the singular values in Σ are arranged in descending order, when the first k singular values are large they dominate the decomposition: keeping only them, X_k = U_k Σ_k V_k^T, gives a rank-k approximation of the original matrix.
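The rank-k truncation above can be sketched with numpy. The matrix below is a toy term-document matrix invented for illustration, not the article's actual microblog data:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# The counts are illustrative only.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 1.0, 3.0, 2.0],
])

# Full SVD: X = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values for a rank-k approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_k is the best rank-k approximation of X in the Frobenius norm;
# the discarded small singular values correspond to noise dimensions.
err = np.linalg.norm(X - X_k)
```

The approximation error equals the Frobenius norm of the discarded singular values, which is why dropping small singular values removes noise while preserving the dominant structure.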

For example, consider a data set of 9 microblog documents on two topics: topic A is about an Eason Chan concert, and topic B is about Google Glass. After word segmentation we can count how often each word occurs in each microblog document, as shown in the following table.

Topic A: Eason Chan concert
A1: Eason Chan's concert was so good; big love for Eason Chan's "Ten Years"
A2: My favorites are Eason's "Ten Years" and "Because of Love"
A3: Watched Eason Chan's "Ten Years"; modern technology is really impressive, the live show was awesome
A4: At the Beijing concert, Eason and Faye Wong sang "Because of Love" as a duet
A5: At the concert Eason Chan invited Faye Wong to sing "Because of Love"; it was really something
Topic B: The arrival of Google Glass
B1: Google Glass is coming to market; you can now apply for a trial.
B2: A new idea in technology – Google Glass
B3: Glass offers unlimited creativity; you geeks can find a way to try it
B4: Google Glass trials can be applied for; it belongs to the category of wearable technology products

The word-frequency table can be regarded as the complete statistics of these 9 microblogs. From the frequency table we compute the TF-IDF weight of each term in each document, and then apply singular value decomposition to the resulting weight matrix.
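The TF-IDF weighting step can be sketched as follows. The term-frequency matrix and the comment labels are invented for illustration (they are not the article's actual frequency table), and the smoothed IDF formula used here is one common variant:

```python
import numpy as np

# Toy term-frequency matrix: rows = terms, columns = documents.
# Illustrative counts only, not the article's microblog data.
tf = np.array([
    [2, 1, 0, 0],   # e.g. "concert"
    [1, 2, 0, 0],   # e.g. "Eason"
    [0, 0, 1, 2],   # e.g. "Google"
    [1, 0, 2, 1],   # e.g. "glasses"
], dtype=float)

n_docs = tf.shape[1]
# Document frequency: in how many documents each term appears.
df = (tf > 0).sum(axis=1)
# Smoothed inverse document frequency: rare terms get higher weight.
idf = np.log(n_docs / df) + 1.0
# TF-IDF weight matrix, one column per document; this is the
# matrix that the singular value decomposition is applied to.
w = tf * idf[:, None]
```

Terms that appear in many documents receive a lower IDF, so common words contribute less to the decomposition than topic-specific ones.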

Taking the TF-IDF weight matrix X as input, its singular value decomposition X = U Σ V^T is computed.

Taking the first two dimensions of the singular value decomposition, i.e. k = 2, we obtain U_2, Σ_2, and V_2. Each of the 9 documents then corresponds to a point in this two-dimensional space, so we can plot each article's position on the two dimensions (as shown in the figure: the blue squares represent the four topic-B microblogs, and the red diamonds the five topic-A microblogs). For a new article we can likewise compute its coordinates on the two dimensions. The black circle is the new microblog "Eason concert Faye Wong"; as the result shows, the two topics are separated very well in these two dimensions.
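Projecting a new document into the existing 2-D latent space without recomputing the SVD is often called "folding in": q_hat = q^T U_k Σ_k^{-1}. A minimal sketch, again using a toy weight matrix rather than the article's actual data:

```python
import numpy as np

# Toy term-document weight matrix (rows = terms, columns = documents);
# illustrative values only.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 1.0, 3.0, 2.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], np.diag(s[:k])

# Coordinates of the existing documents in the 2-D latent space:
# one row per document.
doc_coords = (S_k @ Vt[:k, :]).T

# "Fold in" a new document q (its term vector) without redoing the SVD:
# q_hat = q^T U_k S_k^{-1}
q = np.array([1.0, 1.0, 0.0, 0.0])   # new doc uses the first two terms
q_hat = q @ U_k @ np.linalg.inv(S_k)
```

Comparing q_hat to the rows of doc_coords by cosine similarity then places the new document near the documents that share its topic, which is exactly how the black circle in the figure is positioned relative to the two clusters.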


[1] Dumais S T. Latent semantic analysis[J]. Annual Review of Information Science and Technology, 2004, 38(1): 188–230.

[2] Blei D M, Lafferty J. Topic models[J]. Text Mining: Classification, Clustering, and Applications, 2009, 10: 71.

[3] Steyvers M, Griffiths T. Probabilistic topic models[J]. Handbook of Latent Semantic Analysis, 2007, 427(7): 424–440.

