Topic Model: LDA Analysis


Last month I attended the SIGKDD international conference in Beijing. The LDA model came up repeatedly in the workshops on personalized recommendation, social networks, advertising prediction, and other fields, so it is clearly widely used. After the meeting I took some time to learn about LDA and summarized it as follows:

(1) Role of LDA

The traditional way to judge the similarity of two documents is to count the words they have in common, for example with TF-IDF. This approach does not take into account the semantic association behind the text: two documents may share few or even no words and still be similar.

For example, there are two sentences:

"Steve Jobs left us ."

"Will Apple prices fall ?"

The two sentences above share no words, yet they are clearly related: one is about the founder of Apple, the other about Apple's products. The traditional method would judge them as not similar. Therefore, when determining the relevance of documents, we need to consider their semantics. A powerful tool for semantic mining is the topic model, and LDA is one of the more effective topic models.
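As a rough illustration (a sketch of my own, not part of the original post), the following Python snippet computes the TF-IDF cosine similarity of the two example sentences using scikit-learn; since the sentences share no words, the similarity comes out as 0 even though they are semantically related.

# Sketch: TF-IDF finds zero similarity because the sentences share no words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Steve Jobs left us.", "Will Apple prices fall?"]

tfidf = TfidfVectorizer().fit_transform(docs)        # 2 x V term matrix
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]    # cosine of the two rows
print(sim)                                           # 0.0 -- no shared terms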

In a topic model, a topic represents a concept, an aspect. It is represented by a series of related words, together with the conditional probability of each word given the topic. Intuitively, a topic is like a bucket holding the words that occur with high probability under it; these words are strongly correlated with the topic.

How is a topic generated? How do we analyze the topics of an article? These are the questions a topic model tries to answer.

First, we can look at documents and topics through a generative model. "Generative" here means that we assume each word in an article is produced by the process "choose a topic with a certain probability, then choose a word from that topic with a certain probability". Under this assumption, the probability that a given word appears in a given document is:

P(word | document) = Σ_topic P(word | topic) × P(topic | document)

This formula can be written as a matrix product:

"document-word" matrix = "document-topic" matrix × "topic-word" matrix
The "document-word" matrix gives the frequency, that is, the probability of occurrence, of each word in each document; the "topic-word" matrix gives the probability of each word under each topic; the "document-topic" matrix gives the probability of each topic appearing in each document.

Given a collection of documents, we can use word segmentation to compute the frequency of each word in each document and obtain the "document-word" matrix on the left. Training a topic model is precisely the process of learning the two matrices on the right from this matrix.
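As a minimal sketch of this idea (the toy corpus and the choice of two topics are my own, not from the original post), scikit-learn can build the "document-word" count matrix and learn the two matrices on the right by fitting an LDA model:

# Sketch: build the "document-word" matrix, then factor it with LDA into
# a "document-topic" matrix and a "topic-word" matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "apple released a new phone today",
    "steve jobs founded apple",
    "stock prices fall as markets react",
    "investors watch stock prices closely",
]

doc_word = CountVectorizer().fit_transform(docs)   # word frequencies per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_word)            # "document-topic" matrix (D x K)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # "topic-word" (K x V)

print(doc_topic.shape, topic_word.shape)           # (4, 2) and (2, V)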

Two common topic models are pLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). The following describes LDA.

(2) LDA Introduction

For the problem of generating M documents, each containing N words, the paper Latent Dirichlet Allocation discusses three methods:

Method 1: Unigram Model

This model generates a document using the following method:

For each of the N words w_n:
    Choose a word w_n ~ P(w);

Here N is the number of words in the document to be generated, w_n is the n-th word, and P(w) is the distribution over words, which can be learned statistically from a corpus, for example by counting how often each word occurs in a book.

This method learns a word probability distribution from the training corpus and then draws one word at a time from that distribution to build a document; repeating the process M times generates M documents. The graphical model (plate notation) is shown in the original post.
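A minimal sketch of this generation process (the vocabulary and probabilities below are made up for illustration, not taken from the original post):

# Sketch: unigram model -- every word of every document is drawn
# independently from one fixed word distribution P(w).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "phone", "stock", "price", "fall"]
p_w = np.array([0.3, 0.2, 0.2, 0.2, 0.1])        # P(w), learned from a corpus in practice

M, N = 3, 8                                       # M documents, N words each
docs = [rng.choice(vocab, size=N, p=p_w) for _ in range(M)]
for d in docs:
    print(" ".join(d))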


Method 2: Mixture of Unigrams

The drawback of the unigram model is that the generated text has no topic structure and is too simple. The mixture of unigrams model improves on this. It generates a document as follows:

Choose a topic z ~ P(z);
For each of the N words w_n:
    Choose a word w_n ~ P(w | z);

Here z denotes a topic and P(z) is the probability distribution over topics; z is drawn from P(z). N and w_n are as above. P(w | z) is the distribution over words given topic z. It can be viewed as a K × V matrix, where K is the number of topics and V is the number of words; each row is the word distribution of one topic, that is, the probability of each word under topic z. Each word is then drawn from this distribution.

In this method, a topic z is selected first, and topic z determines a word distribution P(w | z); each word of the document is then drawn from this distribution. Repeating the process M times produces M different documents. The graphical model (plate notation) is shown in the original post.


In the figure, z sits outside the plate containing w, which indicates that z is generated only once for a document of N words; that is, each document has only one topic. This does not fit the general situation, since a document usually contains multiple topics.
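A minimal sketch of the mixture-of-unigrams process (again with a made-up vocabulary and two made-up topics); note that a single topic z is drawn once per document and then all N words come from P(w | z):

# Sketch: mixture of unigrams -- one topic per document, then all words
# are drawn from that topic's word distribution P(w | z).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "phone", "stock", "price", "fall"]
p_z = np.array([0.5, 0.5])                       # P(z), K = 2 topics
p_w_given_z = np.array([                         # K x V matrix, rows sum to 1
    [0.45, 0.45, 0.04, 0.03, 0.03],              # topic 0: technology words
    [0.03, 0.03, 0.34, 0.30, 0.30],              # topic 1: finance words
])

M, N = 3, 8
for _ in range(M):
    z = rng.choice(len(p_z), p=p_z)              # one topic for the whole document
    words = rng.choice(vocab, size=N, p=p_w_given_z[z])
    print(f"topic {z}:", " ".join(words))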

Method 3: LDA (Latent Dirichlet Allocation)

LDA allows a generated document to contain multiple topics. The model generates a document as follows:

Choose parameter θ ~ P(θ);
For each of the N words w_n:
    Choose a topic z_n ~ P(z | θ);
    Choose a word w_n ~ P(w | z);

Here θ is a topic vector; each component gives the probability of the corresponding topic appearing in the document. θ is a non-negative, normalized vector, and P(θ) is its distribution, specifically a Dirichlet distribution. N and w_n are as above. z_n is the topic chosen for the n-th word, and P(z | θ) is the distribution over topics given θ; concretely, P(z = i | θ) = θ_i. P(w | z) is the same as above.

This method first selects a topic vector θ, which determines the probability of each topic being chosen. Then, for each word, a topic z is drawn from θ and a word is drawn from topic z's word distribution. The graphical model (plate notation) is shown in the original post.
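A minimal sketch of the LDA generation process (α, β, and the vocabulary are made up for illustration); here each document gets its own topic vector θ, and each word first picks a topic z from θ and then a word from P(w | z):

# Sketch: LDA -- per-document topic vector theta ~ Dirichlet(alpha),
# then per-word topic z ~ theta and word w ~ beta[z].
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "phone", "stock", "price", "fall"]
alpha = np.array([0.5, 0.5])                     # Dirichlet parameter (K = 2 topics)
beta = np.array([                                # K x V "topic-word" matrix P(w | z)
    [0.45, 0.45, 0.04, 0.03, 0.03],
    [0.03, 0.03, 0.34, 0.30, 0.30],
])

M, N = 3, 8
for _ in range(M):
    theta = rng.dirichlet(alpha)                 # topic mixture for this document
    words = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)      # topic for this word
        words.append(rng.choice(vocab, p=beta[z]))
    print(np.round(theta, 2), " ".join(words))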


We can see that the joint probability of LDA is:

P(θ, z, w | α, β) = P(θ | α) × ∏_{n=1}^{N} P(z_n | θ) P(w_n | z_n, β)
Mapping this formula onto the graphical model (annotated in the original post), it can be roughly understood as follows:


The three levels of LDA are marked with three colors in that figure:

1. Corpus-level (red): α and β are corpus-level parameters; they are the same for every document, so they are sampled only once in the whole generation process.

2. Document-level (orange): θ is a document-level variable. Each document has its own θ, that is, each document generates topics z with different probabilities, so θ is sampled once per document.

3. Word-level (green): z and w are both word-level variables. z is generated from θ, w is generated from z and β, and each word w corresponds to one topic z.

From the discussion of the LDA generative model above, we can see that training LDA mainly means learning the two control parameters α and β from a given input corpus. Once these two parameters are learned, the model is determined and can be used to generate documents. α and β correspond to the following:

α: the vector parameter of the Dirichlet distribution P(θ), used to generate a topic vector θ;

β: the parameters of P(w | z), that is, the matrix of word probability distributions for each topic.

Taking w as the observed variable and θ and z as hidden variables, we can learn α and β with an EM algorithm. During the solution, the posterior probability P(θ, z | w) cannot be computed directly, so we look for a lower bound of the likelihood function to approximate it. The original paper uses a factorized variational method (variational inference) combined with EM: in each E-step, α and β are held fixed while the lower bound of the likelihood is computed, and in the M-step the bound is maximized to update α and β; this is iterated until convergence.
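As an illustration of training in practice (a sketch of my own: it uses gensim's implementation, which applies an online variational Bayes algorithm rather than the exact variational EM of the original paper, and the toy corpus is made up), the snippet below learns the topic-word distributions and, with alpha="auto", also updates the Dirichlet parameter α during training:

# Sketch: train an LDA model with gensim's variational Bayes implementation.
from gensim import corpora, models

texts = [
    ["apple", "phone", "jobs"],
    ["apple", "stock", "price"],
    ["stock", "price", "fall"],
    ["phone", "price", "fall"],
]

dictionary = corpora.Dictionary(texts)                    # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]           # bag-of-words counts

lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=2, alpha="auto", passes=20)

print(lda.alpha)                                          # learned Dirichlet parameter
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=True):
    print(topic_id, words)                                # per-topic word distributions
print(lda.get_document_topics(corpus[0]))                 # theta for the first document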

 

References:

David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3, pp. 993-1022, 2003

[JMLR '03] Latent Dirichlet Allocation (LDA), David M. Blei

The mysteries behind search: a discussion on semantic topic computing

http://bbs.byr.cn/#!article/pr_ai/2530?p=1


Reprinted: please indicate the source. Original address: http://blog.csdn.net/huagong_adu/article/details/7937616
