LDA Topic Model

Source: Internet
Author: User

The pLSA model is frequentist: each document's distribution over the K topics is fixed, each topic's distribution over words is fixed, and what we ultimately estimate is a fixed doc-topic model and a fixed topic-word model. The Bayesian school naturally disagrees: in its view, a document's topic distribution and a topic's word distribution are themselves unknown random quantities, so we cannot solve for exact values; we can only compute probability distributions over the doc-topic model and the topic-word model.

LDA Model Document Generation Process

We treat both the doc-topic model and the topic-word model as dice: each doc-topic die has K faces (K is the number of topics), and each topic-word die has V faces (V is the size of the vocabulary). In pLSA, throwing a doc-topic die is a K-outcome multinomial trial and throwing a topic-word die is a V-outcome multinomial trial, so it is natural to use a K-dimensional Dirichlet distribution as the prior for the former and a V-dimensional Dirichlet distribution as the prior for the latter. Recasting the pLSA model in this Bayesian way yields the LDA model. The LDA model's document generation is illustrated below.

Since the topic-word model is document-independent, we generate the K topic-word dice from the Dirichlet distribution once, before any document is generated. The doc-topic model, by contrast, is specific to each document, so one doc-topic die must be generated from the Dirichlet distribution before each document is generated. The LDA document generation process is as follows.
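This generative process can be sketched in Python. This is a minimal toy sketch, not the original author's code: the corpus sizes, the hyperparameter values for α and β, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8               # number of topics, vocabulary size (toy values)
M = 2                     # number of documents
doc_lengths = [5, 4]      # words per document (assumed)
alpha = np.full(K, 0.5)   # symmetric Dirichlet hyperparameter for doc-topic dice
beta = np.full(V, 0.1)    # symmetric Dirichlet hyperparameter for topic-word dice

# Before any document is generated, draw the K topic-word dice: phi_k ~ Dir(beta).
phi = rng.dirichlet(beta, size=K)      # shape (K, V), one die per topic

corpus = []
for m in range(M):
    # Before generating document m, draw its doc-topic die: theta_m ~ Dir(alpha).
    theta = rng.dirichlet(alpha)       # shape (K,)
    doc = []
    for _ in range(doc_lengths[m]):
        z = rng.choice(K, p=theta)     # K-outcome trial: throw the doc-topic die
        w = rng.choice(V, p=phi[z])    # V-outcome trial: throw topic-word die z
        doc.append(int(w))
    corpus.append(doc)

print(corpus)  # each document is a list of word ids in [0, V)
```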

Physical Decomposition of the LDA Model

Physical Process Decomposition

The generation of the n-th word of the m-th document in the corpus can be decomposed into the following two processes:

1. Generate doc-topic die number m from the Dirichlet(α) distribution, throw it to perform a K-outcome multinomial trial, and generate the topic z (1 <= z <= K).

2. The K topic-word dice (numbered 1 to K) have already been generated in advance from the Dirichlet(β) distribution; select die number z, throw it to perform a V-outcome multinomial trial, and generate the word w.

Mathematical Description of the LDA Model

The first physical process is clearly a Dirichlet-multinomial conjugate structure:

Dirichlet(α) → θ_m → Multinomial(θ_m) → z_m

Compare with the formula below (copied from "LDA Math Gossip" rather than typed out; it is in fact the multinomial distribution integrated against the Dirichlet distribution).

We have

p(z_m | α) = Δ(n_m + α) / Δ(α),   where Δ(x) = ∏_i Γ(x_i) / Γ(∑_i x_i)

Here n_m = (n_m^(1), ..., n_m^(K)), and n_m^(k) is the number of words in document m assigned to topic k (that is, the number of times the doc-topic throws for document m produced topic k; these counts are latent, since the topic assignments are unknown to us). The posterior distribution of the parameter θ_m is

Dir(θ_m | n_m + α)
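The probability above is a ratio of Dirichlet normalizers, Δ(n_m + α) / Δ(α) with Δ(x) = ∏_i Γ(x_i) / Γ(∑_i x_i), which is best evaluated in log space with the log-gamma function. A minimal sketch, where the toy counts n_m and the hyperparameter values are assumptions for illustration:

```python
from math import lgamma

def log_delta(x):
    # log Δ(x) = sum_i log Γ(x_i) − log Γ(sum_i x_i)
    return sum(lgamma(v) for v in x) - lgamma(sum(x))

# Toy document: topic counts n_m over K = 3 topics (assumed values),
# i.e. 4 words assigned to topic 1, 1 word to topic 2, 0 to topic 3.
n_m = [4, 1, 0]
alpha = [0.5, 0.5, 0.5]

# log p(z_m | alpha) = log Δ(n_m + α) − log Δ(α)
log_p = log_delta([n + a for n, a in zip(n_m, alpha)]) - log_delta(alpha)
print(log_p)
```

As a sanity check, Δ(1, 1) = Γ(1)Γ(1)/Γ(2) = 1, so log_delta([1.0, 1.0]) should be 0; and summing p(z_m | α) over every possible topic sequence of a fixed length must give 1.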

Because the M documents in the corpus are mutually independent, we obtain M independent Dirichlet-multinomial conjugate structures, so the topic-generation probability of the whole corpus is

p(z | α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α)    (1)

Since the topic-word probability distributions are independent of the documents, we have K Dirichlet distributions, one for each of the K topic-word dice, so we should obtain K Dirichlet-multinomial conjugate structures.

In the LDA process as described so far, however, the V-outcome trials of the K topic-word dice are not conveniently grouped, so we reorganize the process.

1. For each word of each document, perform one doc-topic multinomial throw followed immediately by one topic-word multinomial throw.

This is first revised to:

2. For each document, first perform all n of its doc-topic multinomial throws, then perform its n topic-word multinomial throws (n is the number of words in the document).

And further revised to:

3. For the whole corpus, first perform all n doc-topic multinomial throws and sort the results into K classes, one class per topic; then, within each of the K classes, perform the corresponding topic-word multinomial throws, n throws in total (n is now the number of words in the whole corpus).

Performing the topic-word throws separately within the K classes is exactly the K topic-word dice performing K independent V-outcome experiments. The above procedure can be represented by the following two reordered vectors:

z' = (z_(1), ..., z_(K)),   w' = (w_(1), ..., w_(K))

Here z' gathers the results of all M documents' doc-topic throws, sorted into classes by topic, and w' gathers the results of the K topic-word throw experiments, with w_(k) collecting the words generated by topic-word die k.
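The regrouping in step 3 — sorting every (word, topic) pair in the corpus into K classes, one per topic — can be sketched as follows; the word list and topic assignments are toy values, not from the original text:

```python
# Flattened corpus: every word paired with the topic assignment that generated it
# (toy data; in the text these come from the doc-topic throws of all documents).
words  = ["apple", "bike", "cherry", "car", "banana"]
topics = [0, 1, 0, 1, 0]
K = 2

# Reorder the corpus into K classes, one per topic: this is the
# w' = (w_(1), ..., w_(K)) regrouping that turns the process into
# K independent topic-word experiments.
grouped = [[w for w, z in zip(words, topics) if z == k] for k in range(K)]
print(grouped)  # [['apple', 'cherry', 'banana'], ['bike', 'car']]
```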

So the second physical process is also a Dirichlet-multinomial conjugate structure:

Dirichlet(β) → φ_k → Multinomial(φ_k) → w_(k)

We have

p(w_(k) | β) = Δ(n_k + β) / Δ(β)

Here n_k = (n_k^(1), ..., n_k^(V)), and n_k^(t) is the number of times topic k generated word t (again latent until the topic assignments are known). The posterior distribution of the parameter φ_k is

Dir(φ_k | n_k + β)

Since the K topics generate their words independently, we obtain K independent Dirichlet-multinomial conjugate structures, so the word-generation probability of the whole corpus is

p(w | z, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β)    (2)

Since topic generation and word generation are independent, combining (1) and (2) gives

p(w, z | α, β) = p(z | α) · p(w | z, β) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α) · ∏_{k=1}^{K} Δ(n_k + β) / Δ(β)
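The product of (1) and (2) can be evaluated numerically from the two count tables. A sketch that computes the joint log-probability log p(w, z | α, β); the counts, hyperparameters, and function names are toy assumptions:

```python
from math import lgamma

def log_delta(x):
    # log Δ(x) = sum_i log Γ(x_i) − log Γ(sum_i x_i)
    return sum(lgamma(v) for v in x) - lgamma(sum(x))

def log_joint(doc_topic_counts, topic_word_counts, alpha, beta):
    # log p(w, z | alpha, beta)
    #   = sum_m [log Δ(n_m + α) − log Δ(α)]   (topic side, formula (1))
    #   + sum_k [log Δ(n_k + β) − log Δ(β)]   (word side,  formula (2))
    lp = 0.0
    for n_m in doc_topic_counts:
        lp += log_delta([n + a for n, a in zip(n_m, alpha)]) - log_delta(alpha)
    for n_k in topic_word_counts:
        lp += log_delta([n + b for n, b in zip(n_k, beta)]) - log_delta(beta)
    return lp

# Toy counts (assumed): M = 2 documents, K = 2 topics, V = 3 words.
doc_topic = [[3, 1], [0, 2]]          # n_m^(k): words in doc m assigned to topic k
topic_word = [[2, 1, 0], [1, 0, 2]]   # n_k^(t): times topic k produced word t
alpha, beta = [0.5, 0.5], [0.1, 0.1, 0.1]
print(log_joint(doc_topic, topic_word, alpha, beta))
```

This quantity is the starting point for collapsed Gibbs sampling, where θ and φ have been integrated out and only the counts remain.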

Reference: "LDA Math Gossip"
