LDA Text Modeling (4): Algorithmic Details, Pseudo-code, Implementation

The generation process above can be summarized as follows:
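(A sketch in standard smoothed-LDA notation, consistent with Equation (1) later in this article; K is the number of topics, D the number of documents, and l_d the length of document d.)

    % Standard smoothed-LDA generative process (sketch)
    \varphi_t \sim \mathrm{Dirichlet}(\beta), \qquad t = 1,\dots,K
    \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad d = 1,\dots,D
    z_{d,i} \sim \mathrm{Multinomial}(\theta_d), \quad
    w_{d,i} \sim \mathrm{Multinomial}(\varphi_{z_{d,i}}), \qquad i = 1,\dots,l_d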
For more details, see: http://www.flickering.cn/nlp/2015/03/peacock%EF%BC%9A%E5%A4%A7%E8%A7%84%E6%A8%A1%E4%B8%BB%E9%A2%98%E6%A8%A1%e5%9e%8b%e5%8f%8a%e5%85%b6%e5%9c%a8%e8%85%be%e8%ae%af%e4%b8%9a%e5%8a%a1%e4%b8%ad%e7%9a%84%e5%ba%94%e7%94%a8/

2. What is a topic model?

The following takes document modeling as an example to briefly introduce the topic model.

2.1 The "three processes" of a topic model

A topic model typically involves three important processes: the generation process, the training process, and online inference. The generation process defines the assumptions and physical meaning of the model, the training process defines how the model is learned from training data, and online inference defines how the model is applied. Each is briefly introduced below.

In general, a topic model is a generative model (which can be intuitively understood as follows: given the model, training samples can be generated from it). Given the model, the generation process is shown in Figure 11. The model has two topics: topic 1 is about banking (its top words are loan, bank, money, etc.) and topic 2 is about rivers (its top words are river, stream, bank, etc.).

Document 1 is 100% about topic 1, so its topic vector is <1.0, 0.0>; each word in the document is generated by choosing topic 1 with probability 1.0 and then drawing a word from topic 1 with the corresponding word probability.

Document 2 is 50% about topic 1 and 50% about topic 2, so its topic vector is <0.5, 0.5>; each word is generated by choosing topic 1 or topic 2 with equal probability and then drawing a word from the chosen topic.

Document 3 is 100% about topic 2, so its topic vector is <0.0, 1.0>; each word is generated by choosing topic 2 with probability 1.0 and then drawing a word from topic 2.
Figure 11 The generation process of the topic model [9]
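To make this generation process concrete, here is a minimal Python sketch of it. This is a toy example only: the topic word lists and probabilities below are made up for the bank/river illustration above, not taken from any trained model.

    # Toy generative process for the two-topic example in Figure 11.
    import numpy as np

    rng = np.random.default_rng(0)

    # Topic-word distributions (illustrative probabilities, assumed for this sketch).
    topics = [
        {"loan": 0.4, "bank": 0.35, "money": 0.25},    # topic 1: banking
        {"river": 0.4, "stream": 0.35, "bank": 0.25},  # topic 2: rivers
    ]

    def generate_doc(theta, length=10):
        """theta: the document's topic vector, e.g. [0.5, 0.5] for document 2."""
        words = []
        for _ in range(length):
            z = rng.choice(len(theta), p=theta)         # pick a topic from the topic vector
            vocab, probs = zip(*topics[z].items())
            words.append(rng.choice(vocab, p=probs))    # pick a word from that topic
        return words

    print(generate_doc([1.0, 0.0]))   # document 1: 100% topic 1
    print(generate_doc([0.5, 0.5]))   # document 2: 50% topic 1, 50% topic 2
    print(generate_doc([0.0, 1.0]))   # document 3: 100% topic 2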

In reality we do not have the model; we only have a huge amount of Internet document data. What we want is a machine learning algorithm that can automatically learn the topic model from the training documents (Figure 12), that is, the specific distribution of each topic over the vocabulary. In general, the training process also yields a byproduct: the topic vector of each training document.
Figure 12 Training process of the topic model [9]

Given the trained topic model and a new document, we can obtain the document's topic vector by online inference (Figure 13). Figures 5, 6 and 7 give some concrete examples.
Figure 13 Online inference of the topic model
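One common way to implement this online inference is to "fold in" the new document with Gibbs sampling while keeping the trained word-topic counts fixed. The following is a minimal sketch under that assumption; the variables nwt, nt, alpha, beta and V follow the notation of Equation (1) and the training sketch in Section 2.2 below.

    # Online inference for one new document, assuming a trained word-topic
    # count matrix nwt (V x K) and per-topic totals nt (K,) are available.
    import numpy as np

    def infer_topic_vector(doc, nwt, nt, alpha=0.1, beta=0.01, n_iter=20, seed=0):
        """doc: list of word ids; returns the document's topic vector."""
        rng = np.random.default_rng(seed)
        V, K = nwt.shape
        ntd = np.zeros(K)                       # topic counts for this document only
        z = rng.integers(0, K, size=len(doc))   # random initial topic assignments
        for t in z:
            ntd[t] += 1
        for _ in range(n_iter):
            for i, w in enumerate(doc):
                ntd[z[i]] -= 1                  # remove the current assignment
                # same form as Eq. (1); nwt and nt stay fixed during inference
                p = (nwt[w] + beta) / (nt + V * beta) * (ntd + alpha)
                z[i] = rng.choice(K, p=p / p.sum())
                ntd[z[i]] += 1
        return (ntd + alpha) / (ntd.sum() + K * alpha)   # smoothed topic vector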

Of the three processes, the training process is the hardest, and it is the focus of the rest of this article.

2.2 The LDA model and its training algorithm

LDA (Latent Dirichlet Allocation) [10] is an important topic model that has attracted great attention from both academia and industry since its publication, and related papers keep appearing. Many training algorithms exist for LDA; below we give a brief introduction using Gibbs sampling [11, 12] as an example.
Figure 14 LDA training process

Skipping the complex mathematical derivation, the LDA training process based on Gibbs sampling is shown in Figure 14 (each word is denoted by w, the topic of each word by z, and the different colors of the z nodes in the figure represent different topics):

Step 1: Initially, randomly assign a topic z to each word in the training corpus, and build two count matrices: the doc-topic count matrix N_td, which describes the topic frequency distribution of each document, and the word-topic count matrix N_wt, which describes the word frequency distribution under each topic. As shown in Figure 15, the two matrices correspond to the frequency counts on the edges of the graph.

Step 2: Traverse the training corpus and resample the topic z of each word w according to Equation (1) below, updating N_wt and N_td accordingly.

Step 3: Repeat Step 2 until N_wt converges.

In Step 2, the topic z of word w is resampled according to the following formula:
P(z = t \mid w, \ast) \;=\; \frac{n^{\lnot}_{wt} + \beta}{n^{\lnot}_{t} + V\beta} \cdot \frac{n^{\lnot}_{td} + \alpha_t}{l_d - 1 + \sum_{t'} \alpha_{t'}} \;\propto\; \frac{n^{\lnot}_{wt} + \beta}{n^{\lnot}_{t} + V\beta}\,\bigl(n^{\lnot}_{td} + \alpha_t\bigr) \qquad (1)
where α_t and β are hyperparameters that smooth the frequency counts in N_td and N_wt respectively; V is the vocabulary size; l_d is the length of document d; n_wt is the number of times word w is assigned to topic t in the training corpus; n_t is the number of times topic t occurs in the training corpus; and n_td is the number of times topic t occurs in document d. The superscript ¬ means the contribution of the word w currently being resampled is excluded (for example, n¬_td is the number of occurrences of topic t in document d after subtracting the topic assignment of the current word).
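Putting Steps 1-3 and Equation (1) together, a minimal collapsed-Gibbs-sampling sketch in Python might look like the following. This is an illustration only, not the original implementation: a single symmetric alpha is assumed, and convergence is replaced by a fixed number of sweeps.

    import numpy as np

    def train_lda(docs, K, V, alpha=0.1, beta=0.01, n_sweeps=100, seed=0):
        """docs: list of documents, each a list of word ids in [0, V)."""
        rng = np.random.default_rng(seed)
        D = len(docs)
        nwt = np.zeros((V, K))   # word-topic count matrix N_wt
        ntd = np.zeros((K, D))   # doc-topic count matrix  N_td
        nt = np.zeros(K)         # per-topic totals        N_t
        z = []                   # topic assignment of every word occurrence

        # Step 1: randomly assign a topic to every word and build the counts.
        for d, doc in enumerate(docs):
            zd = rng.integers(0, K, size=len(doc))
            z.append(zd)
            for w, t in zip(doc, zd):
                nwt[w, t] += 1
                ntd[t, d] += 1
                nt[t] += 1

        # Steps 2-3: sweep the corpus and resample each word's topic;
        # repeat until the counts stabilize (here: a fixed number of sweeps).
        for _ in range(n_sweeps):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t_old = z[d][i]
                    # remove the current word's contribution (the ¬ in Eq. (1))
                    nwt[w, t_old] -= 1; ntd[t_old, d] -= 1; nt[t_old] -= 1
                    # Eq. (1): p(z=t | w, *) ∝ (n_wt + β)/(n_t + Vβ) · (n_td + α)
                    p = (nwt[w] + beta) / (nt + V * beta) * (ntd[:, d] + alpha)
                    t_new = rng.choice(K, p=p / p.sum())
                    z[d][i] = t_new
                    nwt[w, t_new] += 1; ntd[t_new, d] += 1; nt[t_new] += 1
        return nwt, ntd

Normalizing the columns of the returned N_wt (with β smoothing) gives the topic-word distributions, and normalizing the columns of N_td (with α smoothing) gives each training document's topic vector, the byproduct mentioned in Section 2.1.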


Figure 15 Resampling the topic of word w in document d1

In fact, the above formula for resampling the topic z of word w in document d has a very clear physical meaning: it represents P(w|z)P(z|d), which can be visualized as the "path selection" process shown in Figure 15. For the current word w in the current document d (shown in bold in Figure 15): first, the "old" topic z of word w defines a path d→z→w (the dashed line in Figure 15(1)), and this "old" topic is removed from word w by updating the counts in N_wt and N_td (the "−1" operations along the old path in Figure 15(1)). Next, the probability of every possible path d→z→w is computed; the probability of a path equals the product of the probabilities of its two segments d→z and z→w, i.e. P(z|d)P(w|z), where P(z|d) is computed from N_td and P(w|z) from N_wt. Finally, a new path, i.e. a new topic z for word w, is sampled according to these path probabilities, and the counts in N_wt and N_td are updated accordingly.
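The path-selection view also suggests how a single draw is implemented in practice: lay the unnormalized path weights end to end and see where a uniform random point falls. A tiny illustrative helper, where p_td and p_wt stand for the two factors of Equation (1) evaluated for all topics:

    import numpy as np

    def sample_path(p_td, p_wt, rng):
        """Pick a topic t, i.e. a path d -> t -> w, with probability ∝ p_td[t] * p_wt[t]."""
        weights = p_td * p_wt                 # unnormalized weight of each d->t->w path
        cum = np.cumsum(weights)              # lay the paths end to end
        u = rng.random() * cum[-1]            # a uniform point on the total length
        return int(np.searchsorted(cum, u))   # the path segment the point falls into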
