LDA was a task my advisor assigned in early October. Every time I picked up "LDA Math Gossip" and saw the formulas derived up front, it felt like a hard problem, and I put it off until the end of October. This weekend I spent two days and finally understood LDA; it is in fact a very simple model, so don't be frightened by the mathematical formulas that come before it. Of course, as a beginner, if I have misunderstood anything, criticism and corrections are welcome.
Unlike "LDA Math Gossip", I want to start from the model itself.
Suppose I have M articles, built from a vocabulary of V distinct words. Each word may belong to a different topic, and the total number of topics is K (we do not know which topic each word belongs to). Now I want to derive the topic distribution of a new article from the existing corpus, where this new article is also made up of some of those V words.
Following the usual machine learning approach, we need a model: its parameters are estimated from the existing corpus, and the model is then used to generate the topic distribution of a new article.
This is unsupervised learning, and the parameters of the model are estimated using the stationary distribution of a Markov chain. In a Markov chain, the probability of the current state depends only on the previous state, not on the initial state; after transitioning through the transition matrix for enough steps, the chain converges to a fixed distribution, called the stationary distribution of the Markov chain. If the stationary distribution is P(x), then the sequence of states visited after the chain has converged is a set of samples from P(x). This is the famous MCMC. Gibbs sampling is an optimization of MCMC: it raises the acceptance probability α in MCMC to 1 and restricts state transitions to moves along the coordinate axes, resampling each of the n coordinates in turn; after convergence, the samples obtained are samples from P(x1, x2, ..., xn).
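As a quick reminder of the mechanics (a minimal sketch in standard notation, not the derivation from the references): a stationary distribution \pi of a chain with transition kernel P satisfies

    \pi = \pi P, \qquad \pi(x)\,P(x \to x') = \pi(x')\,P(x' \to x),

where the second equation is the detailed balance condition, a sufficient condition for stationarity. One Gibbs sweep resamples each coordinate from its full conditional given all the others:

    x_k \sim P(x_k \mid x_1, \dots, x_{k-1}, x_{k+1}, \dots, x_n), \qquad k = 1, \dots, n.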
LDA uses exactly this kind of Gibbs sampling. First, the topic distribution of each article and the word distribution of each topic are set to random values; then sampling proceeds as described above until convergence, which yields the word distribution of each topic. (The topics' word distributions are shared across all documents, so they serve as the model parameters for generating the topic distribution of a new article.) With the model parameters in hand, the new article's topic distribution is set to a random initial value, the topics' word distributions are held fixed, and sampling proceeds in the same way; once Gibbs sampling converges, the topic distribution of the new article is obtained.
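Put as code, the training phase looks roughly like the sketch below. The count arrays and sampleTopicZ match the code shown later in this post; the method name trainGibbs, the iteration count, and the initialization details are my own assumptions, not part of the referenced implementation.

public void trainGibbs(int iterations) {
    // Randomly assign a topic to every word and build the count tables.
    // doc[m][n] is the vocabulary index of the n-th word of document m.
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < doc[m].length; n++) {
            int k = (int) (Math.random() * K);   // random initial topic in [0, K)
            z[m][n] = k;
            nmk[m][k]++;
            nkt[k][doc[m][n]]++;
            nmkSum[m]++;
            nktSum[k]++;
        }
    }
    // Repeatedly resample the topic of every word until (approximate) convergence.
    for (int it = 0; it < iterations; it++) {
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < doc[m].length; n++) {
                z[m][n] = sampleTopicZ(m, n);
            }
        }
    }
    // After convergence, phi and theta are estimated from the counts (see below).
}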
That is all there is to LDA! Of course, it may still be unclear exactly how the sampling is done. This is where the gamma function, the beta distribution, the Dirichlet distribution, the binomial and multinomial distributions, and their conjugacy come in.
In the generative process, the topic distribution of an article and the word distribution of each topic both follow Dirichlet distributions; given the topic distribution, drawing a topic for each word, and then drawing the final word from that topic's word distribution, are both multinomial draws. That the latter are multinomial is obvious; the reason the former are Dirichlet is that the Dirichlet is the conjugate prior of the multinomial...
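Concretely, conjugacy means that a Dirichlet prior combined with multinomial count data gives a Dirichlet posterior again (a standard identity, written here in generic notation):

    \text{Dir}(\vec p \mid \vec\alpha) \times \text{Mult}(\vec n \mid \vec p)
    \;\propto\; \prod_k p_k^{\alpha_k - 1} \prod_k p_k^{n_k}
    = \prod_k p_k^{\alpha_k + n_k - 1}
    \;\propto\; \text{Dir}(\vec p \mid \vec\alpha + \vec n),

with posterior mean E[p_k] = (n_k + \alpha_k) / \sum_j (n_j + \alpha_j). This smoothed-count form is exactly the shape of the phi and theta estimates below.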
What is this good for? It lets us derive the formulas for the final parameters. The derivation is not reproduced here; the final parameter estimates appear in the code as
for (int k = 0; k < K; k++) {
    for (int t = 0; t < V; t++) {
        phi[k][t] = (nkt[k][t] + beta) / (nktSum[k] + V * beta);
    }
}
for (int m = 0; m < M; m++) {
    for (int k = 0; k < K; k++) {
        theta[m][k] = (nmk[m][k] + alpha) / (nmkSum[m] + K * alpha);
    }
}
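Written out as formulas (the same estimates in the usual notation, where n_{k,t} and n_{m,k} are the counts explained below):

    \varphi_{k,t} = \frac{n_{k,t} + \beta}{\sum_{t'=1}^{V} n_{k,t'} + V\beta},
    \qquad
    \theta_{m,k} = \frac{n_{m,k} + \alpha}{\sum_{k'=1}^{K} n_{m,k'} + K\alpha}.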
Here phi is the word distribution of each topic and theta is the topic distribution of each document. nmk[m][k] is the number of words in document m assigned to topic k, and nkt[k][t] is the number of times word t is assigned to topic k; nmkSum[m] is the total number of words in document m (over all topics), and nktSum[k] is the total number of words assigned to topic k. The topic of each word is updated in every iteration by
private int sampleTopicZ(int m, int n) {
    // Sample from p(z_i | z_-i, w) using the Gibbs update rule.
    // Remove the topic label of w_{m,n} from the counts.
    int oldTopic = z[m][n];
    nmk[m][oldTopic]--;
    nkt[oldTopic][doc[m][n]]--;
    nmkSum[m]--;
    nktSum[oldTopic]--;

    // Compute p(z_i = k | z_-i, w) for every topic k (unnormalised).
    double[] p = new double[K];
    for (int k = 0; k < K; k++) {
        p[k] = (nkt[k][doc[m][n]] + beta) / (nktSum[k] + V * beta)
             * (nmk[m][k] + alpha) / (nmkSum[m] + K * alpha);
    }

    // Sample a new topic label for w_{m,n} like a roulette wheel:
    // accumulate the unnormalised probabilities, then draw u uniformly in [0, p[K-1]).
    for (int k = 1; k < K; k++) {
        p[k] += p[k - 1];
    }
    double u = Math.random() * p[K - 1];
    int newTopic;
    for (newTopic = 0; newTopic < K; newTopic++) {
        if (u < p[newTopic]) {
            break;
        }
    }

    // Add the new topic label of w_{m,n} back to the counts.
    nmk[m][newTopic]++;
    nkt[newTopic][doc[m][n]]++;
    nmkSum[m]++;
    nktSum[newTopic]++;
    return newTopic;
}
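The unnormalised probability p[k] computed above is the collapsed Gibbs full conditional:

    p(z_i = k \mid \vec z_{\neg i}, \vec w)
    \;\propto\;
    \frac{n_{k,t}^{\neg i} + \beta}{\sum_{t'} n_{k,t'}^{\neg i} + V\beta}
    \cdot
    \frac{n_{m,k}^{\neg i} + \alpha}{\sum_{k'} n_{m,k'}^{\neg i} + K\alpha},

where the superscript ¬i means the counts with word i's current assignment removed, which is why the counts are decremented before p is computed and incremented again afterwards.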
Finally, we obtain the topic distribution of each document, the word distribution of each topic, and the topic assignment of each word in the corpus.
The most obvious application of LDA is information retrieval: compute a topic distribution for the query (or a new document), and return similar documents ranked by the distance between its topic distribution and the topic distributions of the known documents. More advanced applications are still being explored.
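As one illustration of "distance between topic distributions", a simple choice is cosine similarity over the theta vectors; this is just one reasonable measure, not something prescribed by LDA itself, and the method name is my own.

// Cosine similarity between two topic distributions, e.g. theta[a] and theta[b].
// Higher means more similar; documents can be returned in decreasing order of similarity to the query.
public static double cosineSimilarity(double[] thetaA, double[] thetaB) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int k = 0; k < thetaA.length; k++) {
        dot   += thetaA[k] * thetaB[k];
        normA += thetaA[k] * thetaA[k];
        normB += thetaB[k] * thetaB[k];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}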
The specific formulas of LDA can be found in the references. With this overall picture in mind, working through all of the derivations should not be a problem!
References:
"LDA Math Gossip"
"Parameter Estimation for Text Analysis"
ldagibbssampling-master code by Liuyang
"Talking about LDA"