LDA was a task my advisor assigned in early October. Every time I picked up "LDA Math Gossip" and saw the formulas derived up front, it felt like a hard problem, and I put it off until the end of October. This weekend I spent two days and finally understood LDA; it is in fact a very simple model, so don't be frightened by the mathematical formulas that come before it. Of course, as a beginner, if I have misunderstood anything, criticism and corrections are welcome.
Unlike "LDA Math Gossip", I want to start from the model itself.
Suppose I have M articles, built from a vocabulary of V distinct words. Each word may belong to a different topic, and the total number of topics is K (we do not know which topic each word belongs to). Now I want to derive the topic distribution of a new article from the existing corpus, where this new article is also made up of some of those V words.
Following the usual machine learning approach, we need a model: its parameters are estimated from the existing corpus, and the model is then used to generate the topic distribution of a new article.
This is unsupervised learning, and the parameters of the model are estimated using the stationary distribution of a Markov chain. In a Markov chain, the probability of the current state depends only on the previous state, not on the initial state; after transitioning through the transition matrix for enough steps, the chain converges to a fixed distribution, called the stationary distribution of the Markov chain. If the stationary distribution is P(x), then the sequence of states visited after the chain has converged is a set of samples from P(x). This is the famous MCMC. Gibbs sampling is an optimization of MCMC: it raises the acceptance probability α in MCMC to 1 and restricts state transitions to moves along the coordinate axes, resampling each of the n coordinates in turn; after convergence, the samples obtained are samples from P(x1, x2, ..., xn).
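As a quick reminder of the mechanics (a minimal sketch in standard notation, not the derivation from the references): a stationary distribution \pi of a chain with transition kernel P satisfies

    \pi = \pi P, \qquad \pi(x)\,P(x \to x') = \pi(x')\,P(x' \to x),

where the second equation is the detailed balance condition, a sufficient condition for stationarity. One Gibbs sweep resamples each coordinate from its full conditional given all the others:

    x_k \sim P(x_k \mid x_1, \dots, x_{k-1}, x_{k+1}, \dots, x_n), \qquad k = 1, \dots, n.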
LDA uses exactly this kind of Gibbs sampling. First, the topic distribution of each article and the word distribution of each topic are set to random values; then sampling proceeds as described above until convergence, which yields the word distribution of each topic. (The topics' word distributions are shared across all documents, so they serve as the model parameters for generating the topic distribution of a new article.) With the model parameters in hand, the new article's topic distribution is set to a random initial value, the topics' word distributions are held fixed, and sampling proceeds in the same way; once Gibbs sampling converges, the topic distribution of the new article is obtained.
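Put as code, the training phase looks roughly like the sketch below. The count arrays and sampleTopicZ match the code shown later in this post; the method name trainGibbs, the iteration count, and the initialization details are my own assumptions, not part of the referenced implementation.

public void trainGibbs(int iterations) {
    // Randomly assign a topic to every word and build the count tables.
    // doc[m][n] is the vocabulary index of the n-th word of document m.
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < doc[m].length; n++) {
            int k = (int) (Math.random() * K);   // random initial topic in [0, K)
            z[m][n] = k;
            nmk[m][k]++;
            nkt[k][doc[m][n]]++;
            nmkSum[m]++;
            nktSum[k]++;
        }
    }
    // Repeatedly resample the topic of every word until (approximate) convergence.
    for (int it = 0; it < iterations; it++) {
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < doc[m].length; n++) {
                z[m][n] = sampleTopicZ(m, n);
            }
        }
    }
    // After convergence, phi and theta are estimated from the counts (see below).
}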
That is all there is to LDA! Of course, it may still be unclear exactly how the sampling is done. This is where the gamma function, the beta distribution, the Dirichlet distribution, the binomial and multinomial distributions, and their conjugacy come in.
In the generative process, the topic distribution of an article and the word distribution of each topic both follow Dirichlet distributions; given the topic distribution, drawing a topic for each word, and then drawing the final word from that topic's word distribution, are both multinomial draws. That the latter are multinomial is obvious; the reason the former are Dirichlet is that the Dirichlet is the conjugate prior of the multinomial...
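Concretely, conjugacy means that a Dirichlet prior combined with multinomial count data gives a Dirichlet posterior again (a standard identity, written here in generic notation):

    \text{Dir}(\vec p \mid \vec\alpha) \times \text{Mult}(\vec n \mid \vec p)
    \;\propto\; \prod_k p_k^{\alpha_k - 1} \prod_k p_k^{n_k}
    = \prod_k p_k^{\alpha_k + n_k - 1}
    \;\propto\; \text{Dir}(\vec p \mid \vec\alpha + \vec n),

with posterior mean E[p_k] = (n_k + \alpha_k) / \sum_j (n_j + \alpha_j). This smoothed-count form is exactly the shape of the phi and theta estimates below.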
What is this good for? It lets us derive the formulas for the final parameters. The derivation is not reproduced here; the final parameter estimates appear in the code as
for (int k = 0; k < K; k++) {
    for (int t = 0; t < V; t++) {
        phi[k][t] = (nkt[k][t] + beta) / (nktSum[k] + V * beta);
    }
}
for (int m = 0; m < M; m++) {
    for (int k = 0; k < K; k++) {
        theta[m][k] = (nmk[m][k] + alpha) / (nmkSum[m] + K * alpha);
    }
}
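Written out as formulas (the same estimates in the usual notation, where n_{k,t} and n_{m,k} are the counts explained below):

    \varphi_{k,t} = \frac{n_{k,t} + \beta}{\sum_{t'=1}^{V} n_{k,t'} + V\beta},
    \qquad
    \theta_{m,k} = \frac{n_{m,k} + \alpha}{\sum_{k'=1}^{K} n_{m,k'} + K\alpha}.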
Here phi is the word distribution of each topic and theta is the topic distribution of each document. nmk[m][k] is the number of words in document m assigned to topic k, and nkt[k][t] is the number of times word t is assigned to topic k; nmkSum[m] is the total number of words in document m (over all topics), and nktSum[k] is the total number of words assigned to topic k. The topic of each word is updated in every iteration by
private int sampleTopicZ(int m, int n) {
    // Sample from p(z_i | z_-i, w) using the Gibbs update rule.
    // Remove the topic label of w_{m,n} from the counts.
    int oldTopic = z[m][n];
    nmk[m][oldTopic]--;
    nkt[oldTopic][doc[m][n]]--;
    nmkSum[m]--;
    nktSum[oldTopic]--;

    // Compute p(z_i = k | z_-i, w) for every topic k (unnormalised).
    double[] p = new double[K];
    for (int k = 0; k < K; k++) {
        p[k] = (nkt[k][doc[m][n]] + beta) / (nktSum[k] + V * beta)
             * (nmk[m][k] + alpha) / (nmkSum[m] + K * alpha);
    }

    // Sample a new topic label for w_{m,n} like a roulette wheel:
    // accumulate the unnormalised probabilities, then draw u uniformly in [0, p[K-1]).
    for (int k = 1; k < K; k++) {
        p[k] += p[k - 1];
    }
    double u = Math.random() * p[K - 1];
    int newTopic;
    for (newTopic = 0; newTopic < K; newTopic++) {
        if (u < p[newTopic]) {
            break;
        }
    }

    // Add the new topic label of w_{m,n} back to the counts.
    nmk[m][newTopic]++;
    nkt[newTopic][doc[m][n]]++;
    nmkSum[m]++;
    nktSum[newTopic]++;
    return newTopic;
}
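The unnormalised probability p[k] computed above is the collapsed Gibbs full conditional:

    p(z_i = k \mid \vec z_{\neg i}, \vec w)
    \;\propto\;
    \frac{n_{k,t}^{\neg i} + \beta}{\sum_{t'} n_{k,t'}^{\neg i} + V\beta}
    \cdot
    \frac{n_{m,k}^{\neg i} + \alpha}{\sum_{k'} n_{m,k'}^{\neg i} + K\alpha},

where the superscript ¬i means the counts with word i's current assignment removed, which is why the counts are decremented before p is computed and incremented again afterwards.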
Finally, we obtain the topic distribution of each document, the word distribution of each topic, and the topic assignment of each word in the corpus.
The most obvious application of LDA is information retrieval: compute a topic distribution for the query (or a new document), and return similar documents ranked by the distance between its topic distribution and the topic distributions of the known documents. More advanced applications are still being explored.
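As one illustration of "distance between topic distributions", a simple choice is cosine similarity over the theta vectors; this is just one reasonable measure, not something prescribed by LDA itself, and the method name is my own.

// Cosine similarity between two topic distributions, e.g. theta[a] and theta[b].
// Higher means more similar; documents can be returned in decreasing order of similarity to the query.
public static double cosineSimilarity(double[] thetaA, double[] thetaB) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int k = 0; k < thetaA.length; k++) {
        dot   += thetaA[k] * thetaB[k];
        normA += thetaA[k] * thetaA[k];
        normB += thetaB[k] * thetaB[k];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}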
The specific formulas of LDA can be found in the references. With this overall picture in mind, working through all of the derivations should not be a problem!
References:
"LDA Math Gossip"
"Parameter Estimation for Text Analysis"
ldagibbssampling-master code by Liuyang
"Talking about LDA"