Latent Dirichlet Allocation

Topic Model

LDA is a topic model. There is a popular example used to explain what a topic model does:

The first sentence: "Jobs has left us."
The second sentence: "Will the price of apples fall?"

We can see at a glance that these two sentences are related: because the first sentence mentions "Jobs", we naturally read "apples" in the second sentence as Apple's products. Both sentences belong to the same topic: the company Apple.

Relevance measures like the one I computed before assume that the more words two documents share, the more similar they are. That is not necessarily true: relevance often depends on the semantic link behind the words, that is, the topic being described, rather than on superficial word overlap.

LDA differs from that approach. The methods above work directly on the relationship between words and documents, while LDA inserts a layer of topics between them, giving the path document → topic → word.

Overview of LDA

Many articles about LDA are written with an academic mindset, deriving the entire LDA process rigorously from mathematical formulas. It took me a long time to work through that path: with so much mathematical deduction involved, it is easy to get stuck in formula derivations and lose the overall picture of LDA.

The blogger July divides LDA into five parts:
- One function: the Gamma function
- Four distributions: the binomial, multinomial, Beta, and Dirichlet distributions
- One concept and one framework: the conjugate prior and the Bayesian framework
- Two models: pLSA and LDA
- One sampling method: Gibbs sampling

Indeed, to fully clarify the whole context, none of these five parts can be skipped. Here I try to reverse the process and explain, from an engineering point of view, why LDA needs these five parts, starting from Gibbs sampling.

The great statistical simulation

What does the LDA model want to obtain? The document-topic probability distribution and the topic-word probability distribution. How do we get them? The intuitive approach is:

Document-topic: (number of words in the document assigned to topic k_i) / (total number of words in the document)
Topic-word: (number of times word w_i is assigned to topic k_j) / (total number of words assigned to topic k_j)
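
Written as formulas (the count notation here is mine, not from the original text): let n_{m,k} be the number of words in document m assigned to topic k, and n_{k,t} the number of times word t is assigned to topic k. The intuitive count estimates are then:

```latex
\theta_{m,k} = \frac{n_{m,k}}{\sum_{k'} n_{m,k'}}, \qquad
\varphi_{k,t} = \frac{n_{k,t}}{\sum_{t'} n_{k,t'}}
```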

The problem is that, for a document set, it is very hard to know which topic each word belongs to, so the formulas above cannot be computed directly. How do we solve this? This is where statistical simulation comes in.

"Lda math Gossip," the article in the creation process of the image interpretation into the game of God Roll dice, and in the book image of the description of several simulation process: Unigram Model, coupled with Bayesian Unigram Model,plsa, coupled with Bayesian pLSA. Here are some of the main steps I'll look at:

Unigram Model

The simplest Unigram Model assumes God creates an article with the following rules:

1: God has only one die. The die has V faces, each face corresponds to a word, and the probability of each face is different.
2: Each throw of the die produces the word on the face that comes up. So if an article has n words, the article is the result of God throwing the die n times independently.

Define the probability of each face of the die as p = (p1, p2, p3, ..., pV). Then the probability of generating a document D = (w1, w2, w3, ..., wn) is p(D) = p(w1) p(w2) ... p(wn).

Documents are assumed to be independent of each other, so for a document set W = (W1, W2, ..., WM) the probability of the whole set is p(W) = p(W1) p(W2) ... p(WM).

Assume the total number of words in the document set is N, and focus on the number of occurrences n_i of each word. Then n = (n1, n2, ..., nV) follows a multinomial distribution: p(n) = N! / (n1! n2! ... nV!) * p1^n1 p2^n2 ... pV^nV.

Our task is to estimate the word distribution from this simulation process, that is, the probability of each face of the die. Maximum likelihood estimation maximizes the probability of the observed data, which gives the parameter estimate p_i = n_i / N.

Through this simulation process we obtain the document-word probabilities, i.e. p = (p1, p2, ..., pV).
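
As a quick illustration (a minimal sketch with a made-up toy corpus, not code from the original post), the maximum likelihood estimate is just the normalized word counts:

```python
from collections import Counter

# Toy corpus: each document is a list of words (made-up example data).
documents = [
    ["apple", "price", "fall", "apple"],
    ["jobs", "apple", "company"],
]

# Pool all words, count occurrences, and normalize by the total count N.
counts = Counter(word for doc in documents for word in doc)
N = sum(counts.values())
p = {word: n / N for word, n in counts.items()}

print(p)  # e.g. {'apple': 0.428..., 'price': 0.142..., ...}
```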

On the understanding of maximum likelihood

Suppose there is a black box filled with balls of different colors. We take a ball out of the box at random, write down its color, put it back, and repeat this 100 times. Finally we count how many times we drew a red ball; if it is 80, we are fairly confident that the proportion of red balls in the box is about 80%. That is the idea of maximum likelihood.
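
A minimal simulation of this thought experiment (the true proportion of 0.8 is an assumption chosen to match the example above):

```python
import random

TRUE_RED_RATIO = 0.8   # unknown to the observer; chosen to match the example
DRAWS = 100

# Draw with replacement and count how often we see red.
red_count = sum(1 for _ in range(DRAWS) if random.random() < TRUE_RED_RATIO)

# The maximum likelihood estimate is simply the observed frequency.
print("estimated red ratio:", red_count / DRAWS)
```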

The Unigram Model with a Bayesian prior

The Bayesian school thinks it is unreasonable for God to have only one die; they believe God has an infinite number of dice and creates an article under the following rules:

1: God has a big jar containing many dice, each with V faces.
2: God first takes one die out of the jar, then throws it repeatedly; the results of the throws make up the article.

This differs from the Unigram Model in that God has many dice, and these dice follow a probability distribution. In the Bayesian framework we then have the following relation:

Prior distribution * likelihood (data) = posterior distribution

So how do we choose the prior distribution of the dice here? Since the likelihood here is a multinomial distribution, and the Dirichlet distribution is exactly the conjugate prior of the multinomial, we take the prior of the die to be a Dirichlet distribution. The Bayesian relation then becomes:

Dirichlet distribution (prior) * multinomial data (likelihood) = posterior Dirichlet distribution

Given the prior distribution Dir(p | α) on the parameter p, and word counts following a multinomial distribution n ~ Mult(n | p, N), it is easy to show that the posterior distribution is Dir(p | α + n).

Our main task is to estimate the probability distribution over the V faces of the die. Since we now have the posterior distribution of the parameters, a reasonable choice is either the mode of the posterior or its mean; here we take the posterior mean, shown below.
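
For a Dirichlet posterior Dir(p | α + n), the mean of each component is the standard result:

```latex
E[p_i \mid \vec{n}] = \frac{n_i + \alpha_i}{\sum_{k=1}^{V} \left( n_k + \alpha_k \right)}
```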

pLSA Model

The previous two models are groundwork; now we finally get to the statistical simulation of a topic model. In pLSA (probabilistic Latent Semantic Analysis), God is assumed to create an article like this:

1: God has two kinds of dice. One kind is the doc-topic die, which has K faces, each face corresponding to a topic. The other kind is the topic-word die; there are K of them, each with V faces, one face per word.
2: God first throws the doc-topic die to get a topic number k_i, then picks topic-word die number k_i and throws it to get a word.
3: If the article has n words, the previous step is repeated n times (the resulting word distribution for a document is written out after this list).
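
Under this process, the probability of a word w appearing in document d is the usual pLSA mixture over topics (standard form, with z denoting the topic):

```latex
p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)
```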

pLSA with a Bayesian prior (LDA)

Just as with the Unigram Model, the Bayesian school believes God's doc-topic and topic-word dice themselves have prior distributions. Adding these two Dirichlet priors to pLSA gives exactly our LDA model. The process of creating an article now becomes (a code sketch of this process follows the list):

1: First randomly draw K topic-word dice, numbered 1 to K.
2: When creating a new article, randomly draw a doc-topic die, then repeat the following step to generate every word in the article.
3: Throw the doc-topic die to get a topic number z, then pick topic-word die number z and throw it to generate a word.
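
A minimal sketch of this generative process in Python (vocabulary size, topic count, hyperparameters, and document length are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 10           # vocabulary size and number of topics (illustrative)
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters (illustrative)
doc_length = 50

# Step 1: draw K topic-word dice from a Dirichlet prior.
phi = rng.dirichlet([beta] * V, size=K)      # shape (K, V)

# Step 2: for a new document, draw one doc-topic die.
theta = rng.dirichlet([alpha] * K)           # shape (K,)

# Step 3: for each word, throw the doc-topic die, then the chosen topic-word die.
words = []
for _ in range(doc_length):
    z = rng.choice(K, p=theta)               # topic number
    w = rng.choice(V, p=phi[z])              # word id drawn from topic z's die
    words.append(w)

print(words[:10])
```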

So how do we compute the document-topic and the topic-word probability distributions? Analogously to the Bayesian Unigram Model above, both are posterior-mean estimates built from counts plus the Dirichlet hyperparameters; the two formulas are given below.

In the notation used here, n_m^(k) is the number of words in article m assigned to topic k, n_k^(t) is the number of times word t is assigned to topic k, the vector z holds the topic assignments, and the vector w holds the words.
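
These are the standard Dirichlet-smoothed count estimates (the usual form in LDA derivations; the count notation above is mine):

```latex
\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_m^{(k')} + \alpha_{k'} \right)}, \qquad
\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left( n_k^{(t')} + \beta_{t'} \right)}
```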

Note:
I have omitted a lot of the formula derivation here; most of the derivations I have not read through myself.

Alpha and beta in the formulas are called hyperparameters; they are the parameters of the Dirichlet distributions. Their effect can be summarized as follows:
the smaller alpha is, the more each document concentrates on a single topic; the smaller beta is, the more each word tends to belong to a single topic.
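
A quick way to see this effect (a small illustrative experiment, not from the original post) is to sample from Dirichlet distributions with small and large concentration parameters and compare how peaked the samples are:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# A small concentration parameter yields sparse, peaked topic mixtures;
# a large one yields nearly uniform mixtures.
sparse = rng.dirichlet([0.1] * K)
smooth = rng.dirichlet([10.0] * K)

print("alpha = 0.1 :", np.round(sparse, 3))   # mass piles onto one or two topics
print("alpha = 10  :", np.round(smooth, 3))   # mass spread roughly evenly
```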

Gibbs Sampling

Now that the whole simulation process is clear, how do we get a computer to actually carry it out? Gibbs sampling does exactly this. But before formally describing Gibbs sampling, there is one more preliminary: MCMC (Markov Chain Monte Carlo).

Markov chain

The mathematical definition of a Markov chain is relatively simple:

The distribution of the next state depends only on the current state, not on the earlier history.

It has a very important property: if the transition matrix satisfies the detailed balance condition (together with the usual ergodicity conditions), the Markov chain eventually converges. The distribution it converges to depends only on the transition matrix and is independent of the initial distribution, which makes Markov chains suitable for generating samples from a given probability distribution.

Note:

Detailed balance condition: the probability of being in state A and moving to state B equals the probability of being in state B and moving to state A, i.e. π(A) P(A → B) = π(B) P(B → A).

Transforming the Markov chain: MCMC

For a given probability distribution p(x), we want a convenient way to generate samples from it. Since a Markov chain converges to a stationary distribution, a very beautiful idea is the following: if we can construct a Markov chain whose transition matrix has p(x) as its stationary distribution, then starting from any initial state x0 and transitioning along the chain, we obtain a sequence x0, x1, x2, ..., xn, xn+1, ...; if the chain has converged by step n, then xn, xn+1, ... are samples from p(x).

Because the Markov chain converges to the desired distribution only if detailed balance holds for p(x), the MCMC method introduces an acceptance rate α so that detailed balance is enforced: p(i) q(i, j) α(i, j) = p(j) q(j, i) α(j, i), where q is the original proposal transition matrix.

How should α be chosen to make this equation hold? The simplest choice, by symmetry, is α(i, j) = p(j) q(j, i) and α(j, i) = p(i) q(i, j).

With this choice the detailed balance condition is satisfied.

So we start from the transition matrix Q of an ordinary Markov chain and transform it into a chain with transition matrix Q' that satisfies detailed balance; the stationary distribution of the Q' chain is then exactly p(x).

Because the acceptance rate α can be very small, proposed jumps are often rejected and convergence becomes slow. The fix is to scale α(i, j) and α(j, i) up proportionally until the larger of the two reaches 1. Applying this small modification of the acceptance rate to the basic MCMC algorithm gives the most common Metropolis-Hastings algorithm.
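
A minimal Metropolis-Hastings sketch in Python, targeting a standard normal with a symmetric random-walk proposal (the target and proposal are illustrative choices, not from the original post):

```python
import math
import random

def target_pdf(x):
    """Unnormalized density of the target distribution (standard normal here)."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    samples, x = [], x0
    for _ in range(n_samples):
        # Symmetric random-walk proposal, so q(x->y) = q(y->x) cancels out.
        y = x + random.uniform(-step, step)
        accept = min(1.0, target_pdf(y) / target_pdf(x))
        if random.random() < accept:
            x = y
        samples.append(x)
    return samples

samples = metropolis_hastings(10000)
print("sample mean:", sum(samples) / len(samples))  # should be close to 0
```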

Gibbs Sampling

Gibbs sampling follows the same MCMC idea, except that even with the enlarged acceptance rate the rejected jumps can make sampling slow. Gibbs sampling achieves an acceptance rate of 1 in a very convenient way: it only makes transitions along one coordinate direction at a time. For example, take a probability distribution p(x, y) and consider two points with the same x-coordinate, A(x1, y1) and B(x1, y2).

Then p(x1, y1) p(y2 | x1) = p(x1) p(y1 | x1) p(y2 | x1) = p(x1, y2) p(y1 | x1), which is exactly the detailed balance condition between A and B when the transition probability along the line x = x1 is taken to be p(y | x1).

Note that transitions along the line x = x1 therefore satisfy detailed balance automatically; no acceptance-rate parameter is needed. The same argument shows that transitions along the y direction, using p(x | y), also satisfy detailed balance.
The Gibbs sampler thus alternates between moves along the x-axis and moves along the y-axis during sampling.
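
A minimal two-dimensional Gibbs sampler sketch for a standard bivariate normal with correlation rho, whose conditional distributions are known in closed form (the target distribution is an illustrative choice):

```python
import random

RHO = 0.8        # correlation of the illustrative bivariate normal target
N_SAMPLES = 10000

def gibbs_bivariate_normal(n_samples, rho):
    """Alternate x- and y-moves using the exact conditionals of a standard
    bivariate normal: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y."""
    x, y = 0.0, 0.0
    samples = []
    cond_std = (1.0 - rho * rho) ** 0.5
    for _ in range(n_samples):
        x = random.gauss(rho * y, cond_std)   # move along the x-axis
        y = random.gauss(rho * x, cond_std)   # move along the y-axis
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(N_SAMPLES, RHO)
mean_x = sum(s[0] for s in samples) / len(samples)
print("sample mean of x:", mean_x)  # should be close to 0
```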

Application of the Gibbs sampling algorithm to documents

From the above we know that once the prior distributions (Dirichlet distributions) of a document set are fixed, Gibbs sampling can simulate the document generation process. The end result of the derivation of LDA's Gibbs sampling formula is the per-word sampling rule given below.
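
This is the standard collapsed Gibbs sampling conditional for LDA (the notation matches the count symbols used above; the subscript ¬i means the counts are computed with the current word i excluded; word i is word t in document m):

```latex
p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \;\propto\;
\left( n_{m,\neg i}^{(k)} + \alpha_k \right) \cdot
\frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left( n_{k,\neg i}^{(t')} + \beta_{t'} \right)}
```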

The algorithm is as follows (a code sketch follows the list):

1: Because the stationary distribution of a Markov chain is independent of the initial distribution, we can randomly assign a topic z to each word w.
2: Rescan the document set and resample the topic of each word using the Gibbs sampling formula above.
3: Repeat the previous step until the sampling converges.
4: Count the document set to obtain the doc-topic and topic-word probabilities.
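
A compact sketch of this loop in Python (a toy collapsed Gibbs sampler written from the description above; the corpus, K, the iteration count, and the symmetric alpha/beta values are illustrative choices):

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in range(V)."""
    rng = random.Random(seed)
    V = max(w for doc in docs for w in doc) + 1

    # Count tables: n_mk[m][k], n_kt[k][t], n_k[k]; z holds each word's topic.
    n_mk = [[0] * K for _ in docs]
    n_kt = [[0] * V for _ in range(K)]
    n_k = [0] * K
    z = []

    # Step 1: random initialization of topic assignments.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(K)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kt[k][w] += 1
            n_k[k] += 1
        z.append(z_m)

    # Steps 2-3: repeatedly resample each word's topic from its full conditional.
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts.
                n_mk[m][k] -= 1; n_kt[k][w] -= 1; n_k[k] -= 1
                # p(z_i = k | rest) ∝ (n_mk + alpha) * (n_kt + beta) / (n_k + V*beta)
                weights = [(n_mk[m][kk] + alpha) *
                           (n_kt[kk][w] + beta) / (n_k[kk] + V * beta)
                           for kk in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1; n_kt[k][w] += 1; n_k[k] += 1

    # Step 4: turn the final counts into doc-topic and topic-word estimates.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for m, doc in enumerate(docs)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
           for k in range(K)]
    return theta, phi

# Toy usage with a tiny made-up corpus of word ids.
docs = [[0, 1, 2, 0], [3, 4, 3, 5], [0, 2, 1, 1]]
theta, phi = lda_gibbs(docs, K=2)
print(theta[0])
```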

Summary

In the early days of learning LDA I wondered what parameters this model actually takes as input. The answer: roughly the number of topics in the document set, plus the hyperparameters alpha and beta, which determine the priors of the document-topic and topic-word distributions. The rule of thumb is as follows:

The smaller alpha is, the more each document concentrates on a single topic; the smaller beta is, the more each word tends to belong to a single topic.

How to choose the number of topics and how to set the hyperparameters are things I am still learning ...

Of the five parts listed by July, I have not touched on the Gamma function, the binomial distribution, or the Beta distribution. The multinomial distribution used above for the prior is the multi-dimensional generalization of the binomial distribution, so I skipped that step. The Beta distribution is the conjugate prior of the binomial distribution, and the relationship between the Beta and binomial distributions mirrors that between the Dirichlet and multinomial distributions.

The Gamma function generalizes the factorial: Γ(n) = (n - 1)! for positive integers, and more generally Γ(x) = ∫0^∞ t^(x-1) e^(-t) dt.

It can therefore be used to compute the "factorial" of non-integers. The probability formulas of the binomial, Beta, multinomial, and Dirichlet distributions can all be written in terms of the Gamma function, and the Gamma terms cancel out during the derivation of the Gibbs sampling formula. So, for the purpose of grasping LDA as a whole, I have not described that background here.

In the next part I plan to write up my implementation of the LDA model, along with principles for choosing the number of topics and metrics for evaluating the model.

Reference

[1] LDA Math Gossip
[2] CSDN, July: Popular Understanding of LDA
[3] Blog Park, entropy: Topic Model Explained
[4] Blog Park, ywl925: MCMC Random Simulation
