LDA Topic Clustering Model for Natural Language Processing


Introduction to the LDA Model Algorithm:

The input to the algorithm is a collection of documents D = {D1, D2, D3, ..., DN}, together with the desired number of clusters (topics) m. The algorithm then assigns every document Di a probability for each topic, so that each document obtains a set of probabilities di = (dp1, dp2, ..., dpm); every word appearing in the documents likewise gets a probability for each topic, wi = (wp1, wp2, wp3, ..., wpm). The result is two matrices: one mapping documents to topics, and one mapping words to topics.

In this way, the LDA algorithm projects documents and words onto a shared set of topics, trying to uncover the latent relationships between documents and words, between documents, and between the words within a topic. Because LDA is an unsupervised algorithm, no conditions need to be specified for the topics in advance; after clustering, counting the probability distribution of words within each topic shows that the high-probability words of a topic describe its meaning very well.
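As a concrete illustration of these two output matrices, here is a minimal sketch using scikit-learn's LatentDirichletAllocation (the library choice and the toy corpus are my own assumptions; the article does not prescribe an implementation). fit_transform() yields the document-to-topic matrix, and the row-normalized components_ attribute plays the role of the topic-to-word matrix.

    # Minimal sketch: obtaining the document-topic and word-topic matrices
    # with scikit-learn's LDA implementation (toy corpus for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell as markets reacted to rates",
        "investors bought bonds and stocks",
    ]
    m = 2  # number of topics, chosen in advance

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                      # document-word count matrix

    lda = LatentDirichletAllocation(n_components=m, random_state=0)
    doc_topic = lda.fit_transform(X)                 # shape (N_docs, m): document -> topic
    word_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    # word_topic has shape (m, vocabulary size): topic -> word probabilities

    words = vec.get_feature_names_out()
    print(doc_topic.round(3))
    for t in range(m):
        top = word_topic[t].argsort()[::-1][:5]
        print("topic", t, [words[i] for i in top])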

LDA Model Construction Principle:

Before introducing the LDA model itself, this section first introduces the Unigram model (bag-of-words model), the Bayesian Unigram model (Bayesian bag-of-words model), and pLSA (probabilistic latent semantic analysis). They are the foundations of LDA, which is the result of combining and extending them; they are also simpler and easier to understand.

1. Unigram Model (bag-of-words model)

LDA is a clustering algorithm, and clustering is, most of the time, about measuring how similar two things are.

To decide whether two documents are similar, the simplest and most direct way is to check whether they contain the same words in similar numbers. The Unigram model (bag-of-words model) is designed around exactly this idea. To capture the common pattern of all documents in the collection, the bag-of-words model assumes that creating a document amounts to independently throwing an M-sided die n times (M is the number of all words, n is the number of words in the document), so the creation of a document can be regarded as a multinomial distribution:

        p(\vec{w}) = \frac{n!}{n_1!\, n_2! \cdots n_M!} \prod_{k=1}^{M} p_k^{n_k}

where p_k is the probability of word k and n_k is the number of times word k appears in the document (the n_k sum to n).

The probability with which each word appears in the document collection is the parameter to be estimated; it can be obtained by maximum likelihood (or the EM algorithm), which gives the model.
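A minimal sketch of this bag-of-words view, assuming the parameters are estimated directly as word frequencies (maximum likelihood); the toy corpus is invented for illustration.

    # Unigram (bag-of-words) model: estimate p(word) by maximum likelihood,
    # i.e., the relative frequency of each word over the whole collection.
    from collections import Counter
    import math

    docs = [
        "i like broccoli and bananas".split(),
        "i ate a banana for breakfast".split(),
    ]

    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}    # the M die-face probabilities

    # Log-probability of a new document under the model, ignoring the
    # multinomial coefficient (it does not affect comparisons between documents).
    def log_prob(doc):
        return sum(math.log(p[w]) for w in doc if w in p)

    print(p["i"], log_prob("i like bananas".split()))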

2. Bayesian Unigram Model (Bayesian bag-of-words model)

In the bag-of-words model we simply assume that the probability of each word appearing in a document is a constant (that is, the probability of each face of the die). The Bayesian school does not agree: they think these probabilities should themselves be generated by a random process. So the process of generating a document becomes: first draw an M-sided die at random, then throw that die n times independently. The distribution of this model is as follows:

        p(\vec{w}) = \int p(\vec{w} \mid \vec{p}) \, p(\vec{p}) \, d\vec{p}

The second factor, the multinomial distribution, we already know. To make the calculation convenient, we assume that the prior p(\vec{p}) is a Dirichlet distribution, which is the conjugate prior of the multinomial distribution.

A brief introduction to the Dirichlet distribution: for example, throw a die 100 times and record the probability of each of the 6 faces; call that one experiment. Repeat the experiment 100 times. The distribution of those 100 probability vectors over the 6 faces is a Dirichlet distribution; it is a distribution over distributions.

For example: suppose that across these 100 experiments (100 throws each), the probability that face 1 (one of the six faces) comes up with frequency 0.15 is 0.12. We can think of it this way: out of the 100 experiments, in 12 of them face 1 appeared 15 times, so over the 10,000 throws in total face 1 appeared 15 x 12 = 180 times. Treating all 10,000 throws as one large multinomial distribution, we can conclude that it has the same form of probability distribution; this is the conjugacy mentioned earlier, which has the following property:

prior Dirichlet distribution + multinomial distribution = posterior Dirichlet distribution

        \text{Dir}(\vec{p} \mid \vec{\alpha}) + \text{MultCount}(\vec{n}) = \text{Dir}(\vec{p} \mid \vec{\alpha} + \vec{n})

From the example above you will find that it is already very similar to our Bayesian Unigram model: the experiment that determines the probabilities of the die's faces can be seen as the prior Dirichlet distribution, that is, the random process in the model that picks the die; and throwing that die afterwards can be seen as the subsequent process of generating the document.

The Dirichlet distribution also has another important property: its expectation, which is commonly used as a point estimate of the parameters, is given by the following formula. The proof is somewhat involved and is not derived here:

        E(\vec{p}) = \left( \frac{\alpha_1}{\sum_{i=1}^{M} \alpha_i}, \frac{\alpha_2}{\sum_{i=1}^{M} \alpha_i}, \ldots, \frac{\alpha_M}{\sum_{i=1}^{M} \alpha_i} \right)
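These two properties can be checked numerically. Below is a small sketch with made-up numbers (the die, the prior values, and the observed counts are all invented for illustration): adding the observed multinomial counts to the Dirichlet pseudo-counts gives the posterior, and normalizing the pseudo-counts gives the expectation used as a point estimate.

    # Dirichlet-multinomial conjugacy and the expectation property, with toy numbers.
    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # prior Dirichlet for a 6-face die
    counts = np.array([15, 20, 18, 16, 14, 17])         # observed face counts (multinomial data)

    posterior = alpha + counts              # prior Dirichlet + multinomial = posterior Dirichlet
    estimate = posterior / posterior.sum()  # E(p_i) = alpha_i / sum(alpha), applied to the posterior

    print(posterior)
    print(estimate.round(3))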

3. pLSA (probabilistic latent semantic analysis)

In text clustering we often run into problems like this: one NBA news article refers to "Stone Buddha" while another refers to "Duncan", and they mean the same person even though the words are different; meanwhile, an article about education also mentions "Duncan", but that "Duncan" is not the NBA "Duncan" and probably refers to the U.S. Secretary of Education, "Arne Duncan". The two NBA articles and the education article are then quite likely to be clustered incorrectly.

From this we can see that in different semantic contexts the same word may express different meanings, and the same meaning may be expressed with different words. pLSA (probabilistic latent semantic analysis) was designed to solve exactly this problem. It adds a layer of topics between the documents and the words: it first associates documents with topics, and then finds the probability distribution of words within each topic.

The pLSA model creates a document like this: in the first step, we throw an H-sided die, where each face represents a topic and the faces have different probabilities, and obtain a topic; in the second step, that topic corresponds to a T-sided die, where each face represents a word, and we throw this die n times to obtain an article. In fact, this model can be seen as a combination of two bag-of-words models: the first is performed once to determine the topic, and the second is repeated independently n times to determine the words of the article. (A visual diagram of this generative process can be found in "LDA math gossip".)


The formula for this probability distribution is as follows:

        p(w \mid d) = \sum_{t=1}^{H} p(w \mid t) \, p(t \mid d)
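Read as matrices, this is just a product of the document-topic distribution and the topic-word distribution. A minimal sketch with made-up θ (document-topic) and φ (topic-word) matrices:

    # p(w|d) = sum_t p(w|t) p(t|d), computed for every (document, word) pair
    # as a product of a document-topic matrix and a topic-word matrix.
    import numpy as np

    theta = np.array([[0.9, 0.1],        # p(t|d): 2 documents x 2 topics
                      [0.2, 0.8]])
    phi = np.array([[0.5, 0.4, 0.1],     # p(w|t): 2 topics x 3 words
                    [0.1, 0.2, 0.7]])

    p_w_given_d = theta @ phi            # 2 documents x 3 words
    print(p_w_given_d)                   # each row sums to 1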

LDA Topic Clustering Model

Then our Bayesian friends appear again; history repeats itself. They object to pLSA in the same way: the face probabilities of its two kinds of dice (the die that produces the topic and the die that produces the words of a topic) should not be fixed, but should be drawn by a random process. Turning pLSA's two bag-of-words models into two Bayesian bag-of-words models gives LDA.

As already mentioned, the probability distribution of the Bayesian bag-of-words model is a Dirichlet-multinomial conjugate structure, so the whole physical process of LDA actually consists of two Dirichlet-multinomial conjugate structures. The parameter estimates of the LDA model then follow from the important property above, as follows:

        \hat{\varphi}_{t,w} = \frac{n_{t}^{(w)} + \beta_w}{\sum_{w=1}^{V} \left( n_{t}^{(w)} + \beta_w \right)}, \qquad \hat{\theta}_{d,t} = \frac{n_{d}^{(t)} + \alpha_t}{\sum_{t=1}^{K} \left( n_{d}^{(t)} + \alpha_t \right)}

where n_t^(w) is the number of times word w is assigned to topic t, n_d^(t) is the number of words in document d assigned to topic t, and α, β are the Dirichlet hyperparameters.

LDA algorithm design and Gibbs sampling

Algorithm steps:

1. For each document in the document collection D, perform word segmentation and filter out meaningless words to obtain the corpus set w = {w1, w2, ..., wx}.
2. Count these words to obtain p(wi|d).
3. For each wi in the corpus set, randomly assign a topic t as its initial topic.
4. Using the Gibbs sampling formula, resample the topic t of each word w and update it in the corpus, until the Gibbs sampling converges (a code sketch follows below).
After convergence we obtain the topic-word probability matrix, which is the LDA matrix; the document-topic probability matrix can also be obtained by counting, which gives the document-topic probability distribution.
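Below is a compact sketch of these steps on a toy corpus. It assumes the standard collapsed Gibbs update, in which a topic is drawn in proportion to (n_dt + α)(n_tw + β)/(n_t + Vβ); the corpus, hyperparameter values, and iteration count are all invented for illustration.

    # Collapsed Gibbs sampling for LDA on a toy corpus (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    docs_text = [
        "broccoli banana breakfast banana broccoli".split(),
        "kitten cute hamster kitten cute".split(),
        "broccoli munching hamster broccoli banana".split(),
    ]
    K, alpha, beta = 2, 0.1, 0.01                     # topics and Dirichlet hyperparameters

    vocab = sorted({w for d in docs_text for w in d})
    word_id = {w: i for i, w in enumerate(vocab)}
    docs = [[word_id[w] for w in d] for d in docs_text]
    V = len(vocab)

    # Step 3: random initial topic for every word occurrence, plus the count tables.
    z = [[rng.integers(K) for _ in d] for d in docs]
    n_dt = np.zeros((len(docs), K)) + alpha           # document-topic counts (smoothing baked in)
    n_tw = np.zeros((K, V)) + beta                    # topic-word counts (smoothing baked in)
    n_t = np.zeros(K) + V * beta                      # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

    # Step 4: resample every word's topic until (approximate) convergence.
    for _ in range(200):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1    # remove current assignment
                p = n_dt[d] * n_tw[:, w] / n_t                   # Gibbs sampling formula
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1    # record new assignment

    # Converged counts give the two matrices: topic-word and document-topic.
    phi = n_tw / n_tw.sum(axis=1, keepdims=True)      # p(word | topic)
    theta = n_dt / n_dt.sum(axis=1, keepdims=True)    # p(topic | document)
    print(theta.round(2))
    for t in range(K):
        print("topic", t, [vocab[i] for i in phi[t].argsort()[::-1][:3]])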

Gibbs Sampling formula:
The Gibbs sampling formula gives the probability of transferring between two points that lie on the same axis-parallel line in an X-dimensional space. For example, in a two-dimensional (x, y) plane, the probability of transferring from point A(x1, y1) to point B(x1, y2) is written as p(A -> B) = p(y2 | x1).

So step 4 above can be viewed as follows: the document a word belongs to and the topic it is assigned to are treated as the two coordinates of a point in a two-dimensional plane, and through the Gibbs sampling formula the word keeps transferring between topics (that is, it keeps being resampled) until convergence. "LDA math gossip" contains a diagram of the convergence of Gibbs sampling that gives a good visual impression.
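To make the axis-parallel transfer concrete, here is a tiny toy sketch of my own (not from the article): Gibbs sampling of a 2x2 discrete joint distribution by alternately resampling one coordinate from its conditional given the other.

    # Gibbs sampling on a toy 2x2 joint distribution p(x, y):
    # each move changes one coordinate, drawn from its conditional, e.g. p(y2 | x1).
    import numpy as np

    rng = np.random.default_rng(0)
    joint = np.array([[0.3, 0.1],     # p(x, y): rows index x, columns index y
                      [0.2, 0.4]])

    x, y = 0, 0
    samples = np.zeros_like(joint)
    for _ in range(20000):
        # move along the y axis: resample y ~ p(y | x)
        py = joint[x] / joint[x].sum()
        y = rng.choice(2, p=py)
        # move along the x axis: resample x ~ p(x | y)
        px = joint[:, y] / joint[:, y].sum()
        x = rng.choice(2, p=px)
        samples[x, y] += 1

    print(samples / samples.sum())    # converges toward the joint distribution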

LDA (Latent Dirichlet Allocation) Learning Notes

Example

What LDA does, simply put, is cluster a collection of documents (so it is unsupervised learning); a topic is a class, and the number of topics to cluster into is specified in advance. The clustering result is a probability, not a Boolean 100% membership in a single class. A blog post from abroad [1] gives a very clear example, which is quoted directly:

Suppose you have the following set of sentences:

    • I like to eat broccoli and bananas.
    • I ate a banana and spinach smoothie for breakfast.
    • Chinchillas and kittens are cute.
    • My sister adopted a kitten yesterday.
    • Cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It's a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

    • Sentences 1 and 2: 100% Topic A
    • Sentences 3 and 4: 100% Topic B
    • Sentence 5: 60% Topic A, 40% Topic B
    • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)
    • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)

From the result for sentence 5 above, we can see that the clustering result is distinctly probabilistic (sentences 1 and 2 just happen to come out as 100% deterministic results).

Looking at the results in the example, besides a probabilistic clustering result for each sentence, each topic also comes with representative words and their proportions. Taking topic A as an example, the meaning is that 30% of all the word occurrences assigned to topic A are the word "broccoli". In the LDA algorithm, every word in every document is mapped to a topic, so the proportions above can be computed. These words describe the topic very well, and I think this is the advantage that distinguishes LDA from traditional text clustering.
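To reproduce the spirit of this example, here is a small sketch that runs the five sentences through scikit-learn's LatentDirichletAllocation with 2 topics (the library is my own choice; the exact percentages will differ from the quoted blog, since they depend on initialization and hyperparameters).

    # The five example sentences clustered into 2 topics with scikit-learn's LDA.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    sentences = [
        "I like to eat broccoli and bananas.",
        "I ate a banana and spinach smoothie for breakfast.",
        "Chinchillas and kittens are cute.",
        "My sister adopted a kitten yesterday.",
        "Cute hamster munching on a piece of broccoli.",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(sentences)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)                 # sentence -> topic proportions
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    words = vec.get_feature_names_out()
    for t in range(2):
        top = topic_word[t].argsort()[::-1][:4]
        print(f"Topic {t}:", ", ".join(f"{words[i]} {topic_word[t, i]:.0%}" for i in top))
    print(doc_topic.round(2))                        # each row: [p(topic 0), p(topic 1)]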

LDA Overall process

First define the meaning of some letters:

    • Document collection D, topic collection T
    • Each document d in D is treated as a word sequence <w1, w2, ..., wn>, where wi denotes the i-th word and d has n words in total. (This is why LDA is called a bag-of-words model: where a word appears has no effect on the LDA algorithm.)
    • All the distinct words appearing in D make up a large vocabulary set VOC.

LDA takes the document collection D as input (with the usual preprocessing such as word segmentation, stop-word removal, and stemming, omitted here) and aims to train two kinds of result vectors (assuming the documents are clustered into k topics and VOC contains m words):

    • For each document d in D, the probabilities of corresponding to the different topics, θd = <pt1, ..., ptk>, where pti denotes the probability that d corresponds to the i-th topic in T. The calculation is intuitive: pti = nti/n, where nti is the number of words in d that correspond to the i-th topic and n is the total number of words in d.
    • For each topic t in T, the probabilities of generating the different words, φt = <pw1, ..., pwm>, where pwi denotes the probability that t generates the i-th word in VOC. The calculation is also intuitive: pwi = Nwi/N, where Nwi is the number of occurrences of the i-th word in VOC that are assigned to topic t and N is the total number of word occurrences assigned to topic t (a counting sketch follows below).
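Here is a small sketch of exactly this counting, assuming we already have a topic assignment for every word occurrence (the toy vocabulary and assignments below are invented):

    # Computing theta_d (pti = n_ti / n) and phi_t (pwi = N_wi / N) by counting
    # the topic assignments of individual word occurrences.
    import numpy as np

    vocab = ["broccoli", "banana", "kitten"]          # VOC with m = 3 words
    K = 2                                             # k topics
    # Each document is a list of (word index, assigned topic) pairs.
    docs = [
        [(0, 0), (1, 0), (1, 0), (2, 1)],
        [(2, 1), (2, 1), (0, 0)],
    ]

    theta = np.zeros((len(docs), K))
    phi = np.zeros((K, len(vocab)))
    for d, doc in enumerate(docs):
        for w, t in doc:
            theta[d, t] += 1                          # n_ti: words in d assigned to topic t
            phi[t, w] += 1                            # N_wi: occurrences of word w assigned to t

    theta /= theta.sum(axis=1, keepdims=True)         # pti = n_ti / n
    phi /= phi.sum(axis=1, keepdims=True)             # pwi = N_wi / N
    print(theta, phi, sep="\n")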

The core formula for LDA is as follows:

p(w|d) = p(w|t) * p(t|d)

Looking at this formula intuitively: with topics as the middle layer, the probability of word w appearing in document d can be obtained from the current θd and φt, where p(t|d) is computed using θd and p(w|t) is computed using φt.

In fact, using the current θd and φt, we can compute p(w|d) for a word in a document under each candidate topic, and then update that word's topic based on these results. If the update changes the word's topic, it will in turn affect θd and φt.

At the start of the LDA algorithm, θd and φt are assigned randomly (for all d and t). Then the above process is repeated over and over, and the converged result is the output of LDA.
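Below is a sketch of this iteration in its simplest form, as I read the description above (a simplified variant, not the full collapsed Gibbs sampler shown earlier): score each candidate topic of a word by p(w|t) * p(t|d) from the current matrices, resample the word's topic from those scores, then recount θ and φ. The toy word ids and iteration count are invented.

    # Simplified LDA iteration as described above: reassign each word's topic using
    # scores p(w|t) * p(t|d), then recount theta and phi.
    import numpy as np

    rng = np.random.default_rng(0)
    docs = [[0, 1, 1, 0], [2, 2, 3, 2], [0, 3, 2, 1]]    # word ids per document
    V, K = 4, 2

    z = [[rng.integers(K) for _ in d] for d in docs]     # random initial topics

    def count(docs, z):
        theta = np.full((len(docs), K), 1e-9)            # tiny value avoids division by zero
        phi = np.full((K, V), 1e-9)
        for d, doc in enumerate(docs):
            for w, t in zip(doc, z[d]):
                theta[d, t] += 1
                phi[t, w] += 1
        return (theta / theta.sum(axis=1, keepdims=True),
                phi / phi.sum(axis=1, keepdims=True))

    for _ in range(50):                                  # repeat until (approximately) converged
        theta, phi = count(docs, z)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                scores = phi[:, w] * theta[d]            # p(w|t) * p(t|d) for every topic t
                z[d][i] = rng.choice(K, p=scores / scores.sum())

    theta, phi = count(docs, z)
    print(theta.round(2), phi.round(2), sep="\n")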

Article reposted from: 21186353
