Spark Machine Learning (8): LDA topic model algorithm


1. Basic knowledge of LDA

LDA (Latent Dirichlet Allocation) is a topic model. It is a three-layer Bayesian probabilistic model comprising a word layer, a topic layer, and a document layer.

LDA is a generative model that can be used to generate documents: to generate a document, first choose a topic according to some probability, then choose a word from that topic according to another probability, and repeat until the document is complete. Conversely, LDA is an unsupervised machine learning technique that can identify the topics in a large document collection or corpus.

LDA's original paper gives a very simple example. Arts, Budgets, Children, and Education are 4 topics, and each topic is associated with its own list of words (shown in a figure in the paper).

You can then repeatedly select a topic at random and then a word within that topic, generating a document; in the figure, different colors mark words drawn from different topics.

In other words, documents and words are observable, while topics are hidden.

The probability of a word appearing in a document can be expressed by the formula:

p(w|d) = Σ_{k=1..K} p(w|z_k) · p(z_k|d)

where d is the document, w is the word, z_k is a topic, and K is the number of topics. This can be viewed as three matrices:

The first matrix gives the probability of each word appearing in each document, the second gives the probability of each topic appearing in each document, and the third gives the probability of each word appearing in each topic. In machine learning terms: from the document collection we can compute the first matrix directly, and the task is to find the second and third matrices.
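This matrix relationship can be sketched numerically. The probabilities below are made-up values for illustration, not quantities learned from any real corpus:

```python
# Sketch of the matrix view of LDA (2 documents, 2 topics, 3 words;
# all numbers are illustrative, not learned values).

# Second matrix: p(topic | document)
topic_given_doc = [
    [0.8, 0.2],  # document 0
    [0.3, 0.7],  # document 1
]
# Third matrix: p(word | topic)
word_given_topic = [
    [0.5, 0.4, 0.1],  # topic 0
    [0.1, 0.2, 0.7],  # topic 1
]

# First matrix: p(word | document) = sum_k p(word | topic k) * p(topic k | document)
word_given_doc = [
    [sum(topic_given_doc[d][k] * word_given_topic[k][w] for k in range(2))
     for w in range(3)]
    for d in range(2)
]
```

Note that each row of the resulting first matrix sums to 1, as a probability distribution over words must.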

2. Maximum Likelihood estimation

The basic idea of maximum likelihood estimation is this: after drawing n samples from a population, the most reasonable parameter estimate is the one that makes this batch of samples most probable. For example, if you live in a small city where you rarely see Americans, and the few Americans you happen to meet are all tall, you may estimate that Americans are generally tall, because that assumption makes what you actually observed most likely.
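A minimal sketch of the idea, using a coin example of my own (not from the article): after seeing 7 heads in 10 flips, the maximum likelihood estimate of the head-probability is the value that maximizes the probability of that outcome.

```python
# Maximum likelihood for a coin's head-probability p after observing
# 7 heads in 10 flips: pick the p that makes the observed data most probable.
heads, flips = 7, 10

def likelihood(p):
    # probability of this exact sequence of outcomes under head-probability p
    return p ** heads * (1 - p) ** (flips - heads)

# brute-force grid search over candidate values of p
candidates = [i / 1000 for i in range(1, 1000)]
p_hat = max(candidates, key=likelihood)
# p_hat lands on 0.7, matching the analytic MLE heads / flips
```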

3. EM method

EM (expectation maximization) is one of the important algorithms of machine learning. Simply put, the EM method solves the following problem: we want to estimate two unknown parameters, A and B, such that knowing A lets us derive B and knowing B lets us derive A. We can then give A an initial value, use it to compute B, use that B to recompute A, and iterate back and forth until convergence. It can be proven mathematically that this method is effective.
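The classic two-coin illustration of this alternation (my own example, not from the article; the data and initial guesses are made up): two coins have unknown head-probabilities, and for each trial of 10 flips we do not know which coin was used. The E-step softly assigns trials to coins, and the M-step re-estimates each coin's bias from those soft counts.

```python
# Two coins with unknown head-probabilities thetaA, thetaB (the "A and B" above).
# Each trial: one coin (unknown which) is flipped 10 times; we record the heads.
trials = [5, 9, 8, 4, 7]  # heads out of 10 flips per trial (illustrative data)
n = 10

def binom_lik(heads, p):
    # likelihood of a trial's flip sequence under head-probability p
    return (p ** heads) * ((1 - p) ** (n - heads))

thetaA, thetaB = 0.6, 0.5  # initial guesses
for _ in range(50):
    headsA = tailsA = headsB = tailsB = 0.0
    # E-step: for each trial, the probability it came from coin A vs coin B
    for h in trials:
        la, lb = binom_lik(h, thetaA), binom_lik(h, thetaB)
        wa = la / (la + lb)
        wb = 1.0 - wa
        headsA += wa * h; tailsA += wa * (n - h)
        headsB += wb * h; tailsB += wb * (n - h)
    # M-step: re-estimate each coin's bias from its soft counts
    thetaA = headsA / (headsA + tailsA)
    thetaB = headsB / (headsB + tailsB)
```

After convergence, thetaA tracks the high-heads trials and thetaB the low-heads ones, even though no trial was ever labeled.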

4. Beta distribution and Dirichlet distribution

The Beta distribution is the conjugate prior of the two-outcome (binomial) distribution; its density is

f(x; a, b) ∝ x^(a-1) · (1-x)^(b-1)

For example, when tossing a coin, 3 heads and 2 tails give a=3, b=2, which yields a probability distribution concentrated around x≈0.6 (its mean is a/(a+b)=0.6), so the value of x is probably close to 0.6. After 5 more tosses with 2 heads and 3 tails, the totals become a=5, b=5, and the new distribution reaches its maximum at x=0.5, so the value of x is probably close to 0.5.
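This can be checked numerically with a small sketch (the grid search is my own illustration; note that for Beta(3, 2) the exact peak is at 2/3 while the mean is 0.6):

```python
import math

def beta_pdf(x, a, b):
    # Beta density: x^(a-1) * (1-x)^(b-1) / B(a, b)
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

grid = [i / 1000 for i in range(1, 1000)]
# after 3 heads, 2 tails: peak near 2/3, mean a/(a+b) = 0.6
peak_initial = max(grid, key=lambda x: beta_pdf(x, 3, 2))
# after 5 heads, 5 tails in total: peak exactly at 0.5
peak_updated = max(grid, key=lambda x: beta_pdf(x, 5, 5))
```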

The Dirichlet distribution is similar to the Beta distribution and generalizes it to higher dimensions; its density is

f(x_1, ..., x_K; a_1, ..., a_K) ∝ x_1^(a_1-1) · x_2^(a_2-1) · ... · x_K^(a_K-1)

For example, throw a die 60 times and suppose each of the 6 faces appears 10 times. The resulting probability distribution reaches its maximum at x = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), so the value of x is probably close to (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
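The dice claim can be verified with the closed-form mode of the Dirichlet distribution, (a_i − 1) / (Σa − K), which holds when every a_i > 1 (a standard formula, applied here to the article's example):

```python
# Dice example: 60 throws, each of the 6 faces seen 10 times.
counts = [10] * 6
a = counts              # take the Dirichlet parameters to be the counts
K = len(a)
# Mode of a Dirichlet distribution (valid when every a_i > 1):
#   x_i = (a_i - 1) / (sum(a) - K)
mode = [(ai - 1) / (sum(a) - K) for ai in a]
# each entry is 9 / 54 = 1/6, matching the intuition above
```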

5. LDA's EM algorithm

For LDA specifically, the steps of the EM method are as follows:

(1) Randomly initialize the matrices WK and KJ, where WK holds the number of occurrences of each word in each topic and KJ holds the number of occurrences of each topic in each document. Although these are just random numbers, we can still compute statistics from them: using the Dirichlet distribution, calculate the most likely probability of each word in each topic and the most likely probability of each topic in each document. These serve as initial values for the second and third matrices above;

(2) For a given word in a document, determine which topic generated it. Since more than one topic may be able to produce this word, which topic does it belong to? Here the maximum likelihood idea is needed: compute, for each topic, the probability that it produced the word,

p(z_k | w, d) ∝ p(w | z_k) · p(z_k | d)

then pick the topic with the highest probability and regard the word as generated by that topic. This is the E-step of the EM method;

(3) Once the topic of each word has been determined, the counts (the a values of the Dirichlet distributions) change, and new probability matrices (the second and third matrices above) are computed. This is the M-step of the EM method.

Repeat steps (2) and (3) until convergence to obtain the final probability matrices (the second and third matrices above), at which point learning ends.
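The steps above can be sketched as a hard-assignment EM loop over a toy corpus. Everything here is illustrative: the corpus, the smoothing constants, and the iteration count are made up, and production LDA implementations use collapsed Gibbs sampling or variational EM rather than this simplified argmax scheme:

```python
import random

# Toy corpus: each document is a list of words (made up for illustration).
docs = [
    ["hockey", "goal", "goal", "team", "hockey"],
    ["launch", "rocket", "orbit", "launch", "rocket"],
    ["team", "goal", "hockey", "team"],
    ["orbit", "rocket", "launch", "orbit"],
]
vocab = sorted({w for d in docs for w in d})
K = 2                 # number of topics
alpha = beta = 1.0    # Dirichlet pseudo-counts for smoothing

random.seed(0)
# Step (1): assign every word occurrence to a random topic.
z = [[random.randrange(K) for _ in doc] for doc in docs]

def count_matrices():
    """Rebuild WK (word-topic counts) and KJ (topic-document counts) from z."""
    wk = {w: [0] * K for w in vocab}
    kj = [[0] * K for _ in docs]
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            wk[w][z[j][i]] += 1
            kj[j][z[j][i]] += 1
    return wk, kj

for _ in range(20):
    # Step (3), M-step: turn the current counts into smoothed probabilities.
    wk, kj = count_matrices()
    topic_total = [sum(wk[w][k] for w in vocab) for k in range(K)]
    # Step (2), E-step: reassign each word to its most probable topic,
    #   p(z=k | w, d)  proportional to  p(w | z=k) * p(z=k | d)
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            scores = [
                (wk[w][k] + beta) / (topic_total[k] + beta * len(vocab))
                * (kj[j][k] + alpha) / (len(doc) + alpha * K)
                for k in range(K)
            ]
            z[j][i] = scores.index(max(scores))
```

After the loop, z holds a topic label for every word occurrence, and the count matrices WK and KJ (normalized) play the role of the second and third matrices above.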

6. Implementation of LDA in MLlib

MLlib implements LDA on top of GraphX. There are two kinds of vertices: word vertices and document vertices. Each word vertex stores a word and the probability that the word belongs to each topic; each document vertex stores a document and the probability that the document belongs to each topic. For example, with 3 words and 2 documents, "hockey" and "system" appear in Article1, while "launch" and "system" appear in Article2.

During each iteration, a document vertex updates its topic probabilities by collecting data from its neighbors, the word vertices.
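A much-simplified sketch of one such neighbor aggregation on the bipartite graph, using the 3-word, 2-document example above (this is not MLlib's actual code; the topic vectors and the plain-averaging update rule are illustrative assumptions):

```python
# Bipartite graph: an edge connects a word vertex to a document vertex
# whenever the word appears in the document.
edges = [("hockey", "Article1"), ("system", "Article1"),
         ("launch", "Article2"), ("system", "Article2")]
K = 2
# Each word vertex holds a topic-probability vector (illustrative values).
word_topics = {"hockey": [0.9, 0.1], "system": [0.5, 0.5], "launch": [0.1, 0.9]}

def update_documents(edges, word_topics):
    """Each document vertex averages the topic vectors of its word neighbors."""
    sums, deg = {}, {}
    for w, d in edges:
        acc = sums.setdefault(d, [0.0] * K)
        for k in range(K):
            acc[k] += word_topics[w][k]
        deg[d] = deg.get(d, 0) + 1
    return {d: [v / deg[d] for v in sums[d]] for d in sums}

doc_topics = update_documents(edges, word_topics)
# Article1 leans toward topic 0 (it contains "hockey"),
# Article2 toward topic 1 (it contains "launch").
```

In the real implementation, word vertices are updated from their document neighbors in the same gather-and-aggregate style, and both directions repeat each iteration.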
