(i) The role of LDA
The traditional way to judge the similarity of two documents is to count the words they have in common, as in TF-IDF. This ignores the semantic associations behind the text: two documents may share few or no words and still be similar.
For example, consider the following two sentences:
"Jobs has left us."
"Will the price of apples fall?"
The two sentences share no common words, yet they are clearly related (both concern Apple). A traditional word-overlap method would judge them as dissimilar. Judging the relevance of documents therefore needs to take their semantics into account, and mining semantics is exactly what a topic model does; LDA is one of the more effective topic models.
In a topic model, a topic represents a concept or an aspect, and corresponds to a series of related words together with the conditional probabilities of those words. Figuratively, a topic is a bucket containing words that appear with high probability; these words have a strong correlation with the topic.
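As a purely illustrative sketch (the words and numbers below are invented, not taken from any trained model), a topic can be written down as a set of conditional word probabilities:

```python
# Two hypothetical topics, each a mapping word -> p(word | topic).
# The high-probability words are the ones strongly correlated with the topic.
technology_topic = {
    "apple": 0.20, "iphone": 0.15, "jobs": 0.12, "software": 0.08, "price": 0.03,
}
fruit_topic = {
    "apple": 0.18, "price": 0.15, "fruit": 0.12, "market": 0.07, "iphone": 0.01,
}
```

Note that the same word ("apple") can appear in both buckets with different probabilities, which is how the two example sentences above can end up semantically related.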
How are topics generated? How should the topics of an article be analyzed? These are the problems that a topic model solves.
First, we can look at documents and topics through a generative model. A so-called generative model assumes that every word in an article is obtained by "selecting a topic with a certain probability, and then selecting a word from that topic with a certain probability". So, if we are going to generate a document, the probability of each word appearing in it is:

p(word | document) = Σ_topic p(word | topic) × p(topic | document)
This probability formula can be represented in matrix form:

"document-word" matrix = "document-topic" matrix × "topic-word" matrix
Where the "document-word" matrix represents the frequency of each word in each document, the probability of the occurrence; the "subject-word" matrix represents the probability of each word in each topic; The document-subject matrix represents the probability of each topic in each document.
Given a collection of documents, the "document-word" matrix on the left can be computed by segmenting the documents and counting the frequency of each word in each document. Training a topic model means learning the two matrices on the right from this left-hand matrix.
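As a rough sketch of these shapes (the toy corpus and the number of topics K = 2 are invented for illustration), the code below builds a "document-word" count matrix and shows how a "document-topic" matrix times a "topic-word" matrix yields a matrix of the same shape:

```python
import numpy as np

# Toy corpus: 3 documents, already tokenized (real text would be segmented first).
docs = [
    ["apple", "iphone", "jobs"],
    ["apple", "price", "fruit"],
    ["jobs", "apple", "iphone", "price"],
]
vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}

# "document-word" matrix: frequency of each word in each document (D x V).
D, V = len(docs), len(vocab)
doc_word = np.zeros((D, V))
for d, doc in enumerate(docs):
    for w in doc:
        doc_word[d, word_id[w]] += 1

# What training a topic model learns: a "document-topic" matrix (D x K) and a
# "topic-word" matrix (K x V).  Their product has the same shape as the
# (row-normalized) "document-word" matrix.  Here K and both matrices are
# placeholders, not learned values.
K = 2
doc_topic = np.full((D, K), 1.0 / K)      # p(topic | document)
topic_word = np.full((K, V), 1.0 / V)     # p(word | topic)
approx = doc_topic @ topic_word           # p(word | document), shape (D, V)
print(doc_word.shape, doc_topic.shape, topic_word.shape, approx.shape)
```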
There are two main kinds of topic models: pLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). This article mainly describes LDA.
(ii) Introduction to LDA
How do we generate M documents, each containing N words? The paper Latent Dirichlet Allocation describes three methods:
Method One: Unigram model
This model generates a document as follows:
For each of the N words w_n:
    Choose a word w_n ~ p(w);
where N is the number of words to generate for the document, w_n is the nth generated word, and p(w) is the distribution over words, which can be obtained statistically from a corpus, for example by counting the probability of each word appearing in a book.
This method uses the training corpus to estimate the probability distribution of words, and then generates one word at a time from that distribution; applying it repeatedly produces M documents. The graphical model is shown in the following figure.
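A minimal sketch of this generative process (the tiny corpus is made up; a real p(w) would be estimated from a much larger corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate p(w) by counting word frequencies in a (made-up) toy corpus.
corpus = "apple price apple jobs iphone price apple".split()
vocab, counts = np.unique(corpus, return_counts=True)
p_w = counts / counts.sum()               # unigram distribution p(w)

# Unigram model: each of the N words is drawn independently from p(w).
N = 5
document = rng.choice(vocab, size=N, p=p_w)
print(list(document))
```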
Method Two: Mixture of Unigrams
The disadvantage of the unigram model is that the generated text has no topic and is too simple. The mixture of unigrams method improves on it; this model generates a document as follows:
Choose a topic z ~ p(z);
For each of the N words w_n:
    Choose a word w_n ~ p(w|z);
where z is a topic and p(z) is the probability distribution over topics from which z is drawn; N and w_n are as above; p(w|z) is the distribution of w given z, which can be viewed as a K×V matrix, where K is the number of topics and V is the vocabulary size; each row is the word distribution of one topic, i.e. the probability of each word under topic z, and each word is generated from this distribution.
This method first selects a topic z; the topic z corresponds to a word distribution p(w|z), from which each word is generated. Applying the method M times produces M different documents. The graphical model is shown in the following figure:
As can be seen from the diagram, z lies outside the plate (rectangle) around w, which means that z is sampled only once for a document of N words; that is, a document is allowed only one topic. This is not very realistic, since a document usually contains multiple topics.
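A minimal sketch of the mixture of unigrams process, with an invented vocabulary and invented p(z) and p(w|z); note that the single topic z is drawn once per document:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["apple", "iphone", "jobs", "price", "fruit"]
K, V = 2, len(vocab)

# Invented parameters: p(z) over K topics and p(w|z) as a K x V matrix,
# one word distribution per topic (each row sums to 1).
p_z = np.array([0.5, 0.5])
p_w_given_z = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.05],   # word distribution of topic 0
    [0.30, 0.00, 0.00, 0.40, 0.30],   # word distribution of topic 1
])

def generate_document(N):
    # One topic z per document; all N words are drawn from that topic.
    z = rng.choice(K, p=p_z)
    word_ids = rng.choice(V, size=N, p=p_w_given_z[z])
    return [vocab[i] for i in word_ids]

print(generate_document(5))
```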
Method Three: LDA (Latent Dirichlet Allocation)
The LDA method allows a generated document to contain multiple topics. This model generates a document as follows:
Choose parameter θ ~ p(θ);
For each of the N words w_n:
    Choose a topic z_n ~ p(z|θ);
    Choose a word w_n ~ p(w|z);
where θ is a topic vector; each component of the vector is the probability of the corresponding topic appearing in the document (the vector is non-negative and sums to one); p(θ) is the distribution of θ, specifically a Dirichlet distribution, i.e. a distribution over distributions; N and w_n are as above; z_n is the selected topic; p(z|θ) is the probability distribution over topics given θ, specifically p(z = i | θ) = θ_i; and p(w|z) is as above.
This method first selects a topic vector θ, which determines the probability of each topic being chosen. Then, for each word to be generated, a topic z is drawn from the topic distribution θ, and a word is generated from the word distribution of topic z. The graphical model is shown in the following figure.
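A minimal sketch of this generative process, with invented α and β (in a real model these would be learned from a corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["apple", "iphone", "jobs", "price", "fruit"]
K, V = 2, len(vocab)

# Invented corpus-level parameters: alpha for the Dirichlet prior p(theta),
# and beta, the K x V topic-word probability matrix p(w|z).
alpha = np.array([0.5, 0.5])
beta = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.30, 0.00, 0.00, 0.40, 0.30],
])

def generate_document(N):
    theta = rng.dirichlet(alpha)          # choose parameter theta ~ p(theta)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)        # choose a topic z_n ~ p(z | theta)
        w = rng.choice(V, p=beta[z])      # choose a word  w_n ~ p(w | z)
        words.append(vocab[w])
    return words

print(generate_document(6))
```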
From the figure above, the joint probability of LDA is:

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(z_n | θ) p(w_n | z_n, β)
This equation corresponds to the diagram and can be roughly understood as follows:
As can be seen from the figure, the three representation levels of LDA are marked with three colors:
1. Corpus level (red): α and β are corpus-level parameters, i.e. they are the same for every document, so they are sampled only once in the generative process.
2. Document level (orange): θ is a document-level variable; each document has its own θ, i.e. each document has its own probabilities of producing the topics z, and θ is sampled once for each generated document.
3. Word level (green): z and w are word-level variables; z is generated from θ, w is generated from z and β, and each word w corresponds to one topic z.
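As a concrete check of the joint probability above, the following sketch evaluates p(θ | α) · Π_n p(z_n | θ) p(w_n | z_n, β) for one small, invented configuration of θ, z and w:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([0.5, 0.5])                       # Dirichlet parameter (invented)
beta = np.array([[0.40, 0.30, 0.20, 0.05, 0.05],   # p(w|z) for topic 0 (invented)
                 [0.30, 0.00, 0.00, 0.40, 0.30]])  # p(w|z) for topic 1 (invented)

theta = np.array([0.75, 0.25])                     # one particular topic vector
z = [0, 0, 1]                                      # topic chosen for each word
w = [0, 1, 3]                                      # word id chosen for each word

# p(theta, z, w | alpha, beta) = p(theta | alpha) * prod_n p(z_n | theta) * p(w_n | z_n, beta)
joint = dirichlet.pdf(theta, alpha)
for z_n, w_n in zip(z, w):
    joint *= theta[z_n] * beta[z_n, w_n]
print(joint)
```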
From the above discussion of the LDA generative model, we can see that training the LDA model on a given input corpus means learning the two control parameters α and β; once these two parameters are learned, the model is determined and can be used to generate documents. α and β correspond to the following information:
α: the vector parameter of the Dirichlet distribution p(θ), used to generate a topic vector θ;
β: the word probability distribution matrix p(w|z), with one word distribution per topic.
With w as the observed variable and θ and z as hidden variables, α and β can be learned with the EM algorithm. In this process the posterior probability p(θ, z | w) cannot be computed directly, so an approximate lower bound on the likelihood is needed. The original paper derives this lower bound with variational inference under a factorization assumption, and then applies the EM algorithm: in the E-step, α and β are held fixed and the lower bound on the likelihood is computed; in the M-step, the lower bound is maximized to update α and β. This is iterated until convergence.
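In practice one rarely implements the variational EM by hand. As a sketch, scikit-learn's LatentDirichletAllocation fits the model with a closely related (online) variational Bayes procedure; the tiny corpus and the choice K = 2 below are only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A made-up toy corpus; real applications would use many more documents.
docs = [
    "jobs left us apple iphone",
    "will the price of apples fall",
    "apple releases a new iphone",
    "fruit prices fall in the market",
]

# Build the "document-word" count matrix.
X = CountVectorizer().fit_transform(docs)

# Fit LDA with an assumed K = 2 topics; scikit-learn estimates the model with
# (online) variational Bayes, a close relative of the variational EM above.
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(X)     # per-document topic proportions (D x K)
topic_word = lda.components_         # per-topic word weights (K x V)
print(doc_topic.round(2))
```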
References:
David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993-1022, 2003.