http://hi.baidu.com/flyer_hit/blog/item/2ec12d251dd9dd6835a80f55.html
http://blog.csdn.net/feixiangcq/archive/2010/06/06/5650672.aspx
http://fan.cos.name/cn/2010/10/fan16/
http://hi.baidu.com/flyer_hit/blog/item/84d29a733c7751148701b089.html
LDA is a topic model that is more "advanced" than PLSA. Where does the "advanced" come in? It is a hierarchical Bayes model.
A hierarchical Bayes model simply treats the model parameters themselves as random variables, so we can introduce parameters that control the parameters. Admittedly, this sounds a bit circular.
A generic topic model is
P(w|d) = Σ_z P(w|z) · P(z|d)
A topic, plainly speaking, is just a unigram language model, nothing special; in the formula above it is P(w|z).
A topic model generally involves two kinds of distributions. The first is the topic~word distribution, P(w|z).
The second is the doc~topic distribution, P(z|d).
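To make the two distributions concrete, here is a minimal numpy sketch that computes P(w|d) = Σ_z P(w|z)·P(z|d) for a toy corpus; all matrices and numbers are invented for illustration.

# Minimal sketch of the topic-model mixture P(w|d) = sum_z P(w|z) * P(z|d).
# All numbers are made up for illustration.
import numpy as np

# topic~word distributions: each row is P(w|z), rows sum to 1
topic_word = np.array([[0.5, 0.3, 0.1, 0.1],
                       [0.1, 0.1, 0.4, 0.4]])

# doc~topic distributions: each row is P(z|d), rows sum to 1
doc_topic = np.array([[0.9, 0.1],
                      [0.5, 0.5],
                      [0.2, 0.8]])

# P(w|d) for every document: marginalize over the topic z
word_given_doc = doc_topic @ topic_word   # shape (n_docs, n_words)
print(word_given_doc)                     # each row again sums to 1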
With these two distributions in hand, the document collection takes on a three-dimensional feel; close your eyes and picture it:

doc
 |
 +-----------+-----------+ ... +-----------+
 |           |                             |
topic_1    topic_2         ...         topic_m

and

topic_i
 |
 +-----------+-----------+ ... +-----------+
 |           |                             |
word_1     word_2          ...         word_n
The three-layer document representation space now leaps off the page.
The top layer is what people often call "dimensionality reduction": the documents are projected into the "topic" space.
doc ~ topic ~ word
This Bayes chain captures the basic idea of LDA.
PLSA is in fact the same chain, so what is the difference between it and LDA?
The biggest difference lies at the doc~topic level: PLSA treats all the variables at this level as model parameters, so there are as many sets of parameters as there are documents, whereas LDA introduces a hyperparameter to model the doc~topic level. This way, no matter how many documents there are, the outermost layer of the model exposes only one hyperparameter [for doc~topic].
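In symbols, the contrast looks roughly as follows (this is a summary in standard notation, not taken from the original post: θ_d denotes the doc~topic weights, α the Dirichlet hyperparameter, β the topic~word parameters):

\text{PLSA:}\quad \theta_d \text{ is a free parameter for every document } d,
\qquad p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\,\theta_{d,z}

\text{LDA:}\quad \theta_d \sim \operatorname{Dirichlet}(\alpha),
\qquad p(w_1,\dots,w_{N_d} \mid \alpha,\beta)
  = \int \operatorname{Dir}(\theta_d \mid \alpha)\,
    \prod_{n=1}^{N_d} \sum_{z=1}^{K} \theta_{d,z}\, p(w_n \mid z,\beta)\, d\theta_d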
So what prior should we add?
The most basic PLSA and LDA both model doc~topic and topic~word with multinomial distributions. To keep the computation convenient and give the prior a clean interpretation, the first choice is a conjugate prior. The conjugate prior of the multinomial distribution is the Dirichlet distribution, a very nice distribution. This is also where the "Dirichlet" in latent Dirichlet allocation comes from.
The Dirichlet prior is a heavyweight among priors:
The prior used in Bayesian prior smoothing is also a Dirichlet, because the unigram language model is likewise a multinomial.
The prior introduced for PLSA is also a Dirichlet. So what is so good about it that everyone is so fond of it? We all know that it makes the computation simple. Now let's talk about the neat implicit idea behind it:
Bayesian prior smoothing:
P(w|d) = [ c(w,d) + μ · P(w|C) ] / [ |d| + μ ]
The maximum-likelihood estimate is
P_ml(w|d) = c(w,d) / |d|
After smoothing, the numerator becomes c(w,d) + μ · P(w|C) (originally c(w,d)).
After smoothing, the denominator becomes |d| + μ (originally |d|).
The neat part lies exactly in these differences:
It is as if the document contains μ extra words, and among these μ pseudo words, word w appears μ · P(w|C) times.
This is the pseudo-count idea. Once you understand this, adding a prior to PLSA needs no further derivation: just add these extra prior words to the counts and everything works out.
So remember, this is what a prior does!
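A tiny numeric sketch of the pseudo-count view; the vocabulary, counts, collection model and μ below are all invented for illustration.

# Bayesian (Dirichlet) prior smoothing of a document language model:
# P(w|d) = (c(w,d) + mu * P(w|C)) / (|d| + mu)
# Equivalent view: add mu pseudo words to the document, mu*P(w|C) of them being w.
import numpy as np

vocab = ["topic", "model", "prior", "word"]
counts = np.array([3.0, 1.0, 0.0, 2.0])      # c(w,d), invented counts
collection = np.array([0.4, 0.3, 0.2, 0.1])  # P(w|C), invented collection model
mu = 5.0

doc_len = counts.sum()                       # |d|
p_ml = counts / doc_len                      # maximum-likelihood estimate
p_smooth = (counts + mu * collection) / (doc_len + mu)

for w, ml, sm in zip(vocab, p_ml, p_smooth):
    print(f"{w:6s}  ML={ml:.3f}  smoothed={sm:.3f}")
# "prior" now gets nonzero probability even though c(prior, d) = 0.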
How do we infer the LDA parameters?
There are two main approaches: the variational inference of the original authors, and Gibbs sampling.
I am more familiar with Gibbs sampling. You can search online for the GibbsLDA source code.
Once you have worked through the derivation, the code is very simple.
One of the biggest advantages of Gibbs sampling is that it is easy to understand. The detailed explanation is omitted here.
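For concreteness, here is a rough sketch of the collapsed Gibbs sampling update commonly used for LDA. This is not the GibbsLDA source itself; the toy corpus, variable names and hyperparameter values are invented, and the update follows the standard form P(z_i = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ).

# Rough sketch of collapsed Gibbs sampling for LDA (toy data, invented hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 2, 3, 3]]  # word ids per document
V, K = 4, 2                                        # vocabulary size, number of topics
alpha, beta = 0.5, 0.1                             # Dirichlet hyperparameters

# count tables and random initial topic assignments
n_dk = np.zeros((len(docs), K))   # topic counts per document
n_kw = np.zeros((K, V))           # word counts per topic
n_k = np.zeros(K)                 # total words per topic
z = [[rng.integers(K) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for it in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]           # remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k           # add the new assignment back
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)  # doc~topic
phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)      # topic~word
print(theta); print(phi)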
Back to the topic above:
In the hierarchy doc ~ topic ~ word, LDA adds a prior to (doc ~ topic). How does it then exploit this prior? Through exchangeability. Exchangeability here means conditionally independent and identically distributed; note the difference from i.i.d.: the word "conditional".
In LDA, this means the topic draws are i.i.d. once the hyperparameter is given; what can be derived from there... you will need to work through the paper yourself.
Once doc ~ topic has a prior, the draws of the different topics within a document become completely independent of one another (given that prior).
This is one of the beautiful aspects of hierarchical models.
worker
 |
 +-----------+-----------+ ... +-----------+
 |           |                             |
product_1  product_2       ...        product_m
For example, once a worker's production capability is fixed, all the products he produces are conditionally independent and identically distributed.
This is actually a beautiful assumption to make when we lack information: since we know nothing more, we assume that, given the parent node, the things below it are conditionally independent and identically distributed.
Here is another vivid example. If you are lazy, you let your socks pile up and wash them all in one go, and hanging that many socks out to dry is a hassle. What to do? Merchants are quick on their feet: they invented a gadget with a hook on top and a big rotating disc below for hanging the socks, so they can all dry... Exchangeability, then, says that for identical socks, once the hook on top is fixed, swapping the socks hanging below changes nothing.
Plain (unconditional) independence is a stronger assumption: the whole disc needs no hook on top at all; it can be hung at any point of the probability space and the overall form does not change.
Well, that is enough about socks.
Note that a prior is also introduced on topic~word, mainly to guard against new words appearing in the test stage (which would otherwise get zero probability).
As for the prior on topic~word, you may have guessed it already: it is again a Dirichlet distribution.
LDA is a three-level Bayesian probability model with a word, topic, and document structure.
The document-to-topic weights follow a Dirichlet distribution, and topic-to-word follows a multinomial distribution.
LDA places a Dirichlet prior on the topic mixture weights θ (note: over the topic dimension); a hyperparameter α is used to generate the parameter θ.
LDA is an unsupervised machine-learning technique that can be used to identify hidden topic information in large-scale document collections or corpora. It uses the bag-of-words approach, which treats each document as a word-frequency vector, turning text into numeric information that is easy to model. The bag-of-words approach, however, ignores word order, which both simplifies the problem and leaves room for improving the model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over many words. Because the components of a Dirichlet-distributed random vector are only weakly correlated (there is some correlation, since the components must sum to 1), the latent topics we assume end up almost uncorrelated, which does not match many real problems; this is another open issue of LDA.
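As an aside, the bag-of-words step in practice is just a document-term count matrix. A minimal sketch with scikit-learn (the toy documents are invented; a reasonably recent scikit-learn is assumed):

# Bag of words: each document becomes a word-frequency vector (word order is discarded).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]            # invented toy documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                           # rows = documents, columns = word counts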
For each document in the corpus, LDA defines the following generative process (a small simulation sketch follows the list):
1. Draw a topic from the document's topic distribution;
2. Draw a word from the word distribution of the chosen topic;
3. Repeat this process until every word position in the document has been filled.
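Here is a small simulation of that generative process; the sizes and hyperparameter values are invented for illustration.

# A small simulation of LDA's generative process (toy sizes, invented hyperparameters).
import numpy as np

rng = np.random.default_rng(1)
K, V, n_docs, doc_len = 3, 8, 4, 10          # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.1                       # Dirichlet hyperparameters

phi = rng.dirichlet([beta] * V, size=K)      # topic~word distributions, one per topic
corpus = []
for d in range(n_docs):
    theta = rng.dirichlet([alpha] * K)       # doc~topic distribution for this document
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)           # 1. draw a topic from the document's topics
        w = rng.choice(V, p=phi[z])          # 2. draw a word from that topic's words
        words.append(w)                      # 3. repeat for every word position
    corpus.append(words)
print(corpus)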
More formally, each document in the corpus corresponds to a multinomial distribution over K topics (K is fixed in advance, for example by repeated experiments), denoted θ. Each topic in turn corresponds to a multinomial distribution over the words of the vocabulary, denoted φ. The vocabulary consists of all distinct words in all documents of the corpus, though in practice stop words should be removed and stemming applied. θ and φ have Dirichlet priors with hyperparameters α and β respectively. For each word in a document, a topic z is drawn from the document's multinomial θ, and then a word w is drawn from the multinomial φ of that topic. Repeating this process N times generates the document, where N is the total number of words in the document. This generative process can be represented by the following graphical model:
This graphical-model representation is also called plate notation. Shaded circles in the figure denote observed variables, unshaded circles denote latent variables, arrows denote the conditional dependency between two variables, and boxes (plates) denote repeated sampling, with the number of repetitions given in the lower-right corner of the box.
The model has two sets of parameters to infer: one is the "document-topic" distributions, the other the "topic-word" distributions. By learning these two sets of parameters, we can find the topics a document's author is interested in and the proportion of each topic covered by every document. The main inference methods are the variational EM algorithm proposed by the LDA authors and the widely used Gibbs sampling method.
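In practice, the variational route is available in standard libraries. A minimal sketch using scikit-learn's variational-Bayes implementation (the toy documents and parameter values are invented for illustration):

# Fitting LDA with scikit-learn's variational Bayes implementation
# (toy documents and parameter choices are for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apple banana fruit juice", "banana fruit smoothie",
        "football match goal team", "team wins the football game"]
X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,        # number of topics K
                                doc_topic_prior=0.5,   # alpha
                                topic_word_prior=0.1,  # beta
                                random_state=0)
doc_topic = lda.fit_transform(X)                       # "document-topic" proportions
topic_word = lda.components_                           # unnormalized "topic-word" weights
print(doc_topic.round(2))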
The LDA model is now a standard tool in topic modeling. Since its birth, LDA has been extended in many directions, especially in research on social networks and social media.
Prerequisites
With this background in hand, the original paper is much easier to follow.
- The notation p(x|y). Note that the y on the right of the bar can be a random variable (that has taken a specific value) or an ordinary non-random quantity. With this convention we can move between maximum-likelihood estimation and Bayesian methods without changing the notation. For example, consider a Gaussian distribution p(x) with definite but unknown parameters μ and Σ; it can be written p(x|μ, Σ). Alternatively we can treat μ and Σ as random variables, and then the distribution of x is the conditional distribution p(x|μ, Σ) after μ and Σ have taken particular values. Either way, the notation is unified.
- The 1-of-K distribution / multinomial distribution. Consider a random variable x taking three discrete values, x ~ p(x). This seemingly unremarkable distribution is the so-called 1-of-K (here K = 3) or multinomial distribution. We usually write it as p(x_i) = u_i, i = 1, 2, 3, with u_1 + u_2 + u_3 = 1. In some mathematical derivations, however, it is more convenient to write it in exponential form: regard x as a three-dimensional random vector whose components are mutually exclusive, i.e. it can only take the values (1, 0, 0), (0, 1, 0), (0, 0, 1). The distribution can then be rewritten as p(x) = u_1^{x_1} · u_2^{x_2} · u_3^{x_3}. Note that in the original paper, "multinomial" means this 1-of-K distribution, which differs from the definition in some probability textbooks. The general K-dimensional case is analogous. See section 2.2 of [Bishop].
- Conjugate prior. Consider a probability density p(x|t) whose parameter t we need to estimate. Following the Bayesian school, the parameter is a random variable t ~ p(t). We have p(t|x) ∝ p(x|t) p(t). This formula says: before making any observation, we use the prior distribution p(t) to represent our knowledge of t; after observing x, we use the formula to update the prior p(t) to the posterior p(t|x), thereby increasing our knowledge of t. If p(t) and p(x|t) have the same functional form, then the posterior p(t|x) has the same functional form as the prior p(t): the posterior has the same expression as the prior, only with updated parameters! Even better, this posterior can serve as the prior for the next observation, so as we continue to observe x_2, x_3, ..., the parameters of p(t) keep being updated while the functional form of p(t) stays unchanged. See section 2.2 of [Bishop].
This is also a point on which the Bayesian school is criticized: the prior is sometimes chosen merely for mathematical convenience, rather than to accurately reflect our prior knowledge.
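A small numeric sketch of the conjugate update for the 1-of-K case above (the prior parameters and counts are invented): with a Dirichlet(α) prior and multinomial counts n, the posterior is simply Dirichlet(α + n).

# Conjugate prior in action: Dirichlet prior + multinomial counts -> Dirichlet posterior.
# (Prior parameters and observed counts are invented for illustration.)
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior parameters for u_1, u_2, u_3
counts = np.array([5, 2, 3])             # observed counts of the three outcomes

posterior = alpha + counts               # posterior is Dirichlet(alpha + counts)
posterior_mean = posterior / posterior.sum()
print(posterior, posterior_mean)

# The posterior can serve as the prior for the next batch of observations:
more_counts = np.array([0, 4, 1])
posterior2 = posterior + more_counts     # same functional form, updated parameters
print(posterior2)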
- Dirichlet distribution. Now we can say that the Dirichlet distribution is the conjugate prior of the 1-of-K distribution. If the K-dimensional random vector θ follows a Dirichlet distribution, then θ's K components θ_1, θ_2, ..., θ_K are continuous non-negative values with θ_1 + θ_2 + ... + θ_K = 1. For the detailed expression of the Dirichlet distribution, see section 2.2 of [Bishop].
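A quick numeric check of these properties (K = 3, with invented parameters): samples from a Dirichlet are non-negative and sum to 1.

# Samples from a Dirichlet distribution: non-negative components that sum to 1.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.dirichlet([2.0, 3.0, 5.0], size=4)   # invented parameters, K = 3
print(samples)
print(samples.sum(axis=1))                          # every row sums to 1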
- Simplex. A two-dimensional example: the line segment with endpoints (0, 1) and (1, 0) is a simplex. A three-dimensional example: the inside of the triangle with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1) is a simplex. Higher dimensions are analogous. Consider θ following a Dirichlet distribution. Note that the K components θ_1, θ_2, ..., θ_K are continuous non-negative values with θ_1 + θ_2 + ... + θ_K = 1, so the Dirichlet distribution is defined on a simplex. This is the meaning of the triangle in Figure 2 of the original paper (the K = 3 case, with the simplex triangle laid flat on the horizontal plane). See section 2.2 of [Bishop].
- Graphical models. Graphs are used to represent the dependencies among random variables. A web search turns up plenty of tutorials; I recommend section 8.1 of [Bishop]. Knowing a few symbols (hollow circles for latent variables, solid/shaded circles for observed variables, and boxes for repetition counts) is enough to understand Figures 1 and 3 in the original paper. At most, also read section 8.2 of [Bishop].
- EM. There are plenty of tutorials on this, but I find section 9.2 of [Bishop] the most concise and the easiest to follow mathematically (some tutorials use piles of summation signs and still demand close attention at the key steps). In addition, section 9.4 of [Bishop] is also worth reading; it helps with understanding other material such as variational inference.
- Variational inference is an approximate method for computing posterior probabilities. Consider random variables {x, z}, where x is the observed variable and z = {z_1, z_2} are the hidden variables. The key step in the EM method or in Bayesian reasoning is computing the posterior p(z|x). Unfortunately, in some complex problems p(z|x) has no closed-form expression and must be approximated. There are many related methods; one commonly used approach rests on a factorization assumption: p(z|x) ≈ p(z_1|x) p(z_2|x), that is, z_1 and z_2 are forced to be conditionally independent, and the derivation proceeds from there.
Of course, this assumption introduces error. Consider a two-dimensional Gaussian p(z|x) = p(z_1, z_2|x) in which z_1 and z_2 are not independent; the contour lines of p(z_1, z_2|x) are then concentric ellipses, and the ellipses can be tilted arbitrarily (for example, if the linear correlation coefficient between z_1 and z_2 is 1, the ellipse is tilted at 45°). Write p(z_1|x) = q_1(z_1) and p(z_2|x) = q_2(z_2); we want to vary q_1 and q_2 so that q_1 · q_2 fits p(z_1, z_2|x). However, no matter how q_1 and q_2 are varied, the elliptical contours of q_1 · q_2 remain parallel to the z_1 and z_2 axes! Still, suitable q_1 and q_2 can make the peak of q_1 · q_2 coincide with that of p(z|x), which is usually enough to solve practical problems. For more information, see chapter 10 of [Bishop] and section 1.8 of [Winn].
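To make the last point concrete, here is a small numeric sketch based on the standard mean-field result for a Gaussian (the covariance below is invented): the factorized approximation recovers the correct means, but its variances come from the diagonal of the precision matrix and therefore understate the true marginal variances.

# Mean-field approximation of a correlated 2-D Gaussian p(z1, z2 | x) = N(mu, Sigma).
# Standard result: q_i(z_i) = N(mu_i, 1/Lambda_ii), where Lambda = inv(Sigma).
# (The covariance below is invented for illustration.)
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])            # strongly correlated components
Lambda = np.linalg.inv(Sigma)             # precision matrix

q_means = mu                              # the factorized q matches the true means
q_vars = 1.0 / np.diag(Lambda)            # variances of q_1 and q_2
true_marginal_vars = np.diag(Sigma)

print("q means:           ", q_means, "  true means:", mu)
print("q variances:       ", q_vars)               # ~0.19 each: too narrow
print("true marginal vars:", true_marginal_vars)   # 1.0 each
# The peak of q_1*q_2 coincides with the peak of p, but its contours are axis-aligned
# and much tighter than the true (tilted) elliptical contours.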