pLSA is a probabilistic statistical model, and the EM algorithm is used to learn its parameters.
The probabilistic graphical model of pLSA is as follows:
where d_i represents a document, z_k represents a latent class or topic, and w_j is an observed word; P(d_i) is the probability of selecting document d_i, P(w_j|d_i) is the probability of word w_j appearing in document d_i, P(z_k|d_i) is the probability of topic z_k under document d_i, and P(w_j|z_k) is the probability of word w_j under a given topic z_k. Each topic follows a multinomial distribution over all terms, and each document follows a multinomial distribution over all topics.
The entire document generation process is:
(1) Select a document d_i with probability P(d_i);
(2) Select a topic z_k with probability P(z_k|d_i);
(3) Generate a word w_j with probability P(w_j|z_k).
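To make the generative story concrete, here is a minimal sampling sketch in Python (the parameter values, the vocabulary and the variable names are all made up for illustration; they are not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, made-up parameters: 2 documents, 2 topics, a 4-word vocabulary.
vocab = ["apple", "banana", "cpu", "gpu"]
p_d = np.array([0.5, 0.5])                     # P(d_i): document selection
p_z_given_d = np.array([[0.9, 0.1],            # P(z_k|d_i): topic mixture per document
                        [0.2, 0.8]])
p_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],  # P(w_j|z_k): word distribution per topic
                        [0.1, 0.1, 0.4, 0.4]])

def generate_word():
    """Sample one (document, word) pair following the pLSA generative process."""
    d = rng.choice(len(p_d), p=p_d)                          # (1) select a document with P(d_i)
    z = rng.choice(p_z_given_d.shape[1], p=p_z_given_d[d])   # (2) select a topic with P(z_k|d_i)
    w = rng.choice(len(vocab), p=p_w_given_z[z])             # (3) generate a word with P(w_j|z_k)
    return d, vocab[w]

print([generate_word() for _ in range(5)])
```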
What we can actually observe is the document-word pair (d_i, w_j), while z_k is a hidden (latent) variable.
The joint distribution is
P(d_i, w_j) = P(d_i) P(w_j|d_i),  where  P(w_j|d_i) = Σ_k P(w_j|z_k) P(z_k|d_i),
and the distributions P(w_j|z_k) and P(z_k|d_i) correspond to two sets of multinomial distributions whose parameters we need to estimate. The detailed derivation for estimating the pLSA parameters with the EM algorithm is given below.
Estimate parameters in pLSA by EM
As described in the post on parameter estimation for text language models (maximum likelihood estimation, MAP and Bayesian estimation), the commonly used parameter estimation methods are MLE, MAP, Bayesian estimation, and so on.
But in pLSA, if we try to estimate the parameters directly with MLE, we get the log-likelihood function
L = Σ_i Σ_j n(d_i, w_j) log P(d_i, w_j)
  = Σ_i Σ_j n(d_i, w_j) log [ P(d_i) Σ_k P(w_j|z_k) P(z_k|d_i) ]
(P(d_i) is an irrelevant constant here.)
Here n(d_i, w_j) is the number of times word w_j appears in document d_i.
Note that this is a function of P(w_j|z_k) and P(z_k|d_i); there are N*K + M*K independent variables in total (N documents, M terms, K topics). If we differentiate with respect to these variables directly, we find that because they sit inside a logarithm of a sum, the resulting equations are very hard to solve. So we use the EM algorithm, which is designed for estimating the parameters of probabilistic models that contain hidden variables or missing data.
The steps of the EM algorithm are:
(1) E-step: compute the posterior probability of the hidden variable given the current estimates of the parameters.
(2) M-step: maximize the expectation of the complete-data log-likelihood, using the posterior probabilities of the hidden variable computed in the E-step, to obtain the new parameter values.
The two steps are iterated until convergence.
[In pLSA the observed, incomplete data is the pair (d_i, w_j), the hidden variable is the topic z_k, and the complete data is the triple (d_i, w_j, z_k).]
For our pLSA parameter estimation problem:
In the E-step, the posterior probability of the hidden variable under the current parameter values is computed directly with Bayes' formula:
P(z_k | d_i, w_j) = P(w_j|z_k) P(z_k|d_i) / Σ_{l=1..K} P(w_j|z_l) P(z_l|d_i)
In this step all P(w_j|z_k) and all P(z_k|d_i) are assumed known: at the start they are assigned randomly, and in later iterations they take the values obtained in the M-step of the previous round.
In the M-step, we maximize the expectation of the complete-data log-likelihood. The expectation is
E[l_c] = Σ_i Σ_j n(d_i, w_j) Σ_k P(z_k|d_i, w_j) log [ P(w_j|z_k) P(z_k|d_i) ]
Note that n(d_i, w_j) is known, and P(z_k|d_i, w_j) is the estimate obtained in the E-step above. Maximizing this expectation is another multivariate extremum problem, and we can use the Lagrange multiplier method, which turns a constrained extremum problem into an unconstrained one. In pLSA the objective function is E[l_c] and the constraints are
Σ_{j=1..M} P(w_j|z_k) = 1,   Σ_{k=1..K} P(z_k|d_i) = 1.
Thus we can write the Lagrangian function
H = E[l_c] + Σ_k τ_k ( 1 - Σ_j P(w_j|z_k) ) + Σ_i ρ_i ( 1 - Σ_k P(z_k|d_i) )
This is a function of P(w_j|z_k) and P(z_k|d_i). Taking the partial derivative with respect to each of them and setting it to zero, we get
Σ_i n(d_i, w_j) P(z_k|d_i, w_j) - τ_k P(w_j|z_k) = 0,   1 ≤ j ≤ M, 1 ≤ k ≤ K
Σ_j n(d_i, w_j) P(z_k|d_i, w_j) - ρ_i P(z_k|d_i) = 0,   1 ≤ i ≤ N, 1 ≤ k ≤ K
(each obtained after multiplying both sides of the derivative equation by the corresponding parameter). Combining these equations with the two constraint equations, we can solve for the new parameter values of the M-step:
P(w_j|z_k) = Σ_i n(d_i, w_j) P(z_k|d_i, w_j) / Σ_m Σ_i n(d_i, w_m) P(z_k|d_i, w_m)
P(z_k|d_i) = Σ_j n(d_i, w_j) P(z_k|d_i, w_j) / n(d_i),   where n(d_i) = Σ_j n(d_i, w_j).
The key to solving the equations is to eliminate the multipliers τ_k and ρ_i first; summing each equation over j (respectively over k) and using the constraints turns the coefficients into the normalizers above, after which the rest of the calculation is easy.
Then, using the updated parameter values, we go back to the E-step and compute the posterior probability of the hidden variable under the current parameter estimates. We iterate like this until a termination condition is met.
Notice that the M-step still uses MLE on the complete data, so if we want to add prior knowledge to the model we can use MAP estimation in the M-step instead. Just as in the post on parameter estimation for text language models (MLE, MAP and Bayesian estimation), where the prior "a coin generally has two sides" is encoded into the distributions of the coin, the resulting parameter estimates simply add pseudo-counts derived from the prior parameters to the numerator and denominator; the other steps stay the same. You can refer to Qiaozhu Mei's notes for details.
Implementing pLSA is not difficult, and many implementations are available online.
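As one reference point, here is a minimal NumPy sketch of the EM updates derived above; the function name, matrix layout and the toy term-document counts are my own choices for illustration, not an optimized or canonical implementation:

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=100, seed=0):
    """EM for pLSA.  n_dw[i, j] = n(d_i, w_j), the count of word j in document i."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    # Random initialization of P(w_j|z_k) (K x M) and P(z_k|d_i) (N x K).
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z_k|d_i,w_j) proportional to P(w_j|z_k) P(z_k|d_i); shape (N, M, K).
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12

        # M-step: re-estimate both multinomials from the expected counts n(d_i,w_j) P(z_k|d_i,w_j).
        weighted = n_dw[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T                 # (K, M): sum over documents, then normalize
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)                   # (N, K): sum over words, then normalize
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d

# Tiny made-up term-document count matrix: 4 documents, 6 terms.
counts = np.array([[5, 3, 0, 0, 1, 0],
                   [4, 4, 1, 0, 0, 0],
                   [0, 0, 4, 5, 0, 2],
                   [0, 1, 3, 4, 0, 3]], dtype=float)
p_w_z, p_z_d = plsa_em(counts, K=2)
print(np.round(p_z_d, 2))   # per-document topic mixtures P(z_k|d_i)
```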
4 Estimate parameters in a simple mixture Unigram language model by EM
In the pLSA parameter estimation we used the EM algorithm. EM is commonly used for parameter estimation problems in models that involve "missing data" or "hidden variables". The two concepts are related: when a model contains a hidden variable, the observed data are "incomplete" because the value of the hidden variable cannot be observed; conversely, when our data are incomplete, we can model the "missing data" by introducing a hidden variable.
To deepen our understanding of the EM algorithm, let us look at how to use it to estimate the parameters of a simple mixture unigram language model. This section mainly follows Prof. Zhai's notes on the EM algorithm.
4.1 Maximum likelihood estimation and the introduction of a hidden variable
The so-called unigram language model discards all contextual information: the probability of a word is assumed to be independent of its position, and its graphical model can be seen in the post on LDA and Gibbs sampling. What is a mixture model? Informally, a mixture probability model is a new probability model formed as a linear combination of basic probability distributions such as the normal distribution or the multinomial distribution; for example, a Gaussian mixture model is a linear combination of K Gaussian distributions. Exactly which "component model" generated a given data point is hidden from us. Here we assume the mixture model contains two multinomial component models: a background word generation model and a topic (keyword) generation model. Note that this kind of composition is very common in probabilistic language models, for example the background-word and topic-word multinomials used in TwitterLDA, or the global-topic and personal-topic multinomials used in TimeUserLDA. To indicate which component a word is generated from, we add a Boolean control variable to each word.
The log-likelihood function for a document is
log p(d_i) = Σ_{j=1..|d_i|} log [ λ_B p(w_j|θ_B) + (1 - λ_B) p(w_j|θ_d) ]
where w_j is the j-th word of document d_i, and λ_B is the parameter giving the proportion of background words in the document, usually set by experience. The background distribution p(w|θ_B) is therefore treated as known, and we only need to estimate p(w|θ_d).
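As a small sketch of this objective (the vocabulary, the distributions and λ_B below are made-up toy values, not from the original note):

```python
import math

# Made-up background and topic word distributions over a tiny vocabulary.
p_background = {"the": 0.5, "is": 0.3, "text": 0.1, "mining": 0.1}   # p(w|theta_B), known
p_topic      = {"the": 0.1, "is": 0.1, "text": 0.4, "mining": 0.4}   # p(w|theta_d), to be estimated
lambda_b = 0.7   # proportion of background words, fixed by experience

def doc_log_likelihood(words):
    """log p(d) = sum_j log[ lambda_B * p(w_j|theta_B) + (1 - lambda_B) * p(w_j|theta_d) ]."""
    return sum(math.log(lambda_b * p_background[w] + (1 - lambda_b) * p_topic[w])
               for w in words)

print(doc_log_likelihood(["the", "text", "mining", "is", "the"]))
```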
As before, we first try to estimate the parameters by maximum likelihood, i.e. we look for the parameter value that maximizes the likelihood function:
θ_d* = argmax_{θ_d} Σ_j log [ λ_B p(w_j|θ_B) + (1 - λ_B) p(w_j|θ_d) ]
This is a function of p(w|θ_d), which, just as before, appears inside a logarithm of a sum. It is therefore difficult to maximize directly: applying the Lagrange multiplier method, you will find that the equations obtained by setting the partial derivatives to zero are hard to solve. So we have to rely on numerical algorithms, and the EM algorithm is one of the most commonly used.
We introduce a Boolean variable z for each word to indicate whether the word is a background word or a topic word: z = 1 means the word is generated by the background model, and z = 0 means it is generated by the topic model.
Here we assume that the "complete data" includes not only all the observable words but also the hidden variable z attached to each word. According to the EM algorithm, the complete-data log-likelihood is
log p(d, z) = Σ_j [ z_j log( λ_B p(w_j|θ_B) ) + (1 - z_j) log( (1 - λ_B) p(w_j|θ_d) ) ]
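Continuing the toy sketch above (reusing p_background, p_topic and lambda_b from it), the complete-data version with a made-up assignment of z might look like:

```python
def doc_complete_log_likelihood(words, z):
    """Complete-data log-likelihood: z[j] = 1 if w_j is a background word, 0 if a topic word."""
    return sum(math.log(lambda_b * p_background[w]) if zj == 1
               else math.log((1 - lambda_b) * p_topic[w])
               for w, zj in zip(words, z))

print(doc_complete_log_likelihood(["the", "text", "mining"], [1, 0, 0]))
```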
Comparing log p(d) and log p(d, z), the summation over the two components now happens outside the logarithm, because through the setting of the control variable z we know exactly whether each word is produced by the background word distribution or by the topic word distribution. What is the relationship between the two likelihoods? If the parameter to estimate is θ, the original data is X, and each raw data point is assigned a hidden variable H, then
L(θ) = log p(X|θ) = log Σ_H p(X, H|θ),    L_c(θ) = log p(X, H|θ),
i.e. the incomplete-data likelihood is obtained from the complete-data likelihood by summing out the hidden variable.
4.2 The lower bound analysis of the likelihood function
The basic idea of the EM algorithm is to start from randomly initialized values of the parameters to be estimated and then search for better parameter values through the E-step/M-step iteration, where a better parameter value is one that makes the likelihood function larger. Suppose θ is a potentially better parameter value and θ^(n) is the parameter estimate obtained in the M-step of the n-th iteration. The difference between the likelihood functions at the two parameter values, written in terms of the complete-data likelihood, satisfies
L(θ) - L(θ^(n)) = log Σ_H p(X, H|θ) - log p(X|θ^(n)).
The goal of finding a better parameter value is to maximize L(θ), which is equivalent to maximizing L(θ) - L(θ^(n)). Computing the conditional probability distribution of the hidden variable given the current data X and the currently estimated parameter value, p(H|X, θ^(n)), and taking the expectation with respect to it, the difference decomposes as
L(θ) - L(θ^(n)) = Σ_H p(H|X, θ^(n)) log p(X, H|θ)
                - Σ_H p(H|X, θ^(n)) log [ p(X|θ^(n)) p(H|X, θ^(n)) ]
                + Σ_H p(H|X, θ^(n)) log [ p(H|X, θ^(n)) / p(H|X, θ) ].
The third item on the right is the relative entropy D( p(H|X, θ^(n)) || p(H|X, θ) ), which is always non-negative. So we have
L(θ) ≥ Σ_H p(H|X, θ^(n)) log p(X, H|θ) + L(θ^(n)) - Σ_H p(H|X, θ^(n)) log [ p(X|θ^(n)) p(H|X, θ^(n)) ].
This gives a lower bound on the incomplete-data likelihood at the potentially better parameter value. Note in particular that the second and third terms on the right are constants, because they do not contain θ. So, up to a constant, the lower bound on the incomplete-data likelihood is the expectation of the complete-data log-likelihood; this is the Q function that appears in many EM handouts, and its expression is
Q(θ; θ^(n)) = Σ_H p(H|X, θ^(n)) log p(X, H|θ).
It can be seen that this expectation equals the complete-data likelihood multiplied by the conditional probability of the corresponding hidden-variable value and then summed over all values of the hidden variable. For the problem we want to solve, the Q function is
Q(θ_d; θ_d^(n)) = Σ_j [ p^(n)(z_j = 0 | w_j) log( (1 - λ_B) p(w_j|θ_d) ) + p^(n)(z_j = 1 | w_j) log( λ_B p(w_j|θ_B) ) ]
A few more remarks about the Q function. When the control variable z is 0, the word is a topic word generated from the multinomial p(w|θ_d); when z is 1, the word is a background word generated from the multinomial p(w|θ_B). We can also see in what sense the Q function is the expectation of the complete-data likelihood, i.e. the expectation that the EM algorithm maximizes: the hidden variables take their different values with probabilities determined by the observed data X and the parameter estimates of the previous round, the different values of the hidden variables correspond to different complete-data likelihood values, and the expectation we want to compute is the expected value of the complete-data likelihood over these different hidden-variable configurations.
4.3 General steps of the EM algorithm
From the analysis in Section 4.2, we know that if in the next iteration we can find a better parameter value θ^(n+1) such that
Q(θ^(n+1); θ^(n)) ≥ Q(θ^(n); θ^(n)),
then correspondingly we also have L(θ^(n+1)) ≥ L(θ^(n)). So the general steps of the EM algorithm are as follows:
(1) Randomly initialize the parameter values, or initialize them using any prior knowledge about the range of the optimal parameters.
(2) Iteratively search for better parameter values in two steps:
(a) E-step (expectation): compute the Q function;
(b) M-step (maximization): search for a better parameter value by maximizing the Q function.
(3) The algorithm stops when the likelihood function converges.
Here we need to consider how to help the EM algorithm find the global optimum rather than a local optimum. The first method is to try many different initial parameter values and then choose the best among the resulting estimates; the second is to determine the initial values of the complex model through a simpler model, for example one that has a unique global maximum. A sketch of the first method follows.
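For example, reusing the plsa_em sketch and the toy counts from above (the log-likelihood helper here is my own addition for the comparison), a simple multi-restart loop might look like:

```python
import numpy as np

def plsa_loglik(n_dw, p_w_z, p_z_d):
    """Incomplete-data log-likelihood: sum_ij n(d_i,w_j) log sum_k P(w_j|z_k) P(z_k|d_i)."""
    p_w_d = p_z_d @ p_w_z                         # (N, M): P(w_j|d_i)
    return float((n_dw * np.log(p_w_d + 1e-12)).sum())

# Run EM from several random initializations and keep the fit with the highest likelihood.
best_p_w_z, best_p_z_d = max((plsa_em(counts, K=2, seed=s) for s in range(10)),
                             key=lambda params: plsa_loglik(counts, *params))
```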
As the previous analysis shows, the advantage of the EM algorithm is that the complete-data likelihood is much easier to maximize: each possible value of the hidden variable is weighted by the conditional probability of the hidden variable taking that value, and what we ultimately maximize is the resulting expectation. Because the hidden variable is effectively treated as known, the Q function is much easier to maximize than the likelihood of the original incomplete data. So, in the presence of "missing data", introducing a hidden variable lets us maximize the complete-data likelihood with relative ease.
In the E-step, the main computation is the conditional probability of the hidden variable, which in pLSA is
P(z_k | d_i, w_j) = P(w_j|z_k) P(z_k|d_i) / Σ_l P(w_j|z_l) P(z_l|d_i)
and in our simple mixture language model example is
p(z_w = 1 | w) = λ_B p(w|θ_B) / [ λ_B p(w|θ_B) + (1 - λ_B) p(w|θ_d) ].
Because we assume that the value of z depends only on the current word, this is easy to compute; in LDA, however, computing the conditional probability of the hidden variables and maximizing the Q function are more complicated, as can be seen in the parameter derivation of the original LDA paper. We can also use the simpler Gibbs sampling to estimate the parameters; see the post on LDA and Gibbs sampling for details.
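In the toy setting of the earlier sketch (again reusing p_background, p_topic and lambda_b), this E-step posterior is just:

```python
def p_background_given_word(w):
    """E-step posterior p(z = 1 | w): probability that word w came from the background model."""
    num = lambda_b * p_background[w]
    return num / (num + (1 - lambda_b) * p_topic[w])

print(round(p_background_given_word("the"), 3))   # a very common word: close to 1
```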
To continue with our problem, here is the M-step. The maximum of the Q function is again obtained with the Lagrange multiplier method, and the constraint condition is
Σ_{w∈V} p(w|θ_d) = 1.
Constructing the Lagrangian auxiliary function
H = Q(θ_d; θ_d^(n)) + λ ( 1 - Σ_{w∈V} p(w|θ_d) ),
taking the partial derivative with respect to each independent variable p(w|θ_d) and setting it to zero, the unique stationary point solved from the partial derivatives is
p^(n+1)(w|θ_d) = c(w, d) p^(n)(z_w = 0 | w) / Σ_{w'∈V} c(w', d) p^(n)(z_{w'} = 0 | w').
It is easy to verify that this unique stationary point is the maximum. Notice that here Prof. Zhai changes the variable representation from traversing the word positions within the document to traversing the terms of the dictionary; this is possible because the value of z is related only to the corresponding word and is context-independent. Accordingly, the E-step formula for the conditional probability of the hidden variable also becomes, for each term w in the vocabulary,
p^(n)(z_w = 1 | w) = λ_B p(w|θ_B) / [ λ_B p(w|θ_B) + (1 - λ_B) p^(n)(w|θ_d) ].
Finally, we obtain the update formulas of the EM algorithm for the simple mixture unigram language model.
The E-step computes the conditional probability of the hidden variable:
p^(n)(z_w = 1 | w) = λ_B p(w|θ_B) / [ λ_B p(w|θ_B) + (1 - λ_B) p^(n)(w|θ_d) ]
and the M-step maximizes the expectation to re-estimate the parameters:
p^(n+1)(w|θ_d) = c(w, d) ( 1 - p^(n)(z_w = 1 | w) ) / Σ_{w'∈V} c(w', d) ( 1 - p^(n)(z_{w'} = 1 | w') ).
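Putting the two update formulas together, here is a minimal sketch of the whole loop (the function name, toy counts and background model are made up; this is a sketch under the assumptions above, not Prof. Zhai's code):

```python
def mixture_unigram_em(doc_counts, p_bg, lambda_b, n_iter=50):
    """EM for the two-component mixture.  doc_counts[w] = c(w, d); p_bg[w] = p(w|theta_B)."""
    vocab = list(doc_counts)
    # Initialize p(w|theta_d) uniformly over the vocabulary.
    p_topic_d = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(n_iter):
        # E-step: p(z_w = 0 | w), the probability that w was generated by the topic model.
        post_topic = {w: (1 - lambda_b) * p_topic_d[w] /
                         ((1 - lambda_b) * p_topic_d[w] + lambda_b * p_bg[w])
                      for w in vocab}
        # M-step: re-estimate p(w|theta_d) from the expected topic-word counts.
        norm = sum(doc_counts[w] * post_topic[w] for w in vocab)
        p_topic_d = {w: doc_counts[w] * post_topic[w] / norm for w in vocab}
    return p_topic_d

# Toy document counts and background model (all numbers made up).
doc_counts = {"the": 10, "is": 6, "text": 4, "mining": 3}
p_bg = {"the": 0.5, "is": 0.3, "text": 0.1, "mining": 0.1}
print(mixture_unigram_em(doc_counts, p_bg, lambda_b=0.7))
```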
As the whole computation shows, we never need to write out the Q function explicitly: we compute the conditional probabilities of the hidden variables in the E-step, and the new parameter estimates that maximize the Q function then follow directly from the closed-form update.
So the two-step iteration of the EM algorithm is essentially a search for a better value of the parameter to be estimated, one that increases a lower bound on the likelihood of the original, incomplete data; this lower bound is the expectation of the complete-data likelihood after introducing the hidden variable, i.e. the Q function that appears in many EM handouts, and better parameter values are found by maximizing it. At the same time, the parameter estimate of the previous round is used in the next E-step to compute the conditional probability of the hidden variable, and this conditional probability is exactly what is needed to find the new parameter value that maximizes the Q function.
from:http://blog.csdn.net/pipisorry/article/details/42560877
Ref: TopicModel - EM algorithm
TopicModel - pLSA model and EM derivation of pLSA