TopicModel-PLSA model and EM derivation of PLSA

PLSA is a model based on probability statistics, and the EM algorithm is used to learn its parameters.

In the probabilistic graphical model of PLSA, d denotes a document, z denotes a latent class or topic, and w denotes an observed word. P(d_i) denotes the probability of selecting document d_i, P(z_k | d_i) denotes the probability of topic z_k appearing given document d_i, and P(w_j | z_k) denotes the probability of word w_j appearing given topic z_k. Each topic is a multinomial distribution over all terms, and each document is a multinomial distribution over all topics.

The entire document generation process:

(1) Select a document d_i with probability P(d_i);

(2) Select a topic z_k with probability P(z_k | d_i);

(3) Generate a word w_j with probability P(w_j | z_k).
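
To make this generative process concrete, here is a minimal Python sketch. The array names and toy values for P(d_i), P(z_k | d_i), and P(w_j | z_k) are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (illustrative only): 2 documents, 2 topics, 3 vocabulary words.
p_d = np.array([0.5, 0.5])                      # P(d_i): probability of selecting each document
p_z_given_d = np.array([[0.9, 0.1],             # P(z_k | d_i): one row per document
                        [0.2, 0.8]])
p_w_given_z = np.array([[0.7, 0.2, 0.1],        # P(w_j | z_k): one row per topic
                        [0.1, 0.3, 0.6]])

def generate_pair():
    """Generate one observed (document, word) pair following the PLSA process."""
    d = rng.choice(len(p_d), p=p_d)                          # (1) pick a document with probability P(d_i)
    z = rng.choice(p_z_given_d.shape[1], p=p_z_given_d[d])   # (2) pick a topic with probability P(z_k | d_i)
    w = rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])   # (3) pick a word with probability P(w_j | z_k)
    return d, w                                              # the topic z stays hidden; only (d, w) is observed

print([generate_pair() for _ in range(5)])
```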

The data we can observe are the (d_i, w_j) pairs, while z_k is a latent (implicit) variable.

The joint distribution of (d_i, w_j) is

P(d_i, w_j) = P(d_i) Σ_k P(w_j | z_k) P(z_k | d_i)

The distributions P(w_j | z_k) and P(z_k | d_i) correspond to the two sets of multinomial distributions, and these are the two sets of parameters we need to estimate. The following describes the detailed derivation for estimating the PLSA parameters with the EM algorithm.

Estimate parameters in PLSA by EM

As described in the article "Parameter estimation for text language models: Maximum Likelihood Estimation, MAP and Bayesian estimation", common parameter estimation methods include MLE, MAP, and Bayesian estimation.

However, in PLSA, if we try to use MLE to estimate the parameters directly, the log-likelihood function we obtain is

L = Σ_i Σ_j n(d_i, w_j) log P(d_i, w_j) = Σ_i Σ_j n(d_i, w_j) log [ P(d_i) Σ_k P(w_j | z_k) P(z_k | d_i) ]

where P(d_i) is an independent constant and n(d_i, w_j) is the number of times word w_j appears in document d_i.

Note: this is a function of P(w_j | z_k) and P(z_k | d_i), with N*K + M*K independent variables in total (the P(z_k | d_i) contribute N*K and the P(w_j | z_k) contribute M*K, for N documents, M terms, and K topics). If we take partial derivatives with respect to these variables directly, the resulting equations are hard to solve, because the variables appear inside a logarithm of a sum. Therefore, we use the EM algorithm to estimate the parameters of this kind of probability model with "implicit variables" or "missing data".

The EM algorithm is as follows:

(1) E step: calculate the posterior probability of the hidden variables given the current parameter estimates.

(2) M step: maximize the expectation of the complete-data log-likelihood function. Here we use the posterior probabilities of the implicit variables calculated in the E step, and obtain new parameter values.

The two-step iteration continues until convergence.
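
Written generically (X for the observed data, z for the hidden variables, θ for the parameters, and θ^(t) for the current estimate), the two steps above can be summarized as:

```latex
\text{E step: compute } P\!\left(z \mid X, \theta^{(t)}\right), \qquad
\text{M step: } \theta^{(t+1)} = \arg\max_{\theta}\;
\mathbb{E}_{z \sim P(z \mid X, \theta^{(t)})}\!\left[\log P(X, z \mid \theta)\right].
```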

[In PLSA, the incomplete data is the observed pair (d_i, w_j), the hidden variable is the topic z_k, so the complete data is the triple (d_i, w_j, z_k).]

For our PLSA parameter estimation problem, in the E step we directly use the Bayes formula to calculate the posterior probability of the implicit variable under the current parameter values:

P(z_k | d_i, w_j) = P(w_j | z_k) P(z_k | d_i) / Σ_l P(w_j | z_l) P(z_l | d_i)

In this step, we assume that all the P(w_j | z_k) and P(z_k | d_i) are known: at initialization they are assigned random values, and in subsequent iterations they are the parameter values obtained in the previous M step.

In the M step, we maximize the expectation of the complete-data log-likelihood function. The expectation is

E[L_c] = Σ_i Σ_j n(d_i, w_j) Σ_k P(z_k | d_i, w_j) log [ P(w_j | z_k) P(z_k | d_i) ]

Note that n(d_i, w_j) is known here, and P(z_k | d_i, w_j) is the value estimated in the E step. Next we maximize this expectation. This is a constrained extremum problem for a multivariate function, and we can use the method of Lagrange multipliers, which converts a constrained extremum problem into an unconstrained one. In PLSA, the objective function is the expectation above, and the constraints are

Σ_j P(w_j | z_k) = 1 for each topic z_k, and Σ_k P(z_k | d_i) = 1 for each document d_i.

Therefore, we can write the Lagrangian

F = E[L_c] + Σ_k β_k (1 - Σ_j P(w_j | z_k)) + Σ_i γ_i (1 - Σ_k P(z_k | d_i))

where β_k and γ_i are the Lagrange multipliers.

This is a function of P(w_j | z_k) and P(z_k | d_i). Taking the partial derivative with respect to each of them and setting it to zero gives

Σ_i n(d_i, w_j) P(z_k | d_i, w_j) - β_k P(w_j | z_k) = 0, for 1 <= j <= M, 1 <= k <= K

Σ_j n(d_i, w_j) P(z_k | d_i, w_j) - γ_i P(z_k | d_i) = 0, for 1 <= i <= N, 1 <= k <= K

Note that both sides of each equation have been rearranged here (multiplied through by the parameter). Combining these equations with the two groups of constraints, we can solve for the new M-step parameter values that maximize the expectation:

P(w_j | z_k) = Σ_i n(d_i, w_j) P(z_k | d_i, w_j) / Σ_m Σ_i n(d_i, w_m) P(z_k | d_i, w_m)

P(z_k | d_i) = Σ_j n(d_i, w_j) P(z_k | d_i, w_j) / n(d_i), where n(d_i) = Σ_j n(d_i, w_j)

The key to solving the equations lies in first finding the multipliers β_k and γ_i. In fact, we only need to sum each stationarity equation (over j for the first group and over k for the second) so that the constraints turn the coefficients into 1; the multipliers, and then the parameters, can be computed afterwards.

Then, using the updated parameter values, we return to the E step to calculate the posterior probability of the hidden variables given the current parameter estimates. This iteration continues until the termination condition is met.

Note that we still use MLE for the complete data in the M step. If we want to add some prior knowledge to our model, we can use MAP estimation in the M step instead. Just as in the article "Parameter estimation for text language models: Maximum Likelihood Estimation, MAP and Bayesian estimation", where we added the prior that "a coin is generally fair on both sides", the estimated parameter values will then include pseudo-counts contributed by the prior. The other steps are the same. For details, refer to the notes of Qiaozhu Mei.

Implementing PLSA is not difficult, and many implementations are available on the Internet.
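
As one illustration (not a reference implementation), here is a minimal NumPy sketch of the E and M steps derived above, operating on a document-word count matrix of n(d_i, w_j). The function name plsa_em, the toy counts, and the fixed iteration count are illustrative assumptions, and no smoothing or numerical safeguards are included.

```python
import numpy as np

def plsa_em(counts, K, iters=50, seed=0):
    """EM for PLSA. counts: (N_docs, M_words) matrix of n(d_i, w_j)."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random initialization of P(z_k | d_i) and P(w_j | z_k), normalized.
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E step: posterior P(z_k | d_i, w_j) proportional to P(w_j | z_k) P(z_k | d_i)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (N, K, M)
        post /= post.sum(axis=1, keepdims=True)          # normalize over topics k
        # M step: re-estimate from the expected counts n(d_i, w_j) * P(z_k | d_i, w_j)
        expected = counts[:, None, :] * post             # shape (N, K, M)
        p_w_z = expected.sum(axis=0)                     # sum over documents -> (K, M)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)                     # sum over words -> (N, K)
        p_z_d /= counts.sum(axis=1, keepdims=True)       # divide by n(d_i)
    return p_z_d, p_w_z

# Toy usage: 3 documents over a 4-word vocabulary, 2 topics.
counts = np.array([[5, 3, 0, 0], [4, 4, 1, 0], [0, 1, 6, 5]])
p_z_d, p_w_z = plsa_em(counts, K=2)
print(p_z_d.round(2)); print(p_w_z.round(2))
```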

4 Estimate parameters in a simple mixture unigram language model by EM

For PLSA parameter estimation we used the EM algorithm. The EM algorithm is often used for parameter estimation in models that contain "missing data" or "implicit variables". These two concepts are interrelated: when our model has implicit variables, we regard the original data as "incomplete data", because the values of the implicit variables cannot be observed; in turn, when our data is incomplete, we can add implicit variables to model the missing data.

To better understand the EM algorithm, let's look at how to use it to estimate the parameters of a simple mixture unigram language model. This section mainly follows Zhai's note on the EM algorithm.

4.1 Maximum likelihood estimation and the introduction of implicit variables

The so-called unigram language model is a language model that discards all context information and assumes that the probability of a word appearing is unrelated to its position. For details about its probabilistic graphical model, see the introduction in the article on LDA and Gibbs Sampling. What is a mixture model? In layman's terms, a mixture probability model is a new probability model formed by a linear combination of basic probability distributions such as the normal distribution or the multinomial distribution; for example, the Gaussian mixture model is a linear combination of K Gaussian distributions. Exactly which "component model" generated each data point is hidden from us. Here we assume that the mixture model contains two multinomial component models: one is the background word generation model, and the other is the topic (keyword) generation model. Note that this way of composing models is common in probabilistic language models; for example, TwitterLDA uses two multinomial distributions for background words and topic words, and TimeUserLDA uses two multinomial distributions for global topics and personal topics. To indicate which model generated each word, a Boolean control variable is added for each word.
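
As a small illustration of this two-component mixture with a per-word Boolean control variable, here is a hedged Python sketch; the vocabulary, the background and document distributions, and the mixing weight lambda_b are all made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "of", "text", "mining"]
lambda_b = 0.7                                   # proportion of background words, assumed known
p_background = np.array([0.5, 0.4, 0.05, 0.05])  # background word distribution theta_B
p_document   = np.array([0.05, 0.05, 0.5, 0.4])  # document topic word distribution theta_d

def generate_word():
    z = rng.random() < lambda_b                  # Boolean control variable: True -> background model
    dist = p_background if z else p_document
    return vocab[rng.choice(len(vocab), p=dist)], int(z)

print([generate_word() for _ in range(6)])
```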

The log-likelihood function of a document d is

log p(d | θ_d) = Σ_j log [ λ_B p(d_j | θ_B) + (1 - λ_B) p(d_j | θ_d) ]

where d_j is the j-th word in document d and λ_B is a parameter that represents the proportion of background words in the document. λ_B is usually given based on experience and is therefore known, and the background model p(w | θ_B) is also known; we only need to estimate the document topic model p(w | θ_d).

Similarly, we first try to use maximum likelihood estimation to estimate the parameter, that is, to find the parameter value that maximizes the likelihood function:

θ_d* = argmax_{θ_d} log p(d | θ_d)

This is a function of p(w | θ_d). Likewise, the parameters appear inside a logarithm of a sum, so the maximum is difficult to find in closed form. If you apply the method of Lagrange multipliers, you will find that the equations obtained by setting the partial derivatives to zero are difficult to solve. Therefore, we need to rely on numerical algorithms, and the EM algorithm is the one commonly used.

We introduce a Boolean variable z for each word to indicate whether the word is a background word or a topic word. That is,

z_j = 1 if the j-th word is generated by the background model, and z_j = 0 if it is generated by the document topic model.

Here we assume that the "complete data" contains not only all the observable words but also the implicit variable z. According to the EM algorithm, in the E step we work with the log-likelihood function of the "complete data".

Compared with the incomplete-data likelihood, the summation over the two components is now performed outside the logarithm, because once the control variable z is set we know exactly whether each word was generated by the background word distribution or the topic word distribution. What is the relationship between the two? If the original data is X and an implicit variable H is assigned to each original data point, then the complete data is (X, H).
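
To make the contrast explicit, the two log-likelihoods for this mixture model can be written side by side (a standard way to write them, using d_j for the j-th word of document d and z_j for its control variable):

```latex
\text{incomplete data: } \log p(d \mid \theta_d)
  = \sum_{j=1}^{|d|} \log\!\left[\lambda_B\, p(d_j \mid \theta_B) + (1-\lambda_B)\, p(d_j \mid \theta_d)\right]

\text{complete data: } \log p(d, \mathbf{z} \mid \theta_d)
  = \sum_{j=1}^{|d|} \left[ z_j \log\!\left(\lambda_B\, p(d_j \mid \theta_B)\right)
    + (1 - z_j) \log\!\left((1-\lambda_B)\, p(d_j \mid \theta_d)\right) \right]
```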

4.2 Analysis of the lower bound of the likelihood function

The basic idea of the EM algorithm is to first assign random values to the parameters to be estimated, and then keep searching for better parameter values through iterations of the E and M steps. A better parameter value should make the likelihood function larger. Suppose a potentially better parameter value is θ' and the parameter value estimated in the M step of the n-th iteration is θ^(n). Writing L(θ) = log p(X | θ) for the incomplete-data log-likelihood and L_c(θ) = log p(X, H | θ) for the "complete data" log-likelihood, the two satisfy L(θ) = L_c(θ) - log p(H | X, θ) for any value of the hidden variables H.

The goal of finding a better parameter value is to maximize L(θ'), which is equivalent to maximizing L(θ') - L(θ^(n)). Taking the expectation of this difference with respect to the conditional probability distribution of the implicit variables given the current data X and the current parameter estimate, p(H | X, θ^(n)), gives

L(θ') - L(θ^(n)) = Q(θ'; θ^(n)) - Q(θ^(n); θ^(n)) + D( p(H | X, θ^(n)) || p(H | X, θ') )

The third term on the right is the relative entropy of p(H | X, θ^(n)) with respect to p(H | X, θ'), which is always non-negative. Therefore, we have

L(θ') >= Q(θ'; θ^(n)) - Q(θ^(n); θ^(n)) + L(θ^(n))

Thus we obtain a lower bound on the incomplete-data likelihood at the potentially better parameter value. Note that the last two terms on the right are constants, because they do not contain θ'. Therefore, the lower bound of the incomplete-data likelihood is essentially the expectation of the complete-data likelihood function, that is, the Q function that appears in many EM algorithm handouts. Its expression is

Q(θ; θ^(n)) = E_{p(H | X, θ^(n))} [ log p(X, H | θ) ] = Σ_H p(H | X, θ^(n)) log p(X, H | θ)

We can see that this expectation equals the complete-data likelihood function multiplied by the conditional probability of the corresponding value of the implicit variable, summed over all values of the implicit variable. For the problem we want to solve, the Q function is

Q(θ_d; θ_d^(n)) = Σ_j [ p(z_j = 1 | d_j) log ( λ_B p(d_j | θ_B) ) + p(z_j = 0 | d_j) log ( (1 - λ_B) p(d_j | θ_d) ) ]

where p(z_j = 1 | d_j) is computed under the previous-round parameters θ_d^(n).

Let us explain the Q function a bit more. When the variable z is 0, the word is a topic word generated from the multinomial distribution θ_d; when z is 1, the word is a background word generated from the multinomial distribution θ_B. We can also see how the Q function, namely the expectation of the complete-data likelihood function, is calculated; this is the expectation we want to maximize (the "expectation" that the EM algorithm maximizes refers to exactly this). We need to pay special attention to the probabilities that the implicit variable takes its different values given the observed data X and the parameter values estimated in the previous round, because different values of the implicit variable correspond to different complete-data likelihood functions. The so-called expectation we want to calculate is the expected value of the complete-data likelihood over the different values of the implicit variable.

4.3 General steps of the EM algorithm

Through the analysis in the previous section, we know that if we can find, in the next iteration, a parameter value that makes the Q function larger, then the incomplete-data likelihood function will not decrease either.

Therefore, the general steps of the EM algorithm are as follows:

(1) Randomly initialize the parameter values; they can also be initialized based on prior knowledge about the range of the optimal parameter values.

(2) Two-step iteration to find better parameter values:

(A) E step (expectation): calculate the Q function;

(B) M step (maximization): find a better parameter value by maximizing the Q function.

(3) The algorithm stops when the likelihood function converges.

How can we make the EM algorithm more likely to find the global optimum instead of a local optimum? The first method is to try many different initial parameter values and then select the best one among the resulting estimates; the second method is to use a simpler model, for example one with a unique global maximum, to determine the initial values for the complex model.

The preceding analysis shows that the advantage of the EM algorithm is that the complete-data likelihood function is easier to maximize, because the values of the implicit variables are assumed known; of course, each assumed value must be weighted by the conditional probability of the implicit variable, so what we end up maximizing is an expectation. Since the implicit variables become known quantities, the maximum of the Q function is easier to compute than that of the original incomplete-data likelihood. Therefore, in the case of "missing data", introducing implicit variables lets us maximize the complete-data likelihood function much more easily.

In the E step, the main difficulty is calculating the conditional probability of the implicit variables. In PLSA this is

P(z_k | d_i, w_j) = P(w_j | z_k) P(z_k | d_i) / Σ_l P(w_j | z_l) P(z_l | d_i)

and in our simple mixture language model it is

p(z_w = 1 | w) = λ_B p(w | θ_B) / ( λ_B p(w | θ_B) + (1 - λ_B) p(w | θ_d) )

We assume that the value of z depends only on the current word, which makes it easy to calculate. In LDA, however, computing the conditional probability of the implicit variables and maximizing the Q function this way is complicated; see the parameter derivation in the original LDA paper. We can also estimate the parameters with a simpler method; for details, refer to the article on LDA and Gibbs Sampling.

Let us continue with our problem. The following is the M step: maximize the Q function using the method of Lagrange multipliers. The constraint is

Σ_{w in V} p(w | θ_d) = 1, where V is the vocabulary.

Construct the Lagrangian auxiliary function

F = Q(θ_d; θ_d^(n)) + μ ( 1 - Σ_{w in V} p(w | θ_d) )

where μ is the Lagrange multiplier.

Take the partial derivative with respect to each independent variable p(w | θ_d):

∂F/∂p(w | θ_d) = c(w, d) (1 - p(z_w = 1 | w)) / p(w | θ_d) - μ

Setting the partial derivatives to 0 yields the unique extreme point

p(w | θ_d) = c(w, d) (1 - p(z_w = 1 | w)) / μ, with μ = Σ_{w' in V} c(w', d) (1 - p(z_{w'} = 1 | w'))

It is easy to see that this unique extreme point is the maximum. Note that here Zhai changes the form of the expression, converting the traversal over word positions in the document into a traversal over terms in the dictionary (with counts c(w, d)), because the value of z depends only on the corresponding word and not on its context. Accordingly, the formula for the conditional probability of the implicit variable in the E step also becomes

p(z_w = 1 | w) = λ_B p(w | θ_B) / ( λ_B p(w | θ_B) + (1 - λ_B) p^(n)(w | θ_d) )

Finally, we obtain the EM update formulas for the simple mixture unigram language model, that is, the E-step formula for the conditional probability of the implicit variable and the M-step formula for the parameters that maximize the expectation:

E step: p(z_w = 1 | w) = λ_B p(w | θ_B) / ( λ_B p(w | θ_B) + (1 - λ_B) p^(n)(w | θ_d) )

M step: p^(n+1)(w | θ_d) = c(w, d) (1 - p(z_w = 1 | w)) / Σ_{w' in V} c(w', d) (1 - p(z_{w'} = 1 | w'))
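
As a hedged illustration of these two update formulas, here is a minimal NumPy sketch for a single document with a known background distribution and known mixing weight; the function name, toy counts, and iteration count are illustrative assumptions.

```python
import numpy as np

def mixture_em(word_counts, p_background, lambda_b, iters=30, seed=2):
    """EM for the two-component mixture unigram model.
    word_counts: c(w, d) over the vocabulary; p_background: known p(w | theta_B)."""
    rng = np.random.default_rng(seed)
    p_doc = rng.random(len(word_counts)); p_doc /= p_doc.sum()   # init p(w | theta_d)
    for _ in range(iters):
        # E step: p(z = 1 | w), the probability that word w came from the background model
        p_z1 = lambda_b * p_background / (lambda_b * p_background + (1 - lambda_b) * p_doc)
        # M step: re-estimate p(w | theta_d) from the expected counts of non-background words
        expected = word_counts * (1 - p_z1)
        p_doc = expected / expected.sum()
    return p_doc

# Toy usage: vocabulary of 4 terms with counts c(w, d) for one document.
counts = np.array([10.0, 8.0, 5.0, 2.0])
p_bg = np.array([0.6, 0.3, 0.05, 0.05])        # known background model
print(mixture_em(counts, p_bg, lambda_b=0.5).round(3))
```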

The whole calculation shows that in practice we do not need to write out the Q function explicitly; we only compute the conditional probabilities of the implicit variable in the E step and then apply the update formulas, which were derived by maximizing the Q function, to obtain the new parameter estimates.

Therefore, the two-step iteration of the EM algorithm is essentially a search for better values of the parameters to be estimated, such that the lower bound of the likelihood function of the original, incomplete data keeps increasing. This "lower bound" is the expectation of the complete-data likelihood function obtained after introducing the implicit variables, namely the Q function that appears in many EM algorithm handouts, and better parameter values are found by maximizing the Q function. At the same time, the parameter values estimated in the previous round are used as known quantities in the next E step to compute the conditional probabilities of the implicit variables, and those conditional probabilities are needed to maximize the new Q function.
