pLSA (Probabilistic Latent Semantic Analysis) model

Source: Internet
Author: User

The LSA model discussed in the previous post can handle synonymy (several words sharing one meaning) but not polysemy (one word having several meanings). The pLSA model handles polysemy better. The model rests on the following assumptions:

1 Assume that each word is generated as follows: first select a document $d_i$; then, conditioned on that document, select a latent variable $z_k$ (which can be understood as a topic); finally, conditioned on that topic, generate a word $w_j$. If $P(d_i, w_j)$ denotes the probability that word $j$ appears in document $i$, this assumption gives:

$$P(d_i, w_j) = P(d_i)\, P(w_j \mid d_i)$$

2 Another very important assumption is that the word $w_j$ and the document $d_i$ are conditionally independent given the latent variable, namely $P(w_j \mid z_k, d_i) = P(w_j \mid z_k)$, so that with $K$ topics:

$$P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$
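As a small illustration of the two assumptions, the sketch below computes $P(w_j \mid d_i)$ as a mixture over topics. The numbers and the names `p_z_given_d`, `p_w_given_z` and `word_given_doc` are made up for this example and do not come from the original post.

```python
# Hypothetical toy parameters: 2 documents, 2 topics, 3 words.
p_z_given_d = [[0.9, 0.1],        # P(z_k | d_0)
               [0.2, 0.8]]        # P(z_k | d_1)
p_w_given_z = [[0.7, 0.2, 0.1],   # P(w_j | z_0)
               [0.1, 0.3, 0.6]]   # P(w_j | z_1)


def word_given_doc(p_z_given_d, p_w_given_z):
    """Return P(w_j | d_i) = sum_k P(w_j | z_k) * P(z_k | d_i) for all i, j."""
    n_topics = len(p_w_given_z)
    n_words = len(p_w_given_z[0])
    return [[sum(row[k] * p_w_given_z[k][j] for k in range(n_topics))
             for j in range(n_words)]
            for row in p_z_given_d]


p_w_given_d = word_given_doc(p_z_given_d, p_w_given_z)
for i, row in enumerate(p_w_given_d):
    print(i, [round(p, 3) for p in row])
```

Each row of `p_w_given_d` is again a probability distribution over words, because it is a convex combination of the topic distributions.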


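As with the LSA model, we start from a word frequency matrix (possibly weighted, for example with TF-IDF) over $N$ documents and $M$ words, and then write the log-likelihood function as follows:

$$L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i) + \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j \mid d_i)$$

where $n(d_i, w_j)$ is the $(i, j)$ element of the word frequency matrix. The first sum, which involves only $P(d_i)$, does not depend on the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$; it is a constant and can be omitted (what remains is still written as $L$ below) without affecting where the maximum is attained.

Continue to derive:

$$L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$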
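The log-likelihood with the $P(d_i)$ term dropped can be evaluated directly from the count matrix and the two parameter tables. The following is only a sketch in plain Python; the function name `log_likelihood` and the toy numbers are assumptions made for illustration.

```python
import math


def log_likelihood(counts, p_z_given_d, p_w_given_z):
    """Sum over i, j of n(d_i, w_j) * log(sum_k P(w_j | z_k) * P(z_k | d_i))."""
    n_topics = len(p_w_given_z)
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # zero counts contribute nothing to the sum
            p_w_d = sum(p_z_given_d[i][k] * p_w_given_z[k][j]
                        for k in range(n_topics))
            total += n_ij * math.log(p_w_d)
    return total


# Hypothetical toy data: 2 documents, 2 words, 2 topics.
counts = [[3, 1],
          [0, 2]]
p_z_given_d = [[0.8, 0.2], [0.3, 0.7]]
p_w_given_z = [[0.9, 0.1], [0.2, 0.8]]
print(log_likelihood(counts, p_z_given_d, p_w_given_z))
```

Skipping zero counts also avoids evaluating the logarithm for word and document pairs that never occur.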

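Next, start the EM derivation! For every pair $(i, j)$, introduce a distribution $P_{ij}(z_k)$ over the latent variable. Jensen's inequality then gives a lower bound on $L$:

$$L = \sum_{i,j} n(d_i, w_j) \log \sum_{k} P_{ij}(z_k)\, \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{P_{ij}(z_k)} \;\ge\; \sum_{i,j} n(d_i, w_j) \sum_{k} P_{ij}(z_k) \log \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{P_{ij}(z_k)},$$

Here $P_{ij}(z_k)$ represents the probability that word $j$ in document $i$ was generated by the latent variable $z_k$. Since it is a probability distribution over $z_k$, as in the previous post, we have:

$$\sum_{k} P_{ij}(z_k) = 1, \qquad P_{ij}(z_k) \ge 0$$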


Then, from the condition under which the inequality becomes an equality, we have:

$$\frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{P_{ij}(z_k)} = C$$

It is worth noting that $C$ is constant with respect to $k$ for a fixed pair $(i, j)$, but it need not be the same for different pairs $(i, j)$, so $C$ cannot be pulled out as a single global constant. Whatever its value, the following steps still go through.

Summing the numerator and the denominator over $z_k$ gives:

$$\frac{\sum_{k} P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k} P_{ij}(z_k)} = C$$

As before, the denominator equals 1, so the numerator equals $C$. Substituting this value of $C$ back into the equality condition gives:

$$P_{ij}(z_k) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$
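The normalization step above can be checked numerically for a single pair $(i, j)$. This is a minimal sketch; the function name `posterior` and the numbers are hypothetical.

```python
def posterior(i, j, p_z_given_d, p_w_given_z):
    """Return [P_ij(z_k) for all k]: normalize P(w_j | z_k) * P(z_k | d_i) over k."""
    n_topics = len(p_w_given_z)
    joint = [p_z_given_d[i][k] * p_w_given_z[k][j] for k in range(n_topics)]
    c_ij = sum(joint)  # the constant C for this particular (i, j)
    return [x / c_ij for x in joint]


# Hypothetical toy parameters: 1 document, 2 topics, 2 words.
p_z_given_d = [[0.9, 0.1]]
p_w_given_z = [[0.7, 0.3],
               [0.1, 0.9]]
print(posterior(0, 0, p_z_given_d, p_w_given_z))  # roughly [0.984, 0.016]
print(posterior(0, 1, p_z_given_d, p_w_given_z))  # a different C, still sums to 1
```

The two calls use different normalizing constants, which illustrates that $C$ depends on the pair $(i, j)$ but not on $k$.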


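Here is a puzzle I ran into: the paper I was reading did not follow the steps of the EM derivation but, in a way I did not understand, directly obtained $P_{ij}(z_k) = P(z_k \mid d_i, w_j)$, and then, using Bayes' formula and the conditional independence assumption, got:

$$P(z_k \mid d_i, w_j) = \frac{P(d_i)\, P(z_k \mid d_i)\, P(w_j \mid z_k)}{\sum_{l} P(d_i)\, P(z_l \mid d_i)\, P(w_j \mid z_l)}$$

The factor $P(d_i)$ cancels between the numerator and the denominator, which leaves the same expression as $P_{ij}(z_k)$. This looked confusing at first, but the two routes agree: the distribution that turns the Jensen inequality into an equality is exactly the posterior of the latent variable.

The end result is:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$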

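The next step is to run the iterations of the EM algorithm:

E-step: Using the parameters $P(z_k \mid d_i)$ ($K \times N$ values in total) and $P(w_j \mid z_k)$ ($K \times M$ values in total), chosen at random in the first iteration and otherwise taken from the previous M-step, compute $P(z_k \mid d_i, w_j)$.

M-step: Hold $P(z_k \mid d_i, w_j)$ fixed and find the parameters $P(z_k \mid d_i)$ and $P(w_j \mid z_k)$ that maximize the expected log-likelihood

$$\sum_{i,j} n(d_i, w_j) \sum_{k} P(z_k \mid d_i, w_j) \log \big[ P(w_j \mid z_k)\, P(z_k \mid d_i) \big]$$

where the parameters must satisfy the following constraints:

$$\sum_{j=1}^{M} P(w_j \mid z_k) = 1, \qquad \sum_{k=1}^{K} P(z_k \mid d_i) = 1$$

This problem can be solved with the Lagrange multiplier method. The process is fairly involved; the result is as follows:

$$P(w_j \mid z_k) = \frac{\sum_{i} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{m} \sum_{i} n(d_i, w_m)\, P(z_k \mid d_i, w_m)}$$

$$P(z_k \mid d_i) = \frac{\sum_{j} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{n(d_i)}, \qquad n(d_i) = \sum_{j} n(d_i, w_j)$$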
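For readers who want to see where the M-step updates come from, here is a brief sketch of the Lagrange multiplier step. The multiplier names $\tau_k$ and $\rho_i$ are introduced only for this sketch. With $P(z_k \mid d_i, w_j)$ held fixed, form

$$H = \sum_{i,j} n(d_i, w_j) \sum_{k} P(z_k \mid d_i, w_j) \log \big[ P(w_j \mid z_k)\, P(z_k \mid d_i) \big] + \sum_{k} \tau_k \Big( 1 - \sum_{j} P(w_j \mid z_k) \Big) + \sum_{i} \rho_i \Big( 1 - \sum_{k} P(z_k \mid d_i) \Big)$$

Setting the partial derivatives to zero gives

$$\frac{\partial H}{\partial P(w_j \mid z_k)} = \frac{\sum_{i} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{P(w_j \mid z_k)} - \tau_k = 0, \qquad \frac{\partial H}{\partial P(z_k \mid d_i)} = \frac{\sum_{j} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{P(z_k \mid d_i)} - \rho_i = 0$$

Solving each equation for the parameter and summing over $j$ (respectively $k$) with the constraints fixes the multipliers:

$$\tau_k = \sum_{j} \sum_{i} n(d_i, w_j)\, P(z_k \mid d_i, w_j), \qquad \rho_i = \sum_{j} n(d_i, w_j) \sum_{k} P(z_k \mid d_i, w_j) = n(d_i)$$

so each update is a normalized, count-weighted sum of the posteriors $P(z_k \mid d_i, w_j)$.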



The E-step and M-step are executed alternately until convergence.
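The alternation of the two steps can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the original author's code: the function name `plsa`, the fixed iteration count in place of a convergence test, the random initialization, and the toy count matrix are all choices made for this example. Nested lists are used to mirror the index notation; an array library would be faster.

```python
import math
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM; returns P(z|d), P(w|z) and the log-likelihood history.

    Assumes every document contains at least one word.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        raw = [rng.random() + 1e-3 for _ in range(size)]
        total = sum(raw)
        return [x / total for x in raw]

    p_z_given_d = [random_dist(n_topics) for _ in range(n_docs)]
    p_w_given_z = [random_dist(n_words) for _ in range(n_topics)]
    history = []

    for _ in range(n_iter):
        # Accumulate the M-step numerators n(d_i, w_j) * P(z_k | d_i, w_j).
        acc_zd = [[0.0] * n_topics for _ in range(n_docs)]
        acc_wz = [[0.0] * n_words for _ in range(n_topics)]
        log_lik = 0.0
        for i in range(n_docs):
            for j in range(n_words):
                n_ij = counts[i][j]
                if n_ij == 0:
                    continue
                # E-step for this (i, j): posterior over topics.
                joint = [p_z_given_d[i][k] * p_w_given_z[k][j]
                         for k in range(n_topics)]
                norm = sum(joint)
                log_lik += n_ij * math.log(norm)
                for k in range(n_topics):
                    weight = n_ij * joint[k] / norm
                    acc_zd[i][k] += weight
                    acc_wz[k][j] += weight
        history.append(log_lik)  # log-likelihood before this update
        # M-step: normalize the accumulated sums.
        p_z_given_d = [[v / sum(row) for v in row] for row in acc_zd]
        p_w_given_z = [[v / sum(row) for v in row] for row in acc_wz]
    return p_z_given_d, p_w_given_z, history


# Hypothetical toy corpus: two groups of documents with disjoint vocabularies.
counts = [[4, 3, 2, 0, 0, 0],
          [3, 4, 1, 0, 0, 0],
          [0, 0, 0, 2, 4, 3],
          [0, 0, 0, 3, 2, 4]]
p_zd, p_wz, history = plsa(counts, n_topics=2)
print([[round(p, 3) for p in row] for row in p_zd])
```

The recorded log-likelihood never decreases from one iteration to the next, which is the usual EM guarantee; in practice one would stop when the increase falls below a tolerance instead of running a fixed number of iterations.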

After convergence, the two resulting probability matrices, $P(z_k \mid d_i)$ and $P(w_j \mid z_k)$, can be used for tasks such as text clustering and word clustering.
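One simple way to turn the two matrices into clusters is a hard assignment to the most probable topic. The sketch below is just one possible heuristic, with made-up probabilities and a hypothetical helper `hard_assign`; assigning a word to the topic with the largest $P(w_j \mid z_k)$ ignores the topic prior and is only a rough rule.

```python
def hard_assign(prob_rows):
    """Return, for each row, the index of its largest entry."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


# Hypothetical converged parameters: 3 documents, 2 topics, 3 words.
p_z_given_d = [[0.92, 0.08],
               [0.85, 0.15],
               [0.10, 0.90]]
p_w_given_z = [[0.6, 0.3, 0.1],
               [0.1, 0.2, 0.7]]

# Document clusters: most probable topic for each document.
doc_clusters = hard_assign(p_z_given_d)

# Word clusters: transpose so that each row holds one word's values across topics.
word_clusters = hard_assign(list(zip(*p_w_given_z)))

print(doc_clusters)   # [0, 0, 1]
print(word_clusters)  # [0, 0, 1]
```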
