Topic Models: Probabilistic Latent Semantic Analysis (pLSA)


A previous article summarized latent semantic analysis (LSA). LSA relies mainly on singular value decomposition (SVD) from linear algebra and has no rigorous probabilistic derivation. Moreover, because the dimensionality of text documents is often very high, SVD is computationally expensive for topic clustering; a probabilistic derivation instead allows the model to be solved with iterative optimization algorithms.

Based on the likelihood principle, Thomas Hofmann defined a generative model in 1999 and proposed the probabilistic latent semantic analysis model, or pLSA.

pLSA is a generative model in the family of probabilistic graphical models; related models include the unigram model and the mixture-of-unigrams model from language modeling.

We first set up the corpus. Suppose the dictionary contains V words in total, giving the vocabulary $\{w_1, w_2, \ldots, w_V\}$. If the words are independent and identically distributed (the bag-of-words assumption), a document can be represented as the count vector

    $$\mathbf{d} = (n(w_1), n(w_2), \ldots, n(w_V)),$$

where $n(w_i)$ denotes the number of occurrences of the $i$-th word in the current document.
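For instance, the bag-of-words count vector of a toy document can be built like this (a minimal Python sketch; the vocabulary and the sentence are invented for illustration):

    from collections import Counter

    vocab = ["cloud", "data", "model", "topic", "word"]       # toy dictionary, V = 5
    doc = "data model data topic data word model".split()     # toy document

    # d = (n(w_1), ..., n(w_V)): occurrence count of each dictionary word
    counts = Counter(doc)
    d = [counts[w] for w in vocab]
    print(d)   # [0, 3, 2, 1, 1]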

The unigram model assumes that the words in each document are drawn independently from a single multinomial distribution; that is, the count of the $i$-th dictionary word follows a multinomial with parameter $p(w_i)$. For example, suppose we have a V-sided die whose $i$-th face lands up with probability $p(w_i)$: each throw of the die produces one word, and after N throws we have a document of N words (the words in the document being independent and identically distributed). From the multinomial distribution, the probability of the document is

    $$p(\mathbf{d}) = \frac{N!}{n(w_1)! \cdots n(w_V)!} \prod_{i=1}^{V} p(w_i)^{n(w_i)}.$$
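As a quick check of this die-throwing view, here is a minimal Python/NumPy sketch; the five-word vocabulary and its face probabilities are made up for the example:

    import numpy as np
    from math import factorial

    # A V-sided "die": one face per dictionary word, with probabilities p(w_i).
    vocab = ["cloud", "data", "model", "topic", "word"]   # toy vocabulary, V = 5
    p_w = np.array([0.10, 0.30, 0.20, 0.25, 0.15])        # face probabilities, sum to 1

    # Throw the die N times: the count vector n(w_1), ..., n(w_V) is multinomial.
    N = 20
    counts = np.random.multinomial(N, p_w)

    # Probability of this document under the unigram model:
    # p(d) = N! / (n(w_1)! ... n(w_V)!) * prod_i p(w_i)^n(w_i)
    coef = factorial(N)
    for n in counts:
        coef //= factorial(int(n))
    p_doc = coef * np.prod(p_w ** counts)

    print(dict(zip(vocab, counts.tolist())), p_doc)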

In the figure, (a) is the probabilistic graphical model of the unigram model, and (b) is the probabilistic graphical model of pLSA.

The pLSA model differs from the single multinomial distribution of the unigram model: pLSA introduces a latent variable $z$ as a topic variable, and assumes the current corpus is generated from K topics. Let $D = \{d_1, \ldots, d_M\}$ be the set of M documents, $Z = \{z_1, \ldots, z_K\}$ the set of K topics, and $w_v$ the $v$-th word of the vocabulary. The probability distributions involved are the probability of each document $p(d_m)$, the topic distribution of each document $p(z_k \mid d_m)$, and the word distribution of each topic $p(w_v \mid z_k)$.

The generative process of the pLSA model is as follows:

    1. Select a document $d_m$ with probability $p(d_m)$;
    2. Choose a latent topic $z_k$ with probability $p(z_k \mid d_m)$;
    3. Generate a word $w_v$ with probability $p(w_v \mid z_k)$.
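As a concrete sketch of this generative process, here is a toy setup in Python/NumPy; the corpus sizes, the random seed, and the randomly drawn probability tables are all invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    M, K, V = 4, 2, 6   # number of documents, topics, vocabulary words (toy sizes)

    # Model parameters: p(d_m), p(z_k|d_m), p(w_v|z_k); each row is a distribution.
    p_d = np.full(M, 1.0 / M)
    p_z_d = rng.dirichlet(np.ones(K), size=M)   # one topic distribution per document
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # one word distribution per topic

    def generate_pair():
        """Generate one observed (document, word) co-occurrence pair."""
        d = rng.choice(M, p=p_d)         # 1. select a document with probability p(d_m)
        z = rng.choice(K, p=p_z_d[d])    # 2. choose a latent topic with p(z_k|d_m)
        w = rng.choice(V, p=p_w_z[z])    # 3. generate a word with p(w_v|z_k)
        return d, w                      # z remains hidden in the observed data

    print([generate_pair() for _ in range(10)])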

Of course, every model makes certain assumptions about the corpus, and pLSA makes the following two:

    1. Each co-occurrence pair $(d_m, w_v)$ is generated independently.
    2. Conditioned on the latent topic variable $z_k$, the variables $d_m$ and $w_v$ are conditionally independent: $p(d_m, w_v \mid z_k) = p(d_m \mid z_k)\, p(w_v \mid z_k)$.

pLSA was originally based on the aspect model. Assumption 1 is similar to the bag-of-words hypothesis of the unigram model, while assumption 2 relates to the generative graphical model that pLSA defines: the pLSA graph has the form X -> Z -> Y, which in Bayesian networks is called an indirect causal effect. For example, let X mean you have 2 dollars in your pocket, Z mean you can buy a pen, and Y mean you can take the exam. If you do not know whether you can buy a pen (Z), then whether you have 2 dollars in your pocket (X) affects whether you can take the exam (Y). But if you already know whether you can buy a pen (Z), then whether you have 2 dollars in your pocket (X) no longer affects whether you can take the exam (Y). Given the variable Z, the variables X and Y are conditionally independent: $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$.

What pLSA ultimately solves for is the probability that each word corresponds to each topic, namely the posterior $p(z_k \mid d_m, w_v)$ together with the parameters $p(z_k \mid d_m)$ and $p(w_v \mid z_k)$. Let us derive the formulas. pLSA uses maximum likelihood estimation (MLE).

We first write down the log-likelihood function, which contains the latent variable $z$:

    $$\mathcal{L} = \log p(D, W).$$

Due to assumption 1 of the pLSA model, the co-occurrence pairs are generated independently, so:

    $$\mathcal{L} = \log \prod_{m=1}^{M} \prod_{v=1}^{V} p(d_m, w_v)^{n(d_m, w_v)} = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v) \log p(d_m, w_v),$$

where $n(d_m, w_v)$ denotes the number of occurrences of the $v$-th word in the $m$-th document.

And since $d$ and $w$ are conditionally independent given the variable $z$ (assumption 2), we can write:

    $$p(d_m, w_v) = \sum_{k=1}^{K} p(z_k)\, p(d_m \mid z_k)\, p(w_v \mid z_k) = p(d_m) \sum_{k=1}^{K} p(z_k \mid d_m)\, p(w_v \mid z_k).$$

pLSA uses the EM algorithm to maximize this likelihood. EM is a very common iterative approximation algorithm in machine learning, generally used to estimate parameters by maximum likelihood or maximum a posteriori in the presence of latent variables. The E-step computes the posterior probability of the latent variable under the current parameters (the Expectation step), and the M-step re-estimates the parameter values that maximize the likelihood, or the posterior (the Maximization step).

First, the expectation of the above likelihood function is computed under the posterior of $z$:

    $$Q = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v) \sum_{k=1}^{K} p(z_k \mid d_m, w_v) \log \left[ p(w_v \mid z_k)\, p(z_k \mid d_m) \right].$$

This formula is subject to two normalization constraints:

    $$\sum_{v=1}^{V} p(w_v \mid z_k) = 1, \qquad \sum_{k=1}^{K} p(z_k \mid d_m) = 1.$$

To find the extremum with the method of Lagrange multipliers, we introduce one multiplier for each of the two constraints, $\tau_k$ and $\rho_m$, and form the Lagrangian:

    $$H = Q + \sum_{k=1}^{K} \tau_k \left( 1 - \sum_{v=1}^{V} p(w_v \mid z_k) \right) + \sum_{m=1}^{M} \rho_m \left( 1 - \sum_{k=1}^{K} p(z_k \mid d_m) \right).$$

Taking the partial derivatives of $H$ with respect to the variables $p(w_v \mid z_k)$ and $p(z_k \mid d_m)$ and setting them to zero gives:

    $$\sum_{m=1}^{M} n(d_m, w_v)\, p(z_k \mid d_m, w_v) - \tau_k\, p(w_v \mid z_k) = 0,$$

    $$\sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v) - \rho_m\, p(z_k \mid d_m) = 0.$$

Substituting these back into the two constraints eliminates the multipliers:

    $$\tau_k = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v), \qquad \rho_m = \sum_{v=1}^{V} n(d_m, w_v) = n(d_m).$$

We then obtain the M-step update equations that maximize $Q$:

    $$p(w_v \mid z_k) = \frac{\sum_{m=1}^{M} n(d_m, w_v)\, p(z_k \mid d_m, w_v)}{\sum_{v'=1}^{V} \sum_{m=1}^{M} n(d_m, w_{v'})\, p(z_k \mid d_m, w_{v'})},$$

    $$p(z_k \mid d_m) = \frac{\sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v)}{n(d_m)}.$$

The EM iteration of pLSA can thus be summarized as follows:

E-step: calculate the posterior probability of the latent variable $z$ by Bayes' rule:

    $$p(z_k \mid d_m, w_v) = \frac{p(w_v \mid z_k)\, p(z_k \mid d_m)}{\sum_{k'=1}^{K} p(w_v \mid z_{k'})\, p(z_{k'} \mid d_m)}.$$

M-step: calculate $p(w_v \mid z_k)$ and $p(z_k \mid d_m)$ using the two update equations above.
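Since the article's own C++ implementation was never uploaded (see the link below), here is a minimal stand-in sketch of these two steps in Python/NumPy; the toy count matrix, the iteration count, and all variable names are assumptions of this sketch rather than anything from the original:

    import numpy as np

    def plsa_em(n_dw, K, iters=100, seed=0):
        """Fit pLSA by EM. n_dw[m, v] = count of word v in document m."""
        rng = np.random.default_rng(seed)
        M, V = n_dw.shape
        p_z_d = rng.dirichlet(np.ones(K), size=M)   # p(z_k | d_m), shape (M, K)
        p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w_v | z_k), shape (K, V)

        for _ in range(iters):
            # E-step: p(z_k | d_m, w_v) proportional to p(w_v | z_k) p(z_k | d_m)
            post = p_z_d[:, None, :] * p_w_z.T[None, :, :]   # shape (M, V, K)
            post /= post.sum(axis=2, keepdims=True)

            # M-step: re-estimate the parameters from the expected counts
            nz = n_dw[:, :, None] * post                 # n(d_m, w_v) * p(z_k | d_m, w_v)
            p_w_z = nz.sum(axis=0).T                     # sum over documents -> (K, V)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True)    # normalize over words
            p_z_d = nz.sum(axis=1)                       # sum over words -> (M, K)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)    # divide by n(d_m)
        return p_z_d, p_w_z

    # Toy corpus: 3 documents over a 4-word vocabulary.
    n_dw = np.array([[5, 3, 0, 0],
                     [4, 4, 1, 0],
                     [0, 1, 6, 5]], dtype=float)
    p_z_d, p_w_z = plsa_em(n_dw, K=2)
    print(np.round(p_z_d, 3))
    print(np.round(p_w_z, 3))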

The problem with pLSA is that its parameters include the document variable $d$: the distributions $p(z_k \mid d_m)$ are tied to the documents of the training set, so the number of parameters grows with the corpus and the model is difficult to apply to other documents. Latent Dirichlet Allocation (LDA), proposed by David Blei, addresses this by setting two corpus-level Dirichlet parameters, removing the fixed document-specific variables.

pLSA implementation in C++: "not yet uploaded"

https://blog-potatolife.rhcloud.com/?p=147
