Topic Models: Probabilistic Latent Semantic Analysis (pLSA)


A previous article summarized latent semantic analysis (LSA). LSA relies mainly on singular value decomposition (SVD) from linear algebra and has no rigorous probabilistic derivation. Moreover, because the dimensionality of text documents is often very high, SVD is computationally expensive for topic clustering; a probabilistic derivation instead allows the model to be solved with iterative optimization algorithms.

Based on the likelihood principle, Thomas Hofmann defined a generative model in 1999 and proposed the probabilistic latent semantic analysis model, or pLSA.

pLSA is a generative model in the family of probabilistic graphical models; related models include the unigram model and the mixture-of-unigrams model from language modeling.

We first set up the corpus. Suppose the dictionary contains V words in total, giving the vocabulary $\{w_1, w_2, \ldots, w_V\}$. If the words are independent and identically distributed (the bag-of-words assumption), a document can be represented as the count vector

    $$\mathbf{d} = (n(w_1), n(w_2), \ldots, n(w_V)),$$

where $n(w_i)$ denotes the number of occurrences of the $i$-th word in the current document.
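For instance, the bag-of-words count vector of a toy document can be built like this (a minimal Python sketch; the vocabulary and the sentence are invented for illustration):

    from collections import Counter

    vocab = ["cloud", "data", "model", "topic", "word"]       # toy dictionary, V = 5
    doc = "data model data topic data word model".split()     # toy document

    # d = (n(w_1), ..., n(w_V)): occurrence count of each dictionary word
    counts = Counter(doc)
    d = [counts[w] for w in vocab]
    print(d)   # [0, 3, 2, 1, 1]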

The unigram model assumes that the words in each document are drawn independently from a single multinomial distribution; that is, the count of the $i$-th dictionary word follows a multinomial with parameter $p(w_i)$. For example, suppose we have a V-sided die whose $i$-th face lands up with probability $p(w_i)$: each throw of the die produces one word, and after N throws we have a document of N words (the words in the document being independent and identically distributed). From the multinomial distribution, the probability of the document is

    $$p(\mathbf{d}) = \frac{N!}{n(w_1)! \cdots n(w_V)!} \prod_{i=1}^{V} p(w_i)^{n(w_i)}.$$
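As a quick check of this die-throwing view, here is a minimal Python/NumPy sketch; the five-word vocabulary and its face probabilities are made up for the example:

    import numpy as np
    from math import factorial

    # A V-sided "die": one face per dictionary word, with probabilities p(w_i).
    vocab = ["cloud", "data", "model", "topic", "word"]   # toy vocabulary, V = 5
    p_w = np.array([0.10, 0.30, 0.20, 0.25, 0.15])        # face probabilities, sum to 1

    # Throw the die N times: the count vector n(w_1), ..., n(w_V) is multinomial.
    N = 20
    counts = np.random.multinomial(N, p_w)

    # Probability of this document under the unigram model:
    # p(d) = N! / (n(w_1)! ... n(w_V)!) * prod_i p(w_i)^n(w_i)
    coef = factorial(N)
    for n in counts:
        coef //= factorial(int(n))
    p_doc = coef * np.prod(p_w ** counts)

    print(dict(zip(vocab, counts.tolist())), p_doc)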

In the figure, (a) is the probabilistic graphical model of the unigram model, and (b) is the probabilistic graphical model of pLSA.

The pLSA model differs from the single multinomial distribution of the unigram model: pLSA introduces a latent variable $z$ as a topic variable, and assumes the current corpus is generated from K topics. Let $D = \{d_1, \ldots, d_M\}$ be the set of M documents, $Z = \{z_1, \ldots, z_K\}$ the set of K topics, and $w_v$ the $v$-th word of the vocabulary. The probability distributions involved are the probability of each document $p(d_m)$, the topic distribution of each document $p(z_k \mid d_m)$, and the word distribution of each topic $p(w_v \mid z_k)$.

The generative process of the pLSA model is as follows:

    1. Select a document $d_m$ with probability $p(d_m)$;
    2. Choose a latent topic $z_k$ with probability $p(z_k \mid d_m)$;
    3. Generate a word $w_v$ with probability $p(w_v \mid z_k)$.
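As a concrete sketch of this generative process, here is a toy setup in Python/NumPy; the corpus sizes, the random seed, and the randomly drawn probability tables are all invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    M, K, V = 4, 2, 6   # number of documents, topics, vocabulary words (toy sizes)

    # Model parameters: p(d_m), p(z_k|d_m), p(w_v|z_k); each row is a distribution.
    p_d = np.full(M, 1.0 / M)
    p_z_d = rng.dirichlet(np.ones(K), size=M)   # one topic distribution per document
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # one word distribution per topic

    def generate_pair():
        """Generate one observed (document, word) co-occurrence pair."""
        d = rng.choice(M, p=p_d)         # 1. select a document with probability p(d_m)
        z = rng.choice(K, p=p_z_d[d])    # 2. choose a latent topic with p(z_k|d_m)
        w = rng.choice(V, p=p_w_z[z])    # 3. generate a word with p(w_v|z_k)
        return d, w                      # z remains hidden in the observed data

    print([generate_pair() for _ in range(10)])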

Of course, every model makes certain assumptions about the corpus, and pLSA makes the following two:

    1. Each co-occurrence pair $(d_m, w_v)$ is generated independently.
    2. Conditioned on the latent topic variable $z_k$, the variables $d_m$ and $w_v$ are conditionally independent: $p(d_m, w_v \mid z_k) = p(d_m \mid z_k)\, p(w_v \mid z_k)$.

pLSA was originally based on the aspect model. Assumption 1 is similar to the bag-of-words hypothesis of the unigram model, while assumption 2 relates to the generative graphical model that pLSA defines: the pLSA graph has the form X -> Z -> Y, which in Bayesian networks is called an indirect causal effect. For example, let X mean you have 2 dollars in your pocket, Z mean you can buy a pen, and Y mean you can take the exam. If you do not know whether you can buy a pen (Z), then whether you have 2 dollars in your pocket (X) affects whether you can take the exam (Y). But if you already know whether you can buy a pen (Z), then whether you have 2 dollars in your pocket (X) no longer affects whether you can take the exam (Y). Given the variable Z, the variables X and Y are conditionally independent: $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$.

What pLSA ultimately solves for is the probability that each word corresponds to each topic, namely the posterior $p(z_k \mid d_m, w_v)$ together with the parameters $p(z_k \mid d_m)$ and $p(w_v \mid z_k)$. Let us derive the formulas. pLSA uses maximum likelihood estimation (MLE).

We first write down the log-likelihood function, which contains the latent variable $z$:

    $$\mathcal{L} = \log p(D, W).$$

Due to assumption 1 of the pLSA model, the co-occurrence pairs are generated independently, so:

    $$\mathcal{L} = \log \prod_{m=1}^{M} \prod_{v=1}^{V} p(d_m, w_v)^{n(d_m, w_v)} = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v) \log p(d_m, w_v),$$

where $n(d_m, w_v)$ denotes the number of occurrences of the $v$-th word in the $m$-th document.

And since $d$ and $w$ are conditionally independent given the variable $z$ (assumption 2), we can write:

    $$p(d_m, w_v) = \sum_{k=1}^{K} p(z_k)\, p(d_m \mid z_k)\, p(w_v \mid z_k) = p(d_m) \sum_{k=1}^{K} p(z_k \mid d_m)\, p(w_v \mid z_k).$$

pLSA uses the EM algorithm to maximize this likelihood. EM is a very common iterative approximation algorithm in machine learning, generally used to estimate parameters by maximum likelihood or maximum a posteriori in the presence of latent variables. The E-step computes the posterior probability of the latent variable under the current parameters (the Expectation step), and the M-step re-estimates the parameter values that maximize the likelihood, or the posterior (the Maximization step).

First, the expectation of the above likelihood function is computed under the posterior of $z$:

    $$Q = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v) \sum_{k=1}^{K} p(z_k \mid d_m, w_v) \log \left[ p(w_v \mid z_k)\, p(z_k \mid d_m) \right].$$

This formula is subject to two normalization constraints:

    $$\sum_{v=1}^{V} p(w_v \mid z_k) = 1, \qquad \sum_{k=1}^{K} p(z_k \mid d_m) = 1.$$

To find the extremum with the method of Lagrange multipliers, we introduce one multiplier for each of the two constraints, $\tau_k$ and $\rho_m$, and form the Lagrangian:

    $$H = Q + \sum_{k=1}^{K} \tau_k \left( 1 - \sum_{v=1}^{V} p(w_v \mid z_k) \right) + \sum_{m=1}^{M} \rho_m \left( 1 - \sum_{k=1}^{K} p(z_k \mid d_m) \right).$$

Taking the partial derivatives of $H$ with respect to the variables $p(w_v \mid z_k)$ and $p(z_k \mid d_m)$ and setting them to zero gives:

    $$\sum_{m=1}^{M} n(d_m, w_v)\, p(z_k \mid d_m, w_v) - \tau_k\, p(w_v \mid z_k) = 0,$$

    $$\sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v) - \rho_m\, p(z_k \mid d_m) = 0.$$

Substituting these back into the two constraints eliminates the multipliers:

    $$\tau_k = \sum_{m=1}^{M} \sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v), \qquad \rho_m = \sum_{v=1}^{V} n(d_m, w_v) = n(d_m).$$

We then obtain the M-step update equations that maximize $Q$:

    $$p(w_v \mid z_k) = \frac{\sum_{m=1}^{M} n(d_m, w_v)\, p(z_k \mid d_m, w_v)}{\sum_{v'=1}^{V} \sum_{m=1}^{M} n(d_m, w_{v'})\, p(z_k \mid d_m, w_{v'})},$$

    $$p(z_k \mid d_m) = \frac{\sum_{v=1}^{V} n(d_m, w_v)\, p(z_k \mid d_m, w_v)}{n(d_m)}.$$

The EM iteration of pLSA can thus be summarized as follows:

E-step: calculate the posterior probability of the latent variable $z$ by Bayes' rule:

    $$p(z_k \mid d_m, w_v) = \frac{p(w_v \mid z_k)\, p(z_k \mid d_m)}{\sum_{k'=1}^{K} p(w_v \mid z_{k'})\, p(z_{k'} \mid d_m)}.$$

M-step: calculate $p(w_v \mid z_k)$ and $p(z_k \mid d_m)$ using the two update equations above.
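Since the article's own C++ implementation was never uploaded (see the link below), here is a minimal stand-in sketch of these two steps in Python/NumPy; the toy count matrix, the iteration count, and all variable names are assumptions of this sketch rather than anything from the original:

    import numpy as np

    def plsa_em(n_dw, K, iters=100, seed=0):
        """Fit pLSA by EM. n_dw[m, v] = count of word v in document m."""
        rng = np.random.default_rng(seed)
        M, V = n_dw.shape
        p_z_d = rng.dirichlet(np.ones(K), size=M)   # p(z_k | d_m), shape (M, K)
        p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w_v | z_k), shape (K, V)

        for _ in range(iters):
            # E-step: p(z_k | d_m, w_v) proportional to p(w_v | z_k) p(z_k | d_m)
            post = p_z_d[:, None, :] * p_w_z.T[None, :, :]   # shape (M, V, K)
            post /= post.sum(axis=2, keepdims=True)

            # M-step: re-estimate the parameters from the expected counts
            nz = n_dw[:, :, None] * post                 # n(d_m, w_v) * p(z_k | d_m, w_v)
            p_w_z = nz.sum(axis=0).T                     # sum over documents -> (K, V)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True)    # normalize over words
            p_z_d = nz.sum(axis=1)                       # sum over words -> (M, K)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)    # divide by n(d_m)
        return p_z_d, p_w_z

    # Toy corpus: 3 documents over a 4-word vocabulary.
    n_dw = np.array([[5, 3, 0, 0],
                     [4, 4, 1, 0],
                     [0, 1, 6, 5]], dtype=float)
    p_z_d, p_w_z = plsa_em(n_dw, K=2)
    print(np.round(p_z_d, 3))
    print(np.round(p_w_z, 3))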

The problem with pLSA is that its parameters include the document variable $d$: the distributions $p(z_k \mid d_m)$ are tied to the documents of the training set, so the number of parameters grows with the corpus and the model is difficult to apply to other documents. Latent Dirichlet Allocation (LDA), proposed by David Blei, addresses this by setting two corpus-level Dirichlet parameters, removing the fixed document-specific variables.

pLSA implementation in C++: "not yet uploaded"

https://blog-potatolife.rhcloud.com/?p=147
