Perplexity can be used to evaluate the quality of an LDA topic model and to judge how much an improved parameter setting or algorithm increases its modeling capability.
Perplexity is only a crude measure; it's helpful (when using LDA) for getting 'close' to the appropriate number of topics for a corpus.
Blei used perplexity as the evaluation criterion in the original Latent Dirichlet Allocation experiments.
1. Perplexity Definition
http://en.wikipedia.org/wiki/Perplexity
Perplexity is a measurement from information theory. The perplexity of B is defined as 2 raised to the power of the entropy of B (B can be a probability distribution or a probability model); it is usually used to compare probability models.
The Wikipedia article lists three kinds of perplexity:
1.1 Perplexity of a Probability Distribution
Formula:
2^{H(p)} = 2^{-Σ_x p(x) log2 p(x)}
where H(p) is the entropy of the distribution p. When p is a uniform distribution over K outcomes, the perplexity of p is exactly K.
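As a concrete check (a minimal sketch, not from the original post), the definition can be computed directly in Python; a uniform distribution over K outcomes comes out with perplexity exactly K:

```python
import numpy as np

def distribution_perplexity(p):
    """Perplexity of a discrete distribution: 2 ** H(p), with H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # zero-probability outcomes contribute nothing
    entropy = -np.sum(p * np.log2(p))   # H(p)
    return 2.0 ** entropy

print(distribution_perplexity([1/8] * 8))             # uniform over K=8 -> 8.0
print(distribution_perplexity([0.7, 0.1, 0.1, 0.1]))  # skewed -> ~2.56 "effective" outcomes
```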
1.2 Perplexity of a Probability Model
Formula:
2^{-(1/N) Σ_{i=1}^N log2 q(x_i)}
where each x_i is a test unit (it can be a sentence or a document) and N is the size of the test set (used for normalization). The lower the perplexity of model q on samples from the unknown distribution, the better the model.
The exponent part can also be calculated using cross entropy.
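A minimal sketch of this computation (assuming we already have the model's probabilities q(x_i) for each test unit; the numbers below are made up). The exponent is exactly the cross entropy just mentioned:

```python
import numpy as np

def model_perplexity(q_probs):
    """Perplexity of a model on a test set of N units, given q(x_i) for each."""
    q_probs = np.asarray(q_probs, dtype=float)
    cross_entropy = -np.mean(np.log2(q_probs))  # -(1/N) * sum_i log2 q(x_i)
    return 2.0 ** cross_entropy

# model probabilities for N = 4 held-out units (made-up values)
print(model_perplexity([0.1, 0.25, 0.05, 0.2]))  # lower is better
```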
1.3 Word Perplexity
Perplexity is often used for language model evaluation. Its physical meaning is the size of the encoding: for example, if a language model's perplexity on a test sentence is 2^190, then encoding that sentence requires 190 bits.
2. How to Evaluate an LDA Topic Model
Blei's paper only gives the perplexity formula:
perplexity(D_test) = exp{ -( Σ_{d=1}^M log p(w_d) ) / ( Σ_{d=1}^M N_d ) }
where M is the number of documents in the test corpus, N_d is the length of document d (i.e., its number of words), and p(w_d) is the probability of document d.
Computing the document probability:
p(w_d) = Π_{w ∈ d} Σ_z p(z|d) p(w|z)
For this computation, rickjin gives an explanation from the word perspective. Note that the p(z) in his formula denotes the distribution of document d over topic z; strictly, it should be written p(z | d).
Note: Blei computes perplexity at the document level, while rickjin computes it at the word level.
To sum up: the test corpus contains M documents. Under the bag-of-words model, for any word w in document d, p(w) = Σ_z p(z|d) * p(w|z), i.e., the sum over all topics of the word's probability under each topic, weighted by the document's distribution over topics.
The perplexity of the model is then exp{-(Σ log p(w)) / N}, where Σ log p(w) sums the log probability over all words in the test set (direct multiplication is usually converted into sums of logarithms), and N is the total number of word tokens in the test set (tokens are counted with repetition, not deduplicated).
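Putting this word-level computation into code, here is a minimal sketch. It assumes inference has already produced theta (document-topic distributions) and phi (topic-word distributions); these names and the helper itself are illustrative, not gensim internals:

```python
import numpy as np

def lda_perplexity(docs, theta, phi):
    """docs  : list of documents, each a list of word ids (tokens, with repeats)
    theta : (num_docs, num_topics) array, theta[d, z] = p(z | d)
    phi   : (num_topics, vocab_size) array, phi[z, w] = p(w | z)"""
    log_prob_sum = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = float(np.dot(theta[d], phi[:, w]))  # p(w) = sum_z p(z|d) * p(w|z)
            log_prob_sum += np.log(p_w)
            n_tokens += 1
    # N counts every token in the test set, with repetition
    return np.exp(-log_prob_sum / n_tokens)
```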
3. Estimating Perplexity in gensim
The ldamodel.bound() method computes a lower bound on perplexity, based on a supplied corpus (of held-out documents).
This is the method used by Hoffman, Blei & Bach in their "Online Learning for LDA" NIPS article.
[https://groups.google.com/forum/#!topic/gensim/LM619SB57zM]
You can also use model.log_perplexity(heldout), which is a convenience wrapper.
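An end-to-end sketch of both calls (the toy texts are made up; in practice the held-out documents should not appear in the training corpus). log_perplexity() returns the per-word bound, and gensim's own log message derives the perplexity estimate as 2 ** (-bound):

```python
import numpy as np
from gensim import corpora, models

train_texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
heldout_texts = [["human", "trees", "graph"]]

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2)

per_word_bound = lda.log_perplexity(heldout_corpus)  # convenience wrapper
print("perplexity estimate:", np.exp2(-per_word_bound))

print("corpus-level lower bound:", lda.bound(heldout_corpus))
```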
4. Evaluating a Language Model
Now suppose:
- We have some test data containing m sentences: s1, s2, s3, ..., sm
We can look at the probability of the test data under a model:
p(s1) * p(s2) * ... * p(sm) = Π_{i=1}^m p(s_i)
Computing such a product directly is troublesome, so we can measure the model's quality in another form.
Taking the log converts the multiplication into addition:
log Π_{i=1}^m p(s_i) = Σ_{i=1}^m log p(s_i)
In addition, each p(si) here is exactly the product of conditional probabilities introduced earlier, e.g., q(the | *, *) * q(dog | *, the) * q(...) ...
Given the above formula, the principle for judging whether a model is good or bad is:
A good model should assign as high a probability as possible to these test sentences.
Think of this value as a measure of how well the language model predicts the test sentences: the higher, the better.
- In fact, the standard evaluation metric is perplexity:
perplexity = 2^{-l}, where l = (1/M) Σ_{i=1}^m log2 p(s_i)
and M is the total number of words in the test data.
As the formula shows, the smaller the perplexity value, the better.
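Translating the formula directly into code (a minimal sketch with made-up log probabilities for the m test sentences):

```python
import numpy as np

def lm_perplexity(sentence_log2_probs, total_words):
    """perplexity = 2 ** (-l), with l = (1/M) * sum_i log2 p(s_i),
    where M is the total number of words in the test data."""
    l = sum(sentence_log2_probs) / total_words
    return 2.0 ** (-l)

# log2 p(s_i) for m = 3 test sentences totalling M = 12 words (made-up values)
print(lm_perplexity([-11.2, -9.8, -14.5], total_words=12))
```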
To better understand perplexity, let's look at the following example:
- We now have a vocabulary V, with N = |V| + 1, and suppose the model predicts every word with the uniform probability q = 1/N.
With the above conditions, it is easy to calculate:
l = log2(1/N), so perplexity = 2^{-l} = N
That is, the perplexity here is exactly N, which is the value of the branching factor.
What is the branching factor? Some translate it as "split rate". The higher the branching factor, the more possibilities there are at each step, and the more there is to compute.
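A quick sanity check of the uniform case in code (the vocabulary size is chosen arbitrarily):

```python
import numpy as np

V = 9999          # |V|, chosen arbitrarily
N = V + 1
# Every word is predicted with probability q = 1/N, so for any test data
# l = log2(1/N) and perplexity = 2 ** (-l) = N, the branching factor.
l = np.log2(1.0 / N)
print(2.0 ** (-l))  # 10000.0 == N
```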
The q = 1/N above is just a toy example. Now look at some real data:
- Goodman's results, with |V| = 50,000: under a trigram model, perplexity = 74
- Under a bigram model, perplexity = 137
- Under a unigram model, perplexity = 955
The perplexity values of the three models differ markedly, which shows that the trigram model generally performs best.
[Http://www.tuicool.com/articles/M7rAZv]
Questions found in: the gensim mailing list
From: http://blog.csdn.net/pipisorry/article/details/42460023
Ref: Topic models evaluation in Gensim
http://stackoverflow.com/questions/19615951/topic-models-evaluation-in-gensim
http://www.52ml.net/14623.html
Ngram model and perplexity in NLTK
http://www.researchgate.net/publication/221484800_Improving_language_model_perplexity_and_recognition_accuracy_for_medical_dictations_via_within-domain_interpolation_with_literal_and_semi-literal_corpora
Investigating the relationship between language model perplexity and IR precision-recall measures.