Evaluating LDA Topic Models with Perplexity in Gensim (Python)


This post looks at how to evaluate the quality of an LDA topic model, in order to judge the modeling capability of improved parameters or algorithms.

Perplexity is only a crude measure, but it's helpful (when using LDA) for getting "close" to the appropriate number of topics in a corpus.

Blei used the perplexity value as the evaluation criterion in the experiments of the original Latent Dirichlet Allocation paper.


1. Perplexity Definition

http://en.wikipedia.org/wiki/Perplexity

Perplexity is a measure from information theory. The perplexity of B is defined as 2 raised to the entropy of B (where B can be a probability distribution or a probability model); it is usually used to compare probability models.

The Wikipedia article lists three ways of computing perplexity:

1.1 Perplexity of a Probability Distribution

Formula:

    Perplexity(p) = 2^H(p) = 2^(−Σ_x p(x) log₂ p(x))

H(p) is the entropy of the probability distribution p. When p is uniform over K outcomes, the perplexity of p equals K.
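A minimal sketch of this definition in Python; the toy distributions are illustrative assumptions:

    import math

    def perplexity(p):
        # 2 ** H(p), where H(p) is the Shannon entropy of p in bits
        entropy = -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)
        return 2 ** entropy

    K = 8
    print(perplexity([1.0 / K] * K))      # 8.0 -- uniform over K outcomes gives K
    print(perplexity([0.5, 0.25, 0.25]))  # ~2.83 -- a skewed p has lower perplexity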

1.2 Perplexity of a Probability Model

Formula:

    Perplexity(q) = 2^(−(1/N) Σ_{i=1..N} log₂ q(x_i))

In the formula, x_i is a test unit (it can be a sentence or a piece of text) and N is the size of the test set (used for normalization). The smaller the perplexity a model q achieves on samples from the unknown true distribution, the better the model.

The exponent can also be read as the cross entropy between the empirical distribution of the test set and the model q.

1.3 Per-word Perplexity

Perplexity is often used for language model evaluation. Its physical meaning is an encoding size: for example, if a language model has perplexity 2^190 on a test sentence, encoding that sentence requires 190 bits.
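A minimal sketch tying 1.2 and 1.3 together: per-word perplexity computed from a model's token probabilities. The unigram model q and the test tokens are made-up values, not taken from the source:

    import math

    def word_perplexity(q, test_tokens):
        # 2 ** (-(1/N) * sum_i log2 q(x_i)), with one word token per test unit
        N = len(test_tokens)
        log_sum = sum(math.log2(q[w]) for w in test_tokens)
        return 2 ** (-log_sum / N)

    # hypothetical unigram probabilities and test text
    q = {"the": 0.5, "dog": 0.25, "barks": 0.25}
    print(word_perplexity(q, ["the", "the", "dog", "barks", "the"]))  # ~2.64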

 

2. Computing the Perplexity of an LDA Topic Model

Blei gives only the formula for perplexity:

    perplexity(D_test) = exp{ −Σ_{d=1..M} log p(w_d) / Σ_{d=1..M} N_d }

M is the number of documents in the test corpus, N_d is the size (number of words) of document d, and p(w_d) is the probability the model assigns to document d.

Calculation of the document probability:

    p(w_d) = Π_{n=1..N_d} Σ_z p(z | d) p(w_n | z)

For this calculation, rickjin's explanation is as follows: the p(z) in his formula denotes the topic distribution of document d, so strictly it should be written p(z | d).

Note: Blei computes perplexity from the perspective of each document, while rickjin computes it from the word perspective.

To sum up: the test corpus contains M documents. Under the bag-of-words model, for any word token w in document d, p(w) = Σ_z p(z | d) · p(w | z), i.e. the sum over all topics of the word's probability under each topic times that topic's probability in the word's document.

The perplexity of the model is then exp{ −Σ log p(w) / N }, where Σ log p(w) takes the log of p(w) for every word token (the direct product is converted into a sum of logarithms, the usual exp-of-log form), and N is the total number of word tokens in the test set (counted with repetition, not deduplicated).
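A minimal sketch of this word-level computation, assuming we already have the fitted document-topic distributions theta (rows are p(z|d)) and topic-word distributions phi (rows are p(w|z)); the array names and layout are assumptions, not any particular library's API:

    import numpy as np

    def lda_perplexity(theta, phi, docs):
        # exp(-sum_w log p(w) / N), with p(w) = sum_z p(z|d) * p(w|z)
        # theta: (num_docs, num_topics); phi: (num_topics, vocab_size)
        # docs: list of word-id lists, one list per test document
        log_sum, n_tokens = 0.0, 0
        for d, doc in enumerate(docs):
            for w in doc:
                p_w = theta[d] @ phi[:, w]  # sum over topics z
                log_sum += np.log(p_w)
                n_tokens += 1
        return np.exp(-log_sum / n_tokens)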


Estimating perplexity within gensim

The ldamodel.bound() method computes a lower bound on perplexity, based on a supplied corpus (of held-out documents).
This is the method used by Hoffman, Blei & Bach in their NIPS paper "Online Learning for Latent Dirichlet Allocation".

[https://groups.google.com/forum/#!topic/gensim/LM619SB57zM]

You can also use model.log_perplexity(heldout), which is a convenience wrapper.
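A short usage sketch with a toy corpus (a real evaluation would use a genuinely held-out document set). gensim's log_perplexity() returns a per-word likelihood bound in base 2, so the perplexity estimate is 2 raised to the negative of that bound:

    from gensim import corpora, models

    texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"],
             ["graph", "trees"],
             ["graph", "minors", "trees"]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

    heldout = corpus  # substitute held-out documents here
    per_word_bound = lda.log_perplexity(heldout)
    print("perplexity (lower-bound estimate):", 2 ** (-per_word_bound))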


Evaluating a Language Model

Now suppose:

  • We have some test data, and the test data contains m sentences: s_1, s_2, s_3, …, s_m

We can look at the probability of the test data under a model:

    ∏_{i=1..m} p(s_i)

We also know that computing this long product directly is troublesome. On this basis, we can measure model quality in another form: taking logs converts the multiplication into addition:

    Σ_{i=1..m} log₂ p(s_i)

In addition, the p(s_i) here is exactly the product of trigram probabilities introduced earlier, e.g. q(the | *, *) · q(dog | *, the) · q(…)…

With the above formula, the principle for evaluating whether a model is good or bad is:

A good model should assign as high probability as possible to these test data sentences.


We can view this value as a measure of how well the language model predicts these test data sentences: the higher, the better.

  • In fact, the usual evaluation metric is perplexity:

    Perplexity = 2^(−l), where l = (1/M) Σ_{i=1..m} log₂ p(s_i)

M is the total number of words in the test data.

As you can see from the formula, the smaller the perplexity value, the better.
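A minimal sketch of this computation for a trigram model, following Collins' conventions ('*' start padding, a STOP symbol counted in M); the uniform toy model q is a hypothetical stand-in for a trained model. Note that its output equals N, anticipating the branching-factor example below:

    import math
    from itertools import product

    def sentence_log2prob(q, sentence):
        # log2 p(s) under a trigram model q(w | u, v); '*' pads the start
        tokens = ["*", "*"] + sentence + ["STOP"]
        return sum(math.log2(q[(tokens[i - 2], tokens[i - 1], tokens[i])])
                   for i in range(2, len(tokens)))

    def corpus_perplexity(q, sentences):
        M = sum(len(s) + 1 for s in sentences)  # total words, counting STOP
        l = sum(sentence_log2prob(q, s) for s in sentences) / M
        return 2 ** (-l)

    # toy uniform model: |V| = 3 words, N = |V| + 1 = 4 (including STOP)
    words = ["the", "dog", "barks"]
    N = len(words) + 1
    q = {(u, v, w): 1.0 / N
         for u, v, w in product(words + ["*"], words + ["*"], words + ["STOP"])}
    print(corpus_perplexity(q, [["the", "dog", "barks"]]))  # 4.0, i.e. N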

To better understand perplexity, let's look at the following example:

  • Suppose we have a vocabulary V, let N = |V| + 1, and let the model assign the uniform probability q(w | u, v) = 1/N to every word.

With the above conditions, it is easy to calculate:

    l = log₂(1/N), so Perplexity = 2^(−l) = N

The perplexity here is exactly the value of the branching factor.

What is the branching factor? Some translate it as "split rate". It is the number of choices available at each step: the higher the branching factor, the more possibilities there are and the larger the amount of computation.

The uniform model q = 1/N above is just a toy example. Let's look at some real data:

  • Goodman's results, where |V| = 50,000: in a trigram model, perplexity = 74
  • In a bigram model, perplexity = 137
  • In a unigram model, perplexity = 955

Comparing the perplexity values of the three models, the trigram model clearly performs best.

[http://www.tuicool.com/articles/M7rAZv]



Questions found in: the gensim mailing list



From: http://blog.csdn.net/pipisorry/article/details/42460023

Ref: Topic models evaluation in Gensim

http://stackoverflow.com/questions/19615951/topic-models-evaluation-in-gensim

http://www.52ml.net/14623.html

Ngram model and perplexity in NLTK

http://www.researchgate.net/publication/221484800_Improving_language_model_perplexity_and_recognition_accuracy_for_medical_dictations_via_within-domain_interpolation_with_literal_and_semi-literal_corpora

Investigating the relationship between language model perplexity and IR precision-recall measures.

