Perplexity can be used to evaluate the quality of an LDA topic model and to judge how much an improved parameter setting or algorithm increases its modeling capability.
Perplexity is only a crude measure; it's helpful (when using LDA) for getting 'close' to the appropriate number of topics for a corpus.
Blei used perplexity as the evaluation criterion in the original Latent Dirichlet Allocation experiments.
1. Perplexity Definition
http://en.wikipedia.org/wiki/Perplexity
Perplexity is a measurement from information theory. The perplexity of B is defined as 2 raised to the power of the entropy of B (B can be a probability distribution or a probability model); it is usually used to compare probability models.
The Wikipedia article lists three kinds of perplexity:
1.1 Perplexity of a Probability Distribution
Formula:
2^{H(p)} = 2^{-Σ_x p(x) log2 p(x)}
where H(p) is the entropy of the distribution p. When p is a uniform distribution over K outcomes, the perplexity of p is exactly K.
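As a concrete check (a minimal sketch, not from the original post), the definition can be computed directly in Python; a uniform distribution over K outcomes comes out with perplexity exactly K:

```python
import numpy as np

def distribution_perplexity(p):
    """Perplexity of a discrete distribution: 2 ** H(p), with H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # zero-probability outcomes contribute nothing
    entropy = -np.sum(p * np.log2(p))   # H(p)
    return 2.0 ** entropy

print(distribution_perplexity([1/8] * 8))             # uniform over K=8 -> 8.0
print(distribution_perplexity([0.7, 0.1, 0.1, 0.1]))  # skewed -> ~2.56 "effective" outcomes
```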
1.2 Perplexity of a Probability Model
Formula:
2^{-(1/N) Σ_{i=1}^N log2 q(x_i)}
where each x_i is a test unit (it can be a sentence or a document) and N is the size of the test set (used for normalization). The lower the perplexity of model q on samples from the unknown distribution, the better the model.
The exponent part can also be calculated using cross entropy.
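A minimal sketch of this computation (assuming we already have the model's probabilities q(x_i) for each test unit; the numbers below are made up). The exponent is exactly the cross entropy just mentioned:

```python
import numpy as np

def model_perplexity(q_probs):
    """Perplexity of a model on a test set of N units, given q(x_i) for each."""
    q_probs = np.asarray(q_probs, dtype=float)
    cross_entropy = -np.mean(np.log2(q_probs))  # -(1/N) * sum_i log2 q(x_i)
    return 2.0 ** cross_entropy

# model probabilities for N = 4 held-out units (made-up values)
print(model_perplexity([0.1, 0.25, 0.05, 0.2]))  # lower is better
```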
1.3 Word Perplexity
Perplexity is often used for language model evaluation. Its physical meaning is the size of the encoding: for example, if a language model's perplexity on a test sentence is 2^190, then encoding that sentence requires 190 bits.
2. How to Evaluate an LDA Topic Model
Blei's paper only gives the perplexity formula:
perplexity(D_test) = exp{ -( Σ_{d=1}^M log p(w_d) ) / ( Σ_{d=1}^M N_d ) }
where M is the number of documents in the test corpus, N_d is the length of document d (i.e., its number of words), and p(w_d) is the probability of document d.
Computing the document probability:
p(w_d) = Π_{w ∈ d} Σ_z p(z|d) p(w|z)
For this computation, rickjin gives an explanation from the word perspective. Note that the p(z) in his formula denotes the distribution of document d over topic z; strictly, it should be written p(z | d).
Note: Blei computes perplexity at the document level, while rickjin computes it at the word level.
To sum up: the test corpus contains M documents. Under the bag-of-words model, for any word w in document d, p(w) = Σ_z p(z|d) * p(w|z), i.e., the sum over all topics of the word's probability under each topic, weighted by the document's distribution over topics.
The perplexity of the model is then exp{-(Σ log p(w)) / N}, where Σ log p(w) sums the log probability over all words in the test set (direct multiplication is usually converted into sums of logarithms), and N is the total number of word tokens in the test set (tokens are counted with repetition, not deduplicated).
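Putting this word-level computation into code, here is a minimal sketch. It assumes inference has already produced theta (document-topic distributions) and phi (topic-word distributions); these names and the helper itself are illustrative, not gensim internals:

```python
import numpy as np

def lda_perplexity(docs, theta, phi):
    """docs  : list of documents, each a list of word ids (tokens, with repeats)
    theta : (num_docs, num_topics) array, theta[d, z] = p(z | d)
    phi   : (num_topics, vocab_size) array, phi[z, w] = p(w | z)"""
    log_prob_sum = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = float(np.dot(theta[d], phi[:, w]))  # p(w) = sum_z p(z|d) * p(w|z)
            log_prob_sum += np.log(p_w)
            n_tokens += 1
    # N counts every token in the test set, with repetition
    return np.exp(-log_prob_sum / n_tokens)
```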
3. Estimating Perplexity in gensim
The ldamodel.bound() method computes a lower bound on perplexity, based on a supplied corpus (of held-out documents).
This is the method used by Hoffman, Blei & Bach in their "Online Learning for LDA" NIPS article.
[https://groups.google.com/forum/#!topic/gensim/LM619SB57zM]
You can also use model.log_perplexity(heldout), which is a convenience wrapper.
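An end-to-end sketch of both calls (the toy texts are made up; in practice the held-out documents should not appear in the training corpus). log_perplexity() returns the per-word bound, and gensim's own log message derives the perplexity estimate as 2 ** (-bound):

```python
import numpy as np
from gensim import corpora, models

train_texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
heldout_texts = [["human", "trees", "graph"]]

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2)

per_word_bound = lda.log_perplexity(heldout_corpus)  # convenience wrapper
print("perplexity estimate:", np.exp2(-per_word_bound))

print("corpus-level lower bound:", lda.bound(heldout_corpus))
```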
4. Evaluating a Language Model
Now suppose:
- We have some test data containing m sentences: s1, s2, s3, ..., sm
We can look at the probability of the test data under a model:
p(s1) * p(s2) * ... * p(sm) = Π_{i=1}^m p(s_i)
Computing such a product directly is troublesome, so we can measure the model's quality in another form.
Taking the log converts the multiplication into addition:
log Π_{i=1}^m p(s_i) = Σ_{i=1}^m log p(s_i)
In addition, each p(si) here is exactly the product of conditional probabilities introduced earlier, e.g., q(the | *, *) * q(dog | *, the) * q(...) ...
Given the above formula, the principle for judging whether a model is good or bad is:
A good model should assign as high a probability as possible to these test sentences.
Think of this value as a measure of how well the language model predicts the test sentences: the higher, the better.
- In fact, the standard evaluation metric is perplexity:
perplexity = 2^{-l}, where l = (1/M) Σ_{i=1}^m log2 p(s_i)
and M is the total number of words in the test data.
As the formula shows, the smaller the perplexity value, the better.
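Translating the formula directly into code (a minimal sketch with made-up log probabilities for the m test sentences):

```python
import numpy as np

def lm_perplexity(sentence_log2_probs, total_words):
    """perplexity = 2 ** (-l), with l = (1/M) * sum_i log2 p(s_i),
    where M is the total number of words in the test data."""
    l = sum(sentence_log2_probs) / total_words
    return 2.0 ** (-l)

# log2 p(s_i) for m = 3 test sentences totalling M = 12 words (made-up values)
print(lm_perplexity([-11.2, -9.8, -14.5], total_words=12))
```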
To better understand perplexity, let's look at the following example:
- We now have a vocabulary V, with N = |V| + 1, and suppose the model predicts every word with the uniform probability q = 1/N.
With the above conditions, it is easy to calculate:
l = log2(1/N), so perplexity = 2^{-l} = N
That is, the perplexity here is exactly N, which is the value of the branching factor.
What is the branching factor? Some translate it as "split rate". The higher the branching factor, the more possibilities there are at each step, and the more there is to compute.
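A quick sanity check of the uniform case in code (the vocabulary size is chosen arbitrarily):

```python
import numpy as np

V = 9999          # |V|, chosen arbitrarily
N = V + 1
# Every word is predicted with probability q = 1/N, so for any test data
# l = log2(1/N) and perplexity = 2 ** (-l) = N, the branching factor.
l = np.log2(1.0 / N)
print(2.0 ** (-l))  # 10000.0 == N
```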
The q = 1/N above is just a toy example. Now look at some real data:
- Goodman's results, with |V| = 50,000: under a trigram model, perplexity = 74
- Under a bigram model, perplexity = 137
- Under a unigram model, perplexity = 955
The perplexity values of the three models differ markedly, which shows that the trigram model generally performs best.
[Http://www.tuicool.com/articles/M7rAZv]
Questions found in: the gensim mailing list
From: http://blog.csdn.net/pipisorry/article/details/42460023
Ref: Topic models evaluation in Gensim
http://stackoverflow.com/questions/19615951/topic-models-evaluation-in-gensim
http://www.52ml.net/14623.html
Ngram model and perplexity in NLTK
http://www.researchgate.net/publication/221484800_Improving_language_model_perplexity_and_recognition_accuracy_for_medical_dictations_via_within-domain_interpolation_with_literal_and_semi-literal_corpora
Investigating the relationship between language model perplexity and IR precision-recall measures.