Using tmtoolkit in Python for Topic Model (LDA) Evaluation




Topic modeling is a method for finding abstract topics in a large collection of documents. With it, it is possible to discover a mixture of hidden or "latent" topics that varies from document to document in a given corpus. As an unsupervised machine learning method, topic models are not easy to evaluate, since there is no labeled "ground truth" data to compare against. However, because topic modeling usually requires some parameters to be defined beforehand (first of all, the number of topics to be discovered), model evaluation is crucial for finding the "best" parameter set for given data.








Evaluation methods for probabilistic LDA topic models





Model evaluation is hard when using unlabeled data. The metrics described here all attempt to assess a model's quality with theoretical methods in order to find the "best" model. It is still important to check manually whether the resulting model makes sense. In fact, the best way to evaluate model quality is with human-in-the-loop methods, where humans must be able to identify manually inserted random "word intruders" or "topic intruders". However, it is best to first narrow down the candidate models with the theoretical methods.


Evaluating the density or divergence of posterior distributions


Some metrics only evaluate the posterior distributions (the topic-word and document-topic distributions) without comparing the model to observed data in any way. Cao Juan et al. describe a method that relies on the pairwise distances between the topic-word distributions of all topics in a model. They claim that the higher the pairwise distances between topics, the higher the information density the model captures. The metric boils down to calculating the cosine similarity u·v/(|u|·|v|) (where |u| and |v| are the L2 norms of the respective vectors) for each pair of distributions u and v in the model's posterior topic-word distribution, and then taking the mean of these similarities. The lower this mean, the less similar the topics are to each other, and the better the model (at least by this metric).
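As an illustration, the metric can be sketched in a few lines of NumPy (the function name here is our own, not part of tmtoolkit, which computes this metric internally):

```python
import numpy as np

def cao_juan_metric(topic_word):
    """Mean pairwise cosine similarity between the rows of a K x V
    topic-word distribution matrix; lower values mean more distinct topics."""
    # normalize each topic's word distribution to unit L2 norm
    unit = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    # cosine similarity of every pair of topics
    sim = unit @ unit.T
    # average over the K*(K-1)/2 distinct topic pairs (upper triangle)
    iu = np.triu_indices(topic_word.shape[0], k=1)
    return sim[iu].mean()
```

For example, completely distinct (orthogonal) topic-word distributions yield a value of 0, while identical topics yield 1.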





Finding the best topic model using the AP data




Using tmtoolkit to compute and evaluate topic models





The main topic modeling features are located in the module tmtoolkit.lda_utils. Because we will use the lda package for topic modeling, we need to install it first in order to use the evaluation functions that are specific to that package. We start by importing the functions we need:


import matplotlib.pyplot as plt  # for plotting the results
plt.style.use('ggplot')

# for loading the data:
from tmtoolkit.utils import unpickle_file
# for model evaluation with the "lda" package:
from tmtoolkit.lda_utils import tm_lda
# for constructing the evaluation plot:
from tmtoolkit.lda_utils.common import results_by_parameter
from tmtoolkit.lda_utils.visualize import plot_eval_results



Next, we load the data, which consists of a list of document labels, a vocabulary (unique words) and the document-term matrix dtm. We make sure that dtm has the proper dimensions:


doc_labels, vocab, dtm = unpickle_file('ap.pickle')
print('%d documents, %d vocab size, %d tokens' % (len(doc_labels), len(vocab), dtm.sum()))
assert len(doc_labels) == dtm.shape[0]
assert len(vocab) == dtm.shape[1]


Now we define the parameter sets that should be evaluated. We set up a dictionary of constant parameters, const_params, which will be used for each topic model computation and remains the same across models. We also set up varying_params, a list of dictionaries with the parameter values that differ from model to model:


Here, we want to compute topic models for a range of topic counts ks = [10, 20, .., 100, 120, .., 300, 350, .., 500, 600, 700]. Since ks contains 26 different values, we will create and compare 26 topic models. Note that we have also defined an alpha parameter of 1/k for each model (see below for a discussion of the alpha and beta hyperparameters in LDA). The parameter names must match the parameters of the respective topic modeling package that you use. Here we use lda, so we pass parameters such as n_iter or n_topics (with other packages the parameter names will differ, e.g. num_topics instead of n_topics in Gensim).
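A sketch of these parameter definitions, consistent with the values described above (the exact n_iter value is an assumption; note that the lda package calls the beta hyperparameter "eta"):

```python
# number-of-topics grid described above: 10..100 in steps of 10,
# 120..300 in steps of 20, 350..500 in steps of 50, then 600 and 700
# -> 26 values in total
ks = list(range(10, 101, 10)) + list(range(120, 301, 20)) \
     + list(range(350, 501, 50)) + [600, 700]

# parameters varied per model: number of topics and alpha = 1/k
varying_params = [dict(n_topics=k, alpha=1.0 / k) for k in ks]

# parameters kept constant for every model (n_iter is an assumed value;
# "eta" is the lda package's name for the beta hyperparameter)
const_params = dict(n_iter=1000, eta=0.01)
```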



We can now start evaluating our models with the evaluate_topic_models function in the tm_lda module, passing it the list of varying parameters and the dictionary of constant parameters:


By default, this will use all available CPU cores to compute the models and evaluate them in parallel. If we evaluate 26 models on 4 CPU cores, tmtoolkit will start 4 subprocesses and assign the first 4 model computations to them, leaving 22 in the queue. When the first model computation finishes on any of the subprocesses, the fifth model computation task is started on that subprocess, and so on. This makes sure that all subprocesses (and all CPU cores) are always busy.



The evaluation function will return a list eval_results containing 26 2-tuples, one per computed model. Each of these tuples contains a dictionary with the parameters that were used to compute the model, and a dictionary with the evaluation results returned by each metric. With results_by_parameter we restructure the results for the parameter we are interested in and want to plot on the x-axis:


The plot_eval_results function creates a plot with a subplot for each metric calculated during the evaluation. Afterwards we can adjust the plot with matplotlib methods if we like (e.g. add a plot title), and finally show and/or save it.
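Taken together, the evaluation, restructuring and plotting steps described above can be sketched as follows (this is not runnable on its own: it assumes the older tmtoolkit API with the lda_utils module, the imports from the beginning, and the dtm, varying_params and const_params objects defined earlier):

```python
# compute and evaluate the 26 models in parallel (all CPU cores by default)
eval_results = tm_lda.evaluate_topic_models(dtm, varying_params, const_params)

# restructure so the number of topics can be drawn on the x-axis
eval_results_by_topics = results_by_parameter(eval_results, 'n_topics')

# one subplot per calculated metric
plot_eval_results(eval_results_by_topics)
plt.tight_layout()
plt.show()
```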


Results





Topic model evaluation, alpha = 1/k, beta = 0.01



The plot shows the normalized values of each metric: the Arun and Cao Juan values are scaled to [0, 1], the log-likelihood to [-1, 0]. We can see that the log-likelihood is maximized for k values between 100 and 350. The Arun metric points to values between 200 and 400. The Cao Juan metric reaches its minimum near k = 100 and does not rise again in the remaining range. This may be because this method only evaluates the topic-word distributions. Since the corpus is very large (more than 400,000 words), even for large k the calculated metric can stay very low, because the "density" in the topic-word distributions (i.e. the pairwise distances between them) will still be very high.



Note that for the "loglikelihood" metric, only the log-likelihood estimate of the final model is reported, which differs from the harmonic-mean method used by Griffiths and Steyvers. The Griffiths and Steyvers method could not be used because it requires a special Python package (gmpy2) that was not available on the CPU cluster machines on which I ran the evaluations. However, the "loglikelihood" metric reports very similar results.


Alpha and Beta parameters


Besides the number of topics, the literature mentions the alpha and beta (sometimes called eta) parameters. Both are used to define the Dirichlet priors that are used to compute the respective posterior distributions. Alpha is the "concentration parameter" of the prior for the document-specific topic distributions, and beta is the prior for the topic-specific word distributions. Both express an a-priori belief about the sparsity/homogeneity of topics in documents and of words in topics.



Alpha influences topic sparsity in documents. A high alpha value means that the effect of topic sparsity is small, i.e. we expect documents to contain a mixture of most topics, whereas a low alpha value means we expect documents to cover only a few topics. This is why alpha is often set to a fraction of the number of topics (such as 1/k in our evaluations): as more and more topics are to be discovered, we expect each document to contain fewer but more specific topics. As an extreme example: if we only wanted to discover two topics (k = 2), then it is likely that all documents contain both topics (to differing degrees), and hence we would have a large alpha value. If we wanted to discover k = 1000 topics, it is likely that most documents would not cover all 1000 topics but only a small fraction of them (i.e. would be very sparse), so we would use a low value alpha = 1/1000 to account for this expected sparsity.
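The effect of alpha on sparsity can be illustrated by sampling from the Dirichlet prior directly (a small NumPy demonstration, not part of the evaluation pipeline; the same reasoning applies to beta and the topic-word distributions):

```python
import numpy as np

rng = np.random.default_rng(42)
k = 100  # number of topics

# low alpha: the sampled document-topic distribution is sparse,
# i.e. most probability mass sits on a few topics
sparse_doc = rng.dirichlet(np.full(k, 1.0 / k))

# high alpha: the sample is a near-uniform mixture of most topics
uniform_doc = rng.dirichlet(np.full(k, 10.0))

# the largest topic weight is much bigger in the sparse case
print(sparse_doc.max(), uniform_doc.max())
```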



Similarly, beta influences word sparsity in topics. A high beta value means that the effect of word sparsity is small, i.e. we expect each topic to contain most words of the corpus; such topics will be more "general" and their word probabilities more uniform. A low beta value means topics should be more specific, i.e. their word probabilities will be more uneven, placing higher probability on fewer words. Of course, this also relates to the number of topics to be discovered: a high beta means fewer but more general topics are found, while a low beta should be used for more, more specific topics. Griffiths and Steyvers explain that beta "affects the granularity of the model: a corpus of documents can be reasonably decomposed into sets of topics of different sizes [...]".






Topic model evaluation, alpha = 1/k, beta = 0.1



When we run the evaluation with the same alpha parameter and the same range of k as above, but with beta = 0.1 instead of beta = 0.01, we see that the log-likelihood is maximized in a lower range of k, roughly 70 to 300 (see figure). The Arun metric points to values between 70 and 240. This confirms our assumption about beta: when trying to find a smaller number of topics, a higher beta should be used. Interestingly, this time the Cao Juan metric also shows a valley in its curve within the given value range. This means that with models with many topics, a higher beta value can also lead to decreased information density in the topic-word distributions.



There are many possibilities for combining these parameters, and it is often not easy to interpret the results. The figures below show the evaluation results for different scenarios: (1) alpha fixed and beta depending on k, (2) alpha and beta both fixed, (3) alpha and beta both depending on k.






(1) Topic model evaluation, alpha = 0.1, beta = 1/(10k)






(2) Topic model evaluation, alpha = 0.1, beta = 0.01






(3) Topic model evaluation, alpha = 1/k, beta = 1/(10k)



The LDA hyperparameters alpha, beta and the number of topics are all interrelated, and the interactions are very complex. It is wrong to assume that there is one "correct" parameter configuration for a given set of documents. First, it is important to decide how granular the model should be: whether it should cover only a few but very general topics, or capture more specific topics. Alpha and beta can be set accordingly, and a few sample models can be computed (for example with the compute_models_parallel function in tmtoolkit). In most cases, a fixed beta value that defines the "granularity" of the model, as recommended by Griffiths and Steyvers, seems reasonable. A more fine-grained evaluation with different alpha parameters (depending on k) can then be done using the explained metrics.





Validation on held-out data


Topic models can also be validated on held-out data. Unfortunately, the mentioned Python packages for topic modeling do not correctly compute the perplexity of held-out data, and tmtoolkit does not currently provide this either. Moreover, this is even more computationally intensive, especially when cross-validation is performed. Still, it would be interesting to compare these results with results from cross-validation, which could be done in future work.








