Topic models: SentenceLDA, CopulaLDA and TWE ︱ several new topic models


Baidu recently open-sourced a new topic-model project. It provides document topic inference tools, semantic matching calculation tools, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE).
I. Introduction to Familia

A small plug for Familia ~ see Familia's GitHub.
The application paradigms of topic models in industry can be abstracted into two kinds: semantic representation and semantic matching.

Semantic Representation
The document is reduced in dimensionality to obtain its semantic representation, which can then be fed to downstream applications such as text classification, text content analysis, CTR estimation, and so on.

Semantic Matching

To compute semantic matching between texts, two types of text similarity calculation are provided:

- Short text-long text similarity, with usage scenarios including document keyword extraction and computing the similarity between a search-engine query and a web page.
- Long text-long text similarity, with usage scenarios including computing the similarity between two documents, or between a user profile and a news article.
The demo that ships with Familia provides the following features:

Semantic representation calculation

The topic model is used to infer the topics of an input document, yielding a low-dimensional, topic-level representation of the document.
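As a toy illustration of this idea (using gensim's standard LDA rather than Familia's own tools; the two-document corpus below is made up), inferring a document's topic distribution might look like this:

from gensim import corpora, models

# Toy tokenized corpus; a real application would use a large, domain-specific corpus.
docs = [["economy", "growth", "policy"], ["match", "team", "score"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2)

# The inferred topic distribution is a low-dimensional semantic representation of the
# document, usable as features for classification, content analysis, CTR estimation, etc.
new_doc = dictionary.doc2bow(["policy", "growth"])
print(lda.get_document_topics(new_doc, minimum_probability=0.0))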

Semantic matching calculation

Calculates the similarity between texts, including short text-long text and long text-long text similarity.
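Continuing the toy gensim model above (this only illustrates the idea; Familia's actual scoring functions differ, and the helper functions here are my own), the two kinds of similarity could be sketched as:

import numpy as np
from gensim.matutils import hellinger

def topic_dist(tokens):
    # Dense topic distribution P(topic | text) for a tokenized text.
    sparse = lda.get_document_topics(dictionary.doc2bow(tokens), minimum_probability=0.0)
    return np.array([p for _, p in sparse])

# Long text - long text: distance between the two topic distributions.
doc_a, doc_b = ["economy", "policy", "growth"], ["team", "score", "match"]
print("long-long Hellinger distance:", hellinger(topic_dist(doc_a), topic_dist(doc_b)))

# Short text - long text: average likelihood of each query word under the document's
# topic mixture, i.e. sum_k P(word | topic k) * P(topic k | document).
def query_doc_score(query_tokens, doc_tokens):
    theta = topic_dist(doc_tokens)          # P(topic | document)
    phi = lda.get_topics()                  # P(word | topic), shape (K, |V|)
    ids = [dictionary.token2id[w] for w in query_tokens if w in dictionary.token2id]
    return float(np.mean([theta @ phi[:, i] for i in ids])) if ids else 0.0

print("short-long score:", query_doc_score(["policy"], doc_a))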

Model Content Presentation
Displays each topic's words and nearest-neighbor words, giving users an intuitive, visual sense of what the model's topics look like.
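With the same toy gensim model, the analogue of this content display is simply listing a topic's top words (Familia's demo additionally shows embedding-based nearest neighbors, which plain LDA does not provide):

# Top words of topic 0 under the topic-word multinomial of the toy model above.
for word, prob in lda.show_topic(0, topn=10):
    print(word, prob)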

II. Topical Word Embedding (TWE)

The paper is from Zhiyuan Liu's group; the paper download and GitHub code are available.
In TWE, contextual word embeddings can be flexibly obtained to measure contextual word similarity, and document representations can also be built. There are three models, TWE-1, TWE-2 and TWE-3, which differ in structure from the traditional Skip-Gram.
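As a rough sketch of the TWE idea (this approximates TWE-1 with stock gensim; the authors' code uses a modified gensim with train_topic/save_topic, and the sentences and topic assignments below are made up):

from gensim.models import Word2Vec

# Tokenized sentences and a per-word topic assignment for each token
# (in TWE these assignments come from LDA Gibbs sampling).
sentences = [["apple", "released", "phone"], ["apple", "pie", "recipe"]]
assignments = [[3, 3, 3], [7, 7, 7]]

# Emit the plain word plus a "word#topic" pseudo-token for every position,
# so topic-specific vectors are learned alongside ordinary word vectors.
augmented = [
    [tok for w, z in zip(sent, topics) for tok in (w, "%s#%d" % (w, z))]
    for sent, topics in zip(sentences, assignments)
]

model = Word2Vec(augmented, vector_size=100, min_count=1, workers=2)
# "apple#3" and "apple#7" get separate, topic-specific vectors, which is what
# lets TWE measure contextual word similarity.
print(model.wv.most_similar("apple#3", topn=3))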

Accuracy in multi-label text categorization:

TWE model content display in the Baidu open-source project Familia:

Please enter the topic number (0-10000):
embedding result              multinomial result
------------------------------------------------
dialogue                      dialogue
consultation                  cooperation
non-party                     China
dialogue meeting              consultation
exchanges                     discuss
                              support
                              including

The first column is based on the embeddings and the second column on the multinomial distribution, each sorted in descending order of importance within the topic. A quick look at the training file:

import gensim        # modified gensim version
import pre_process   # read the wordmap and the tassign file and create the sentence
import sys

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print "Usage: python train.py wordmap tassign topic_number"
        sys.exit(1)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    wordmapfile = sys.argv[1]
    tassignfile = sys.argv[2]
    topic_number = int(sys.argv[3])
    id2word = pre_process.load_id2word(wordmapfile)
    pre_process.load_sentences(tassignfile, id2word)
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector..."
    w.train_topic(topic_number, sentence)
    print "Saving the topic vectors..."
    w.save_topic("output/topic_vector.txt")
    print "Saving the word vectors..."
    w.save_wordvector("output/word_vector.txt")
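Per the usage string in the script, it takes the LDA wordmap file, the topic-assignment (tassign) file and the number of topics; a call might look like the following (the file names and topic count are placeholders, not from the repository):

python train.py wordmap.txt model-final.tassign 100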

III. SentenceLDA

Paper link + GitHub: balikasg/topicModelling. What is SentenceLDA?

An extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes. How do SentenceLDA and LDA differ?

LDA and SenLDA differ in that the latter assumes a very strong dependence of the latent topics between the words of a sentence, whereas the former assumes independence between the words of a document in general.
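The contrast can be made concrete with a small generative sketch (illustrative pseudo-code only; the toy theta/phi values and helper structure are my own assumptions, not the paper's notation or the repository's sampler):

import numpy as np

def generate_lda(doc_len, theta, phi, rng):
    # LDA: every word gets its own topic draw, so words within a document are
    # generated independently given the document's topic mixture theta.
    return [rng.choice(phi.shape[1], p=phi[rng.choice(len(theta), p=theta)])
            for _ in range(doc_len)]

def generate_sentence_lda(sentence_lens, theta, phi, rng):
    # SentenceLDA: one topic draw per sentence, so all words of a sentence share
    # that topic -- the strong within-sentence dependence described above.
    doc = []
    for n in sentence_lens:
        z = rng.choice(len(theta), p=theta)        # one topic for the whole sentence
        doc.append([rng.choice(phi.shape[1], p=phi[z]) for _ in range(n)])
    return doc

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.3])                       # P(topic | document)
phi = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]) # P(word | topic): 2 topics x 3 words
print(generate_lda(6, theta, phi, rng))
print(generate_sentence_lda([3, 3], theta, phi, rng))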

SentenceLDA vs. LDA comparison experiments:

We illustrate the advantages of SentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections.

Results from the original author's GitHub:

https://github.com/balikasg/topicModelling/tree/master/senLDA
An excerpt of the code:

import numpy as np, vocabulary_sentenceLayer, string, nltk.data, sys, codecs, json, time
from nltk.tokenize import sent_tokenize
from lda_sentenceLayer import lda_gibbs_sampling1
from sklearn.cross_validation import train_test_split, StratifiedKFold
from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle
from functions import *

path2training = sys.argv[1]
training = codecs.open(path2training, 'r', encoding='utf8').read().splitlines()
topics = int(sys.argv[2])
alpha, beta = 0.5 / float(topics), 0.5 / float(topics)

voca_en = vocabulary_sentenceLayer.VocabularySentenceLayer(
    set(nltk.corpus.stopwords.words('english')), WordNetLemmatizer(), excluds_stopwords=True)

ldaTrainingData = change_raw_2_lda_input(training, voca_en, True)
ldaTrainingData = voca_en.cut_low_freq(ldaTrainingData, 1)
iterations = 201

classificationData, y = load_classification_data(sys.argv[3], sys.argv[4])
classificationData = change_raw_2_lda_input(classificationData, voca_en, False)
classificationData = voca_en.cut_low_freq(classificationData, 1)

final_acc, final_mif, final_perpl, final_ar, final_nmi, final_p, final_r, final_f = [], [], [], [], [], [], [], []
start = time.time()
for j in range(5):
    perpl, cnt, acc, mif, ar, nmi, p, r, f = [], 0, [], [], [], [], [], [], []
    lda = lda_gibbs_sampling1(K=topics, alpha=alpha, beta=beta, docs=ldaTrainingData, V=voca_en.size())
    for i in range(iterations):
        lda.inference()
        if i % 5 == 0:
            print "Iteration:", i, "Perplexity:", lda.perplexity()
            features = lda.heldOutPerplexity(classificationData, 3)
            print "Held-out:", features[0]
            scores = perform_class(features[1], y)
            acc.append(scores[0][0])
            mif.append(scores[1][0])
            perpl.append(features[0])
    final_acc.append(acc)
    final_mif.append(mif)
    final_perpl.append(perpl)
Finally, the model content display for LDA and SentenceLDA in the Baidu open-source project:

LDA results:

Please enter the topic number (0-1999):
--------------------------------------------
Dialogue    0.189676
cooperation    0.0805558
China    0.0276284
consultation    0.0269797
Exchange    0.021069
Joint    0.0208559
Country    0.0183163
Discussion    0.0154165
support    0.0146714
includes    0.014198

The numeric value in the second column indicates how important the word is within this topic, i.e. its probability under the topic's word distribution.
SentenceLDA results:

Please enter the topic number (0-1999):
--------------------------------------------
zhejiang    0.0300595
Zhejiang province  0.0290975
Ningbo    0.0195277
reporter    0.0174735
Ningbo  0.0132504
Changchun    0.0123353
Street    0.0107271
Public Security Bureau    0.00954326
Jinhua    0.00772971
Jilin province    0.00678163

IV. CopulaLDA

SentenceLDA and CopulaLDA are by the same author; see GitHub: balikasg/topicModelling.
I have not looked at it closely; the reported results look good:



Reference Documents:

Familia: a Chinese topic modeling toolkit
