Baidu recently open-sourced a new topic-model project, Familia. It provides document topic inference tools, semantic matching calculation tools, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE).
I. Introduction to Familia
First, a small plug for Familia: see Familia's GitHub.
The application paradigms of topic models in industry can be abstracted into two kinds: semantic representation and semantic matching.
Semantic Representation
A topic model reduces a document to a low-dimensional semantic representation, which can then be fed to downstream applications such as text classification, text content analysis, and CTR estimation.
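As a rough illustration of this paradigm (a sketch using gensim, not Familia's own API; the toy corpus, topic count, and parameters below are placeholders), a document's inferred topic proportions can serve directly as a dense feature vector for downstream models:

```python
# Minimal sketch: a topic model as a low-dimensional document representation.
# Uses gensim's LdaModel; this is NOT Familia's API, just the general idea.
from gensim import corpora, models

docs = [["user", "clicks", "ad"], ["news", "article", "about", "sports"]]  # toy tokenized docs
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bows, num_topics=10, id2word=dictionary, passes=5)

def topic_vector(tokens):
    """Dense topic-proportion vector, usable as features for classification or CTR models."""
    vec = [0.0] * lda.num_topics
    for topic_id, prob in lda.get_document_topics(dictionary.doc2bow(tokens), minimum_probability=0.0):
        vec[topic_id] = float(prob)
    return vec

print(topic_vector(["sports", "news"]))
```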
Semantic Matching
For semantic matching between texts, Familia provides two types of text similarity calculation (a sketch of both follows the list):
- Short text vs. long text similarity: usage scenarios include document keyword extraction and computing the similarity between a search-engine query and a web page.
- Long text vs. long text similarity: usage scenarios include computing the similarity between two documents, or between a user profile and a news article.
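A minimal sketch of both modes, assuming we already have p(topic|doc) for the long texts and p(word|topic) from a trained model; the scoring functions below (query likelihood under the document's topic mixture, and Hellinger-based similarity between topic distributions) are common choices and may differ from the exact formulas Familia implements:

```python
import numpy as np

def short_long_similarity(query_words, doc_topic, topic_word):
    """Short text vs. long text: average over query words of sum_z p(w|z) * p(z|doc)."""
    score = sum(
        sum(topic_word[z].get(w, 0.0) * doc_topic[z] for z in range(len(doc_topic)))
        for w in query_words
    )
    return score / max(len(query_words), 1)

def long_long_similarity(doc_topic_a, doc_topic_b):
    """Long text vs. long text: 1 - Hellinger distance between topic distributions."""
    a, b = np.asarray(doc_topic_a), np.asarray(doc_topic_b)
    return 1.0 - np.sqrt(0.5 * np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

# Toy example: a 3-topic model, one document's topic mixture, and a two-word query.
doc_topic = [0.7, 0.2, 0.1]
topic_word = [{"dialogue": 0.2, "cooperation": 0.1}, {"zhejiang": 0.3}, {"sports": 0.4}]
print(short_long_similarity(["dialogue", "cooperation"], doc_topic, topic_word))
print(long_long_similarity(doc_topic, [0.6, 0.3, 0.1]))
```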
The demo that ships with Familia provides the following features:

Semantic representation calculation
Uses the topic model to infer the topics of an input document, yielding a reduced-dimensional topic representation of the document.

Semantic matching calculation
Calculates the similarity between texts, covering both short-text vs. long-text and long-text vs. long-text similarity.

Model content presentation
Displays each topic of the model together with its nearest-neighbor words, giving users an intuitive view of what the topics capture.
II. Topical Word Embedding (TWE)
This is work from Zhiyuan Liu's group; see the paper download and GitHub.
With TWE, contextual word embeddings can be flexibly obtained to measure contextual word similarity, and document representations can be built as well. There are three models, TWE-1, TWE-2, and TWE-3; compare their structures against the traditional Skip-Gram:
Accuracy in multi-label text categorization:
The TWE model content display in Baidu's open-source Familia project:
Please enter the topic number (0-10000):

embedding result        multinomial result
--------------------------------------------
dialogue                dialogue
consultation            cooperation
consultation            China
consultations           discussion
dialogue                support
exchanges               including
The first column is ranked by the embeddings and the second by the multinomial distribution, both sorted in descending order of importance within the topic. Let's take a quick look at the training file:
```python
import gensim   # modified gensim version shipped with TWE
import pre_process   # reads the wordmap and the tassign file and creates the sentences
import sys

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print "Usage: python train.py wordmap tassign topic_number"
        sys.exit(1)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    wordmapfile = sys.argv[1]
    tassignfile = sys.argv[2]
    topic_number = int(sys.argv[3])
    id2word = pre_process.load_id2word(wordmapfile)
    pre_process.load_sentences(tassignfile, id2word)
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector ..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector ..."
    w.train_topic(topic_number, sentence)
    print "Saving the topic vectors ..."
    w.save_topic("output/topic_vector.txt")
    print "Saving the word vectors ..."
    w.save_wordvector("output/word_vector.txt")
```
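Once training finishes, the saved vectors can be combined into TWE-1 style contextual word embeddings: the representation of a word occurrence is the concatenation of its word vector and the vector of the topic assigned to that occurrence. The sketch below assumes the output files are plain "token v1 ... vN" text files under output/, which is an assumption about the script above rather than a documented format:

```python
import numpy as np

def load_vectors(path):
    """Read a plain 'token v1 v2 ... vN' vector file into a dict (assumed format)."""
    vecs = {}
    with open(path) as fh:
        for line in fh:
            parts = line.rstrip().split()
            if len(parts) > 2:
                vecs[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vecs

word_vec = load_vectors("output/word_vector.txt")
topic_vec = load_vectors("output/topic_vector.txt")   # keys like "0", "1", ...

def contextual_embedding(word, topic_id):
    """TWE-1 style contextual representation: [word vector ; topic vector]."""
    return np.concatenate([word_vec[word], topic_vec[str(topic_id)]])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Contextual word similarity: the same word under two different topic assignments, e.g.
# sim = cosine(contextual_embedding("apple", 3), contextual_embedding("apple", 17))
```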
III. SentenceLDA
Paper link + GitHub: balikasg/topicModelling. What is SentenceLDA?
"An extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes."
How do SentenceLDA and LDA differ?
"LDA and senLDA differ in that the latter assumes a very strong dependence of the latent topics between the words of sentences, whereas the former assumes independence between the words of documents in general."
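In other words, LDA draws a fresh topic for every word, while senLDA draws a single topic per sentence and generates all of that sentence's words from it. The toy generative sketch below uses the usual theta (document-topic) and phi (topic-word) notation purely to illustrate this difference; it is not the author's code:

```python
import numpy as np

def generate_lda_doc(theta, phi, vocab, n_words, rng):
    """Standard LDA: each word gets its own topic draw."""
    return [vocab[rng.choice(len(vocab), p=phi[rng.choice(len(theta), p=theta)])]
            for _ in range(n_words)]

def generate_senlda_doc(theta, phi, vocab, sentence_lengths, rng):
    """senLDA: one topic draw per sentence; every word in the sentence shares it."""
    sentences = []
    for n in sentence_lengths:
        z = rng.choice(len(theta), p=theta)
        sentences.append([vocab[rng.choice(len(vocab), p=phi[z])] for _ in range(n)])
    return sentences

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.3])                        # document-topic distribution
phi = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])  # topic-word distributions
vocab = ["dialogue", "cooperation", "zhejiang"]
print(generate_lda_doc(theta, phi, vocab, 7, rng))
print(generate_senlda_doc(theta, phi, vocab, [4, 3], rng))
```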
SentenceLDA vs. LDA comparison experiments: "We illustrate the advantages of SentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections."
Results from the original author's GitHub:
https://github.com/balikasg/topicModelling/tree/master/senLDA
An excerpt of the code:
```python
import numpy as np, vocabulary_sentenceLayer, string, nltk.data, sys, codecs, json, time
from nltk.tokenize import sent_tokenize
from lda_sentenceLayer import lda_gibbs_sampling1
from sklearn.cross_validation import train_test_split, StratifiedKFold
from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle
from functions import *

path2training = sys.argv[1]
training = codecs.open(path2training, 'r', encoding='utf8').read().splitlines()

topics = int(sys.argv[2])
alpha, beta = 0.5 / float(topics), 0.5 / float(topics)

# Sentence-level vocabulary: stop-word removal and lemmatization
voca_en = vocabulary_sentenceLayer.VocabularySentenceLayer(
    set(nltk.corpus.stopwords.words('english')), WordNetLemmatizer(), excluds_stopwords=True)

ldaTrainingData = change_raw_2_lda_input(training, voca_en, True)
ldaTrainingData = voca_en.cut_low_freq(ldaTrainingData, 1)
iterations = 201

classificationData, y = load_classification_data(sys.argv[3], sys.argv[4])
classificationData = change_raw_2_lda_input(classificationData, voca_en, False)
classificationData = voca_en.cut_low_freq(classificationData, 1)

final_acc, final_mif, final_perpl, final_ar, final_nmi, final_p, final_r, final_f = [], [], [], [], [], [], [], []
start = time.time()
for j in range(5):
    perpl, cnt, acc, mif, ar, nmi, p, r, f = [], 0, [], [], [], [], [], [], []
    # Gibbs sampler for senLDA: one latent topic per sentence
    lda = lda_gibbs_sampling1(K=topics, alpha=alpha, beta=beta, docs=ldaTrainingData, V=voca_en.size())
    for i in range(iterations):
        lda.inference()
        if i % 5 == 0:
            print "Iteration:", i, "Perplexity:", lda.perplexity()
            features = lda.heldOutPerplexity(classificationData, 3)
            print "Held-out:", features[0]
            scores = perform_class(features[1], y)
            acc.append(scores[0][0])
            mif.append(scores[1][0])
            perpl.append(features[0])
    final_acc.append(acc)
    final_mif.append(mif)
    final_perpl.append(perpl)
```
Finally, the model content display of LDA and SentenceLDA in the Baidu open-source project:
LDA results:
Please enter the topic number (0-1999):
--------------------------------------------
dialogue 0.189676
cooperation 0.0805558
China 0.0276284
consultation 0.0269797
exchange 0.021069
joint 0.0208559
country 0.0183163
discussion 0.0154165
support 0.0146714
includes 0.014198
The numeric value of the second column indicates how important the word is in this topic.
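A listing of this kind can be reproduced with any standard LDA implementation; below is a small sketch using gensim's show_topic on a toy corpus (not Familia's own tooling, and the example words are placeholders):

```python
from gensim import corpora, models

# Toy corpus only for illustration; Familia's released models are trained on industrial-scale corpora.
docs = [["dialogue", "cooperation", "china"], ["zhejiang", "ningbo", "reporter"]]
dictionary = corpora.Dictionary(docs)
lda = models.LdaModel([dictionary.doc2bow(d) for d in docs], num_topics=2, id2word=dictionary)

# Each topic is a multinomial over the vocabulary; show_topic returns the top
# words together with p(word | topic), which is what the second column shows.
for word, prob in lda.show_topic(topicid=0, topn=5):
    print("%-12s %.6f" % (word, prob))
```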
SentenceLDA results:
Please enter the topic number (0-1999):
--------------------------------------------
Zhejiang 0.0300595
Zhejiang Province 0.0290975
Ningbo 0.0195277
reporter 0.0174735
Ningbo 0.0132504
Changchun 0.0123353
street 0.0107271
Public Security Bureau 0.00954326
Jinhua 0.00772971
Jilin Province 0.00678163
IV. CopulaLDA
CopulaLDA is by the same author as SentenceLDA; see GitHub: balikasg/topicModelling.
I have not looked at it closely; just pasting the results here, which look good:
References:
Familia: a Chinese topic modeling toolkit