Environment: Anaconda3, Python 3.6, Windows 64-bit.
jieba is used for word segmentation and keyword extraction.
The similarity analysis is built on gensim's corpora, models, and similarities modules: a TF-IDF model is trained on the corpus, and similarity is computed with a sparse-matrix index.
```python
# -*- coding: utf-8 -*-
import jieba
from gensim import corpora, models, similarities
from collections import defaultdict

# define the file directory
work_dir = "D:/workspace/pythonsdy/data"
f1 = work_dir + "/t1.txt"
f2 = work_dir + "/t2.txt"

# read the file contents
c1 = open(f1, encoding='utf-8').read()
c2 = open(f2, encoding='utf-8').read()

# segment both documents with jieba
data1 = jieba.cut(c1)
data2 = jieba.cut(c2)
data11 = ""  # collect the segmented content
for i in data1:
    data11 += i + " "
data21 = ""  # collect the segmented content
for i in data2:
    data21 += i + " "
doc1 = [data11, data21]
# print(doc1)
t1 = [[word for word in doc.split()] for doc in doc1]
# print(t1)

# count word frequencies
freq = defaultdict(int)
for i in t1:
    for j in i:
        freq[j] += 1
# print(freq)

# keep only words that occur at least 3 times
t2 = [[token for token in k if freq[token] >= 3] for k in t1]
print(t2)

# corpora: build the dictionary from the corpus
dic1 = corpora.Dictionary(t2)
dic1.save(work_dir + "/yuliaoku.txt")

# the file to compare against the corpus
f3 = work_dir + "/t3.txt"
c3 = open(f3, encoding='utf-8').read()

# segment it with jieba
data3 = jieba.cut(c3)
data31 = ""
for i in data3:
    data31 += i + " "
new_doc = data31
print(new_doc)

# doc2bow turns the new document into a sparse vector
new_vec = dic1.doc2bow(new_doc.split())

# doc2bow over the dictionary's documents gives the new corpus
new_corpor = [dic1.doc2bow(t3) for t3 in t2]
tfidf = models.TfidfModel(new_corpor)

# number of features
featurenum = len(dic1.token2id.keys())

# SparseMatrixSimilarity: sparse-matrix similarity index
idx = similarities.SparseMatrixSimilarity(tfidf[new_corpor], num_features=featurenum)
sims = idx[tfidf[new_vec]]
print(sims)
```
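The heavy lifting above is done by gensim, but the underlying computation is simple enough to sketch in plain Python. The snippet below is a toy illustration with made-up tokens (not the author's data); it approximates gensim's default TF-IDF scheme, raw count × log2(N / df) followed by L2 normalization, and compares a query against a small corpus by cosine similarity:

```python
import math
from collections import Counter

def build_idf(corpus):
    """Inverse document frequency over a list of tokenized documents."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # idf = log2(N / df); a term that appears in every document gets weight 0
    return {t: math.log2(n / d) for t, d in df.items()}

def tfidf_vec(doc, idf):
    """L2-normalized sparse TF-IDF vector; unknown tokens are ignored."""
    tf = Counter(doc)
    vec = {t: c * idf[t] for t, c in tf.items() if idf.get(t, 0) > 0}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Dot product of two sparse unit-length vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

corpus = [["apple", "banana", "apple"],
          ["banana", "cherry"],
          ["cherry", "date", "date"]]
idf = build_idf(corpus)
query = tfidf_vec(["date", "cherry"], idf)
sims = [cosine(query, tfidf_vec(doc, idf)) for doc in corpus]
print(sims)  # the last document scores highest: it shares "date" and "cherry"
```

This mirrors what `SparseMatrixSimilarity` returns: one cosine score per corpus document, with the highest score marking the most similar one.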
From the output we can conclude that the compared file t3.txt is more similar to the contents of t2.txt.