I simply calculated the text similarity between Empresses in the Palace (hougongzhenhuan.txt) and Cold Moon Like Frost (lengyueru.txt), and between Empresses in the Palace and Lonely Empty Courtyard, Spring Grows Late (jimochun.txt), without removing punctuation or filtering stop words.
This uses TF-IDF. TF-IDF is a statistical method for assessing how important a word is to a document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus (this explanation comes from a Baidu search). In addition, the Dictionary.doc2bow method turns a document into a sparse vector; a sparse vector can be represented either as (id, frequency) pairs or as separate lists of indices and values.
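To make the sparse-vector idea concrete, here is a minimal pure-Python sketch (not gensim itself) of what a doc2bow-style bag-of-words conversion produces; the tokens and ids are made up for illustration:

```python
# Minimal bag-of-words sketch: assign each token an integer id, then
# represent a document as sparse (id, count) pairs, mimicking what
# gensim's Dictionary.doc2bow returns.
def build_vocab(texts):
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)  # assign the next free id
    return vocab

def doc2bow(vocab, tokens):
    counts = {}
    for token in tokens:
        if token in vocab:  # tokens not in the dictionary are dropped
            counts[vocab[token]] = counts.get(vocab[token], 0) + 1
    return sorted(counts.items())  # sparse (id, count) pairs

texts = [["cold", "moon", "frost"], ["lonely", "courtyard", "moon"]]
vocab = build_vocab(texts)
print(doc2bow(vocab, ["moon", "moon", "frost", "snow"]))
# → [(1, 2), (2, 1)]  ("snow" is unknown and is dropped)
```

Note that only the nonzero entries are stored, which is what makes the representation sparse.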
The final result is the similarity array printed at the end of the code below.
Steps:
1. Read the documents to be compared.
2. Segment each document into words.
3. Organize each document into a list of the form ["word", "word", ...].
4. Count the frequency of each word.
5. When the amount of data is large, filter out low-frequency words.
6. Build a dictionary from the processed corpus.
7. Load the document to compare against, repeating steps 2 and 3.
8. Convert that document into a sparse vector.
9. Build a corpus of sparse vectors from the training texts.
10. Run the TF-IDF model over the corpus to obtain the TF-IDF values.
11. Build a similarity index from the corpus and the number of features.
12. Query the index to obtain the final similarity scores.
from gensim import corpora, models, similarities
import jieba
from collections import defaultdict

doc1 = "D:/xx/xx/lengyueru.txt"
doc2 = "D:/xx/xx/jimochun.txt"
d1 = open(doc1, 'r', encoding='utf-8').read()
d2 = open(doc2, 'r', encoding='utf-8').read()
data1 = jieba.cut(d1)
data2 = jieba.cut(d2)
'''
for item in data1:
    print(item)
for item in data2:
    print(item)
'''
data11 = ""
for item in data1:
    data11 += item + " "
data21 = ""
for item in data2:
    data21 += item + " "
documents = [data11, data21]
texts = [[word for word in document.split()] for document in documents]
# split() separates on whitespace by default
# nested list comprehension: the outer loop iterates over documents,
# and the inner loop assigns each word of the current document to word
frequency = defaultdict(int)
# defaultdict(int) behaves like a dictionary whose missing values default to 0
for text in texts:
    for token in text:
        frequency[token] += 1
# when the amount of data is very large, filter out low-frequency words:
# texts = [[word for word in text if frequency[word] > 3] for text in texts]
dictionary = corpora.Dictionary(texts)
# save the dictionary
# dictionary.save("D:/xx/xx/cidian.txt")
doc3 = "D:/xx/xx/hougongzhenhuan.txt"
d3 = open(doc3, 'r', encoding='utf-8').read()
# segment the new document into ["word", "word", ...]
data3 = jieba.cut(d3)
data31 = ""
for item in data3:
    data31 += item + " "
new_doc = data31
new_vec = dictionary.doc2bow(new_doc.split())
corpus = [dictionary.doc2bow(text) for text in texts]
# corpora.MmCorpus.serialize("D:/xx/xx/d3.mm", corpus)
tfidf = models.TfidfModel(corpus)
# get the number of features via token2id
feature_num = len(dictionary.token2id.keys())
# build a sparse-matrix similarity index over the TF-IDF corpus,
# using the feature count, then query it to get the similarities
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_num)
sim = index[tfidf[new_vec]]
print(sim)
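For readers without gensim installed, the same pipeline can be sketched end to end in plain Python on toy whitespace-tokenized documents. This uses raw term frequency with idf = log(N / df) and cosine similarity, a standard textbook variant rather than gensim's exact weighting, and the toy sentences are invented for illustration:

```python
import math

def tfidf_vectors(docs):
    # Compute a TF-IDF vector (term -> weight) for each tokenized document,
    # using raw term frequency and idf = log(N / df).
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vecs = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0) + 1  # raw term frequency
        for term in vec:
            vec[term] *= math.log(n / df[term])  # weight by idf
        vecs.append(vec)
    return vecs

def cosine(a, b):
    # Cosine similarity between two sparse term->weight vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

docs = [
    "cold moon frost palace".split(),
    "lonely courtyard spring moon".split(),
    "cold frost palace night".split(),
]
vecs = tfidf_vectors(docs)
# similarity of the first document to the other two
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

With these toy documents the first text shares three weighted terms with the third but only one with the second, so the third scores higher, which is the same kind of ranking the gensim index returns.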