A simplified text similarity calculation

Source: Internet
Author: User
Tags idf

I computed the text similarity between "Hougong Zhenhuan Zhuan" (hougongzhenhuan.txt) and "Leng Yue Ru Shuang" (lengyueru.txt), and between "Hougong Zhenhuan Zhuan" and "Jimo Kong Ting Chun Yu Wan" (jimochun.txt). I did not remove punctuation or stop words.

This uses TF-IDF. TF-IDF is a statistical method for assessing how important a word is to one document in a collection of documents (a corpus). A word's importance increases in proportion to the number of times it appears in the document, but decreases with how often it appears across the corpus. (This is the explanation I found on Baidu.) In addition, the Dictionary.doc2bow method converts a document into a sparse vector. A sparse vector can be represented either as (id, frequency) pairs or as indices and values.
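To make the two ideas above concrete, here is a minimal standard-library sketch, not gensim's implementation. It builds (id, count) sparse vectors in the same spirit as doc2bow and then applies one common TF-IDF weighting, count × log2(N / df). The toy documents, the helper names to_bow and tfidf, and the exact weighting formula are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy, already-segmented documents (hypothetical data, not the novels above).
docs = [
    ["moon", "frost", "moon"],
    ["spring", "court", "moon"],
    ["spring", "frost", "frost", "frost"],
]

# Assign each distinct word an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))

def to_bow(doc):
    """Return a sparse vector: sorted (id, count) pairs for known words."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]

# Document frequency: in how many documents each id occurs.
df = Counter(word_id for bow in bows for word_id, _ in bow)

def tfidf(bow):
    """Weight each count by log2(N / df); an assumed weighting for this sketch."""
    n_docs = len(docs)
    return [(i, c * math.log2(n_docs / df[i])) for i, c in bow]

print(bows[0])
print(tfidf(bows[1]))
```

With this weighting, a word that occurs in every document gets weight zero, so with a very small corpus many shared words contribute nothing to the similarity.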

The final result:


Steps:

1. Read the documents that make up the corpus

2. Segment them into words

3. Organize each document into the form ["word", "word", ......]

4. Calculate the frequency of each word

5. For large amounts of data, filter out low-frequency words

6. Build a dictionary from the corpus

7. Load the document to be compared, and repeat steps 2 and 3

8. Convert the document to be compared into a sparse vector

9. Convert the corpus documents into sparse vectors to obtain a new corpus

10. Train the TF-IDF model on the new corpus to get the TF-IDF values

11. Build an index from the corpus and the number of features

12. Query the index to get the final similarity
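Steps 4 to 6 can be sketched with the standard library alone. This is an illustration under stated assumptions: the token lists are hypothetical stand-ins for jieba output, min_count is an arbitrary threshold, and the plain token2id dict only mimics the word-to-id mapping that a gensim Dictionary maintains.

```python
from collections import defaultdict

# Hypothetical pre-segmented documents standing in for jieba output.
texts = [
    ["cold", "moon", "frost", "moon"],
    ["lonely", "court", "spring", "moon"],
]

# Step 4: count how often each word occurs across all documents.
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Step 5: keep only words seen more than min_count times
# (the threshold is arbitrary here).
min_count = 1
filtered = [[w for w in text if frequency[w] > min_count] for text in texts]

# Step 6: map each remaining word to an integer id.
token2id = {}
for text in filtered:
    for w in text:
        token2id.setdefault(w, len(token2id))

print(filtered)
print(token2id)
```

On a corpus this small the filter removes almost everything, which is why filtering is only worthwhile when the amount of data is large.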

from gensim import corpora, models, similarities
import jieba
from collections import defaultdict

# Paths of the two documents that make up the corpus
doc1 = "D:/xx/xx/lengyueru.txt"
doc2 = "D:/xx/xx/jimochun.txt"
d1 = open(doc1, 'r', encoding='utf-8').read()
d2 = open(doc2, 'r', encoding='utf-8').read()

# Segment each document into words with jieba
data1 = jieba.cut(d1)
data2 = jieba.cut(d2)

'''
for item in data1:
    print(item)
for item in data2:
    print(item)
'''

# Join the tokens of each document into one space-separated string
data11 = ""
for item in data1:
    data11 += item + " "
data21 = ""
for item in data2:
    data21 += item + " "
documents = [data11, data21]
texts = [[word for word in document.split()] for document in documents]
# split() separates on whitespace by default
# This is a nested comprehension: the outer loop (on the right) walks through documents, and the inner loop (on the left) walks through the words of each document, binding each one to word
frequency = defaultdict(int)
# Creates a dictionary-like object whose values default to int (0)
for text in texts:
    for token in text:
        frequency[token] += 1
# Filter out low-frequency words; useful when the amount of data is very large
# texts = [[word for word in text if frequency[word] > 3] for text in texts]
dictionary = corpora.Dictionary(texts)

# Save the dictionary
# dictionary.save("D:/xx/xx/cidian.txt")
doc3 = "D:/xx/xx/hougongzhenhuan.txt"
d3 = open(doc3, 'r', encoding='utf-8').read()

# Segment the document and organize it into ["", "", ......, ""]
data3 = jieba.cut(d3)
data31 = ""
for item in data3:
    data31 += item + " "

new_doc = data31
# Sparse vector of the document to be compared
new_vec = dictionary.doc2bow(new_doc.split())
# Sparse vectors of the corpus documents
corpus = [dictionary.doc2bow(text) for text in texts]

# corpora.MmCorpus.serialize("d:/xx/xx/d3.mm", corpus)

tfidf = models.TfidfModel(corpus)
# Get the number of features from token2id
featurenum = len(dictionary.token2id.keys())
# Build the sparse-matrix similarity index from the corpus and the number of features
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=featurenum)
sim = index[tfidf[new_vec]]

print(sim)
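The printed result holds one score per corpus document, here one for lengyueru.txt and one for jimochun.txt. SparseMatrixSimilarity reports cosine similarity, so the sketch below shows how that measure works on sparse (id, weight) vectors. It is a hand-written illustration, not gensim's code; the helper cosine and the example vectors are hypothetical.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors: two corpus documents and one query document.
corpus_vecs = [[(0, 1.0), (1, 2.0)], [(1, 1.0), (2, 3.0)]]
query = [(0, 1.0), (1, 2.0)]

sims = [cosine(query, vec) for vec in corpus_vecs]
print(sims)
```

A score near 1 means the two documents weight the same words in the same proportions; a score near 0 means they share almost no weighted words.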
