Using the cosine theorem to calculate the similarity between two articles (methodology, detailed version):
http://blog.csdn.net/dearwind153/article/details/52316151
Python Implementation (code):
http://outofmemory.cn/code-snippet/35172/match-text-release
(Downloading and installing the jieba word segmentation package: http://www.cnblogs.com/kaituorensheng/p/3595879.html)
Java Implementation (code + method description):
https://my.oschina.net/leejun2005/blog/116291
(The links above are my reference material.)
------------------------------------------------------------------------------------------------
I'm using Python, so I need to install jieba, a Python package for Chinese word segmentation.
The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
from math import sqrt

# You have to install this Python library first
import jieba


def file_reader(filename, filename2):
    # Maps each word to [count in file 1, count in file 2]
    file_words = {}
    # Common Chinese particles to skip (this list was garbled in the source
    # and is reconstructed here; adjust as needed)
    ignore_list = ['的', '了', '和', '呢', '啊', '哦', '恩', '嗯', '吧']
    # Accept only tokens that begin with Chinese characters
    accepted_chars = re.compile(r"[\u4e00-\u9fa5]+")

    # Assumes both files are UTF-8 encoded
    with open(filename, encoding='utf-8') as file_object:
        all_the_text = file_object.read()
    seg_list = jieba.cut(all_the_text, cut_all=True)
    # print("/".join(seg_list))
    for s in seg_list:
        if accepted_chars.match(s) and s not in ignore_list:
            if s not in file_words:
                file_words[s] = [1, 0]
            else:
                file_words[s][0] += 1

    with open(filename2, encoding='utf-8') as file_object2:
        all_the_text = file_object2.read()
    seg_list = jieba.cut(all_the_text, cut_all=True)
    for s in seg_list:
        if accepted_chars.match(s) and s not in ignore_list:
            if s not in file_words:
                file_words[s] = [0, 1]
            else:
                file_words[s][1] += 1

    # Cosine of the angle between the two word-count vectors
    sum_2 = 0
    sum_file1 = 0
    sum_file2 = 0
    for word in file_words.values():
        sum_2 += word[0] * word[1]
        sum_file1 += word[0] ** 2
        sum_file2 += word[1] ** 2

    denominator = sqrt(sum_file1 * sum_file2)
    # Avoid dividing by zero when a file contains no accepted words
    rate = sum_2 / denominator if denominator else 0.0
    print('rate: ')
    print(rate)


file_reader('thefile.txt', 'thefile2.txt')
# This snippet comes from http://outofmemory.cn
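To look at the calculation on its own, without the file handling and word segmentation, here is a minimal sketch that applies the same formula, cos = (A·B) / (|A| |B|), to two lists of words that have already been segmented. The function name cosine_similarity and the sample word lists are my own illustrative assumptions and are not part of the original snippet. Returning 0.0 for an empty list is also a choice made here, to avoid dividing by zero.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two word-count vectors (hypothetical helper)."""
    counts_a = Counter(words_a)
    counts_b = Counter(words_b)
    # Dot product: only words present in both lists contribute.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample data standing in for the output of jieba.cut.
doc1 = ["我", "喜欢", "看", "电视", "不", "喜欢", "看", "电影"]
doc2 = ["我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影"]
print(cosine_similarity(doc1, doc2))
```

Two identical lists give a value of 1 (up to rounding), and two lists with no words in common give 0, so a value closer to 1 means the two articles use more similar vocabulary.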