When I first read The Beauty of Mathematics, the book mentioned this algorithm, but I was not doing related work at the time and it left no particular impression. A year ago an interviewer brought it up, and I learned that simhash can be used to solve the problem of deduplicating web pages and other massive data sets, and very efficiently.
Later I put together a rough Python version of the algorithm and tried it out. It worked well, so I am noting it down here.
# coding=utf-8

# Precompute the 32 single-bit masks used to probe each bit position.
single_bits = {}
for x in xrange(32):
    single_bits[x] = 1 << x


def simhash(text):
    # One signed counter per bit position.
    simhash_map = {}
    for x in xrange(32):
        simhash_map[x] = 0
    # Each character is a feature with weight 1: its hash votes +1 on
    # every bit position where the hash has a 1, and -1 where it has a 0.
    for c in text:
        h = hash(c)
        for bit in single_bits:
            if h & single_bits[bit] == 0:
                simhash_map[bit] -= 1
            else:
                simhash_map[bit] += 1
    # The fingerprint has a 1 wherever the votes came out positive.
    result = 0
    for x in xrange(32):
        if simhash_map[x] > 0:
            result |= single_bits[x]
    return result


def haiming_dis(simhash1, simhash2):
    # Hamming distance: the number of bit positions where the two
    # fingerprints differ.
    dis = 0
    sh = simhash1 ^ simhash2
    for x in xrange(32):
        if sh & (1 << x) > 0:
            dis += 1
    return dis


if __name__ == "__main__":
    str1 = ("In fact, the traditional way to compare the similarity of two "
            "texts is mostly to segment them into words, convert the words "
            "into feature vectors, and then measure the distance between the "
            "vectors, for example with the common Euclidean distance, Hamming "
            "distance, or cosine angle. This works well for pairwise "
            "comparison, but its biggest drawback is that it cannot scale to "
            "massive amounts of data. Imagine a large search engine like "
            "Google, whose index holds billions of pieces of Internet "
            "information and which adds millions of new pages every day "
            "through its crawlers. If every new piece of data had to be "
            "compared by cosine angle against every record already in the "
            "index, the amount of computation would be quite frightening.")
    str2 = ("In fact, the traditional way to compare the similarity of two "
            "texts is mostly to segment them into words, convert the words "
            "into feature vectors, and then measure the distance between the "
            "vectors, for example with the common geometric distance or "
            "cosine angle. This works well for pairwise comparison, but its "
            "biggest advantage is that it cannot scale to massive amounts of "
            "data. Imagine a large search engine like Baidu, whose index "
            "holds billions of pieces of Internet information and which adds "
            "millions of new pages every day through its crawlers. If every "
            "new piece of data had to be compared by cosine angle against "
            "every record already in the index, the amount of computation "
            "would be quite frightening.")
    sh1 = simhash(str1)
    sh2 = simhash(str2)
    print sh1
    print sh2
    print haiming_dis(sh1, sh2)
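One caveat about the version above: it hashes single characters with Python's built-in hash(), so every feature carries weight 1 and the fingerprint depends on the interpreter's string hashing (in Python 3 it would even change between runs, since string hashing is randomized per process). Simhash as usually described works on weighted word-level features with a fixed-width deterministic hash. Here is a minimal sketch of that variant; the name simhash64, the whitespace tokenization, and the use of the top 64 bits of an md5 digest are my choices for illustration, not part of the code above:

import hashlib

def simhash64(text):
    # 64-bit simhash over word-level features: each occurrence of a word
    # votes +1/-1 per bit, so term frequency acts as the feature weight.
    counters = [0] * 64
    for word in text.split():
        # Deterministic feature hash: top 64 bits of the md5 digest.
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest()[:16], 16)
        for bit in range(64):
            if h & (1 << bit):
                counters[bit] += 1
            else:
                counters[bit] -= 1
    result = 0
    for bit in range(64):
        if counters[bit] > 0:
            result |= 1 << bit
    return result

Because md5 is deterministic, fingerprints computed this way stay comparable across processes and machines, which matters once they are stored in an index rather than compared within a single run.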
The output is:
3544471205
3540285093
2
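The two paragraphs differ in only a few words, and their fingerprints differ in only 2 of 32 bits. To turn the distance into a yes/no deduplication decision, it is compared against a small threshold. The helper below is my own illustration on top of the script above; a threshold of 3 is the commonly cited choice for 64-bit fingerprints, not something fixed by the algorithm:

def is_duplicate(simhash1, simhash2, threshold=3):
    # Treat two documents as near-duplicates when their fingerprints
    # differ in at most `threshold` bit positions.
    return haiming_dis(simhash1, simhash2) <= threshold

print is_duplicate(sh1, sh2)  # True here, since the distance is 2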
simhash: a document deduplication algorithm