simhash--a document de-weight algorithm

Source: Internet
Author: User

When the first look at the beauty of Mathematics, the book mentions this algorithm, at that time did not do related to work, no specific impression. A year ago when the interview when the other people mentioned this algorithm, know that Simhash can be used to solve the Web page and other massive data deduplication problem, very efficient.

Then I probably realized the Python version of the algorithm, and tried it, and it felt good, mark.

#Coding=utf-8Importossingle_bits= {} forXinchXrange (32): Single_bits[x]= 1 <<xPrintsingle_bitsdefSimhash (str): Simhash_map= {}     forXinchXrange (32): Simhash_map[x]=0 forCinchstr:h=Hash (c) forBitinchsingle_bits:ifH & single_bits[bit] = =0:simhash_map[bit]-= 1Else: Simhash_map[bit]+ = 1result=0 forXinchXrange (32):        ifSIMHASH_MAP[X] >0:result|=Single_bits[x]returnresultdefHaiming_dis (SIMHASH1, simhash2): Dis=0 SH= Simhash1 ^Simhash2 forXinchXrange (32):        ifSh & (1 << x) >0:dis+ = 1returnDisif __name__=="__main__": Str="in fact, the traditional comparison of two text similarity method, most of the text after the word segmentation, the conversion to the eigenvector distance measurement, such as the common Euclidean distance, Hamming distance or cosine angle and so on. 22 comparisons are well adapted, but one of the biggest drawbacks of this approach is that they cannot be extended to massive amounts of data. For example, imagine a Google that contains a number of billions of of Internet information, a large search engine, every day through the crawler of its own index library to add millions of pages, if you want to include every piece of data in the Web library and each record of the cosine angle, the calculation is quite scary. "str2="in fact, the traditional method of comparing two text similarity is mostly to convert the text after word segmentation to the distance of eigenvectors, such as common geometric distance or cosine angle, etc. 22 comparisons are well adapted, but one of the biggest advantages of this approach is that they cannot be extended to massive amounts of data. For example, like Baidu, which contains a number of billions of of Internet information, a large search engine, every day through the way of the crawler for their own index of the new millions of Web pages, if you want to include each piece of data and the Web library every learning to calculate the cosine angle, the calculation of the amount is quite scary. "SH1=Simhash (str) SH2=Simhash (str2)PrintSH1PrintSH2PrintHaiming_dis (SH1, SH2)

The output is:

354447120535402850932

simhash--a document de-weight algorithm

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.