simhash: a document deduplication algorithm

Last Update:2015-08-28 Source: Internet

Author: User

I first came across this algorithm in the book The Beauty of Mathematics. At the time I was not doing any related work, so it left no particular impression. A year ago an interviewer brought it up, and I learned that Simhash can be used to deduplicate web pages and other massive data sets, and that it is very efficient.
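As a rough illustration of the deduplication use described above, the sketch below assumes every document already has an integer fingerprint and treats two documents as near-duplicates when their fingerprints differ in at most k bits. The names hamming_distance and drop_near_duplicates, the sample fingerprints, and the threshold k=3 are all assumptions made for this example, not something taken from the article.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(fingerprints, k=3):
    """Keep the first fingerprint of each near-duplicate group.

    fingerprints: iterable of (doc_id, fingerprint) pairs.
    k: hypothetical threshold; a distance <= k counts as a duplicate.
    """
    kept = []
    for doc_id, fp in fingerprints:
        # Keep the document only if it is far from everything kept so far.
        if all(hamming_distance(fp, seen) > k for _, seen in kept):
            kept.append((doc_id, fp))
    return kept


# Made-up 8-bit fingerprints: "b" differs from "a" in one bit, "c" in all eight.
docs = [("a", 0b10110000), ("b", 0b10110001), ("c", 0b01001111)]
print(drop_near_duplicates(docs))  # [('a', 176), ('c', 79)]
```

This linear scan compares each new fingerprint against every kept one, so it is only meant to show the comparison rule, not a way to handle massive data.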

Later I wrote a rough Python version of the algorithm and tried it out. It worked well, so I am noting it here.

#coding=utf-8
# Python 2 code: it uses xrange and the print statement.
import os

# Bit masks for each of the 32 fingerprint positions.
single_bits = {}
for x in xrange(32):
    single_bits[x] = 1 << x
print single_bits

def simhash(str):
    # One signed counter per bit position.
    simhash_map = {}
    for x in xrange(32):
        simhash_map[x] = 0
    # Every character votes +1 or -1 on each bit of its hash.
    for c in str:
        h = hash(c)
        for bit in single_bits:
            if h & single_bits[bit] == 0:
                simhash_map[bit] -= 1
            else:
                simhash_map[bit] += 1
    # Bits with a positive total are set in the fingerprint.
    result = 0
    for x in xrange(32):
        if simhash_map[x] > 0:
            result |= single_bits[x]
    return result

def haiming_dis(simhash1, simhash2):
    # Hamming distance: the number of differing bits.
    dis = 0
    sh = simhash1 ^ simhash2
    for x in xrange(32):
        if sh & (1 << x) > 0:
            dis += 1
    return dis

if __name__ == "__main__":
    str = "in fact, the traditional comparison of two text similarity method, most of the text after the word segmentation, the conversion to the eigenvector distance measurement, such as the common Euclidean distance, Hamming distance or cosine angle and so on. 22 comparisons are well adapted, but one of the biggest drawbacks of this approach is that they cannot be extended to massive amounts of data. For example, imagine a Google that contains a number of billions of of Internet information, a large search engine, every day through the crawler of its own index library to add millions of pages, if you want to include every piece of data in the Web library and each record of the cosine angle, the calculation is quite scary. "
    str2 = "in fact, the traditional method of comparing two text similarity is mostly to convert the text after word segmentation to the distance of eigenvectors, such as common geometric distance or cosine angle, etc. 22 comparisons are well adapted, but one of the biggest advantages of this approach is that they cannot be extended to massive amounts of data. For example, like Baidu, which contains a number of billions of of Internet information, a large search engine, every day through the way of the crawler for their own index of the new millions of Web pages, if you want to include each piece of data and the Web library every learning to calculate the cosine angle, the calculation of the amount is quite scary. "
    sh1 = simhash(str)
    sh2 = simhash(str2)
    print sh1
    print sh2
    print haiming_dis(sh1, sh2)

The output is:

3544471205
3540285093
2
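The listing above relies on the built-in hash(). In Python 3, string hashes are randomized per process by default, so fingerprints built that way would not be reproducible between runs. The following is a minimal Python 3 sketch of the same voting idea under two assumptions that are not in the original: it splits the text on whitespace instead of hashing single characters, and it uses MD5 from hashlib purely as a stable hash. The function names are chosen for this example.

```python
import hashlib


def simhash(text, bits=32):
    """Return a `bits`-wide fingerprint of `text`, one vote per word."""
    votes = [0] * bits
    for token in text.split():
        # MD5 serves only as a stable, well-mixed hash here, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 on bits set in its hash, -1 otherwise.
            votes[i] += 1 if (h >> i) & 1 else -1
    result = 0
    for i, v in enumerate(votes):
        if v > 0:
            result |= 1 << i
    return result


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash("the crawler adds millions of pages to the index every day")
    b = simhash("the crawler adds millions of new pages to the index every day")
    print(a, b, hamming_distance(a, b))
```

Identical texts always give a distance of 0 with this sketch; how small the distance must be for two texts to count as duplicates is a tuning decision the article does not specify.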

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
