Simhash: A Text Deduplication Algorithm

Source: Internet
Author: User

1. Introduction

A crawler collects a large amount of text, so how do we remove duplicates? One option is to compute the MD5 of each text and compare it with the set of MD5 values already crawled. The problem is that two texts that differ only slightly produce completely different MD5 values, so MD5 can detect exact duplicates but not near-duplicates. Another option, and the subject of this article, is Simhash, a locality-sensitive hashing algorithm used by Google for near-duplicate detection and also described in Wu Jun's book "The Beauty of Mathematics". Simhash reduces a text to a single number, which greatly reduces the amount of computation needed for deduplication. The algorithm consists of the following steps:

1. Tokenize: split the text into words and assign each word a weight that represents its importance in the text (consider using TF-IDF).

2. Hash: map each word to a hash value.

3. Weight: combine each hash with the word's weight to form a weighted sequence, where a 1 bit contributes +weight and a 0 bit contributes -weight. For example, if "the United States" has the hash "100101" and weight 4, its weighted sequence is "4 -4 -4 4 -4 4"; if "51" has the hash "101011" and weight 5, its weighted sequence is "5 -5 5 -5 5 5".

4. Merge: add the weighted sequences of all words position by position to obtain a single sequence. For the two words above, the result is "9 -9 1 -1 1 9".

5. Reduce dimensionality: record each position of the merged sequence as 1 if it is greater than 0 and as 0 otherwise. The final result is "1 0 1 0 1 1".
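The five steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, tokenization and weighting are assumed to have been done already (the input is a list of pairs), and MD5 is used as the per-word hash only because it is available in the standard library. The last lines reproduce the worked example from step 3, using the hash values and weights given there.

```python
import hashlib

def token_hash(token, bits=64):
    """Step 2: map a token to a `bits`-wide integer.
    MD5 is an arbitrary choice here (assumption); any well-mixed hash works."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def simhash_from_hashes(weighted_hashes, bits):
    """Steps 3-5: combine (hash_value, weight) pairs into one fingerprint."""
    totals = [0] * bits  # totals[0] corresponds to the most significant bit
    for value, weight in weighted_hashes:
        for i in range(bits):
            bit = (value >> (bits - 1 - i)) & 1
            # Step 3 and 4: a 1 bit adds the weight, a 0 bit subtracts it.
            totals[i] += weight if bit else -weight
    fingerprint = 0
    for total in totals:
        # Step 5: positive totals become 1; zero or negative totals become 0.
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint

def simhash(weighted_tokens, bits=64):
    """Steps 2-5 for a list of (token, weight) pairs produced by step 1."""
    pairs = [(token_hash(tok, bits), w) for tok, w in weighted_tokens]
    return simhash_from_hashes(pairs, bits)

# Worked example from the text: hashes 100101 (weight 4) and 101011 (weight 5).
example = simhash_from_hashes([(0b100101, 4), (0b101011, 5)], bits=6)
print(format(example, "06b"))  # prints 101011
```

A position whose total is exactly 0 is mapped to 0 here; the text does not specify that case, so this is a design choice of the sketch.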

2. Comparing Simhash values

Following the steps above, you can compute a Simhash value for every text. The similarity of two Simhash values is measured by counting the bit positions in which they differ, which is called the Hamming distance. For example, two values that differ in exactly three bit positions have a Hamming distance of 3.
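As a small illustration, the Hamming distance of two integer fingerprints can be computed by XOR-ing them and counting the 1 bits. The function name is hypothetical, and the two sample values are simply the two word hashes from section 1, chosen for convenience.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

# 101011 XOR 100101 = 001110, which has three 1 bits.
print(hamming_distance(0b101011, 0b100101))  # prints 3
```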

3. Improving comparison efficiency

Suppose we already have a library of Simhash values, and a query arrives asking whether the library contains any text whose Hamming distance from the query is between 1 and 3. How do we answer it?

Approach 1. At query time, enumerate every value whose Hamming distance from the query is between 1 and 3, then look each one up in the library. Drawback: there may be tens of thousands of such values, so queries are slow.
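A rough sketch of this first approach, assuming 64-bit fingerprints and a library held in a Python set (both are assumptions; the text does not fix a width or a storage format, and the function names are hypothetical). It also shows where "tens of thousands" comes from: for 64 bits the number of candidates is C(64,1) + C(64,2) + C(64,3).

```python
from itertools import combinations
from math import comb

def neighbors_within(fingerprint, max_distance, bits=64):
    """Yield every value at Hamming distance 1..max_distance from fingerprint."""
    for d in range(1, max_distance + 1):
        for positions in combinations(range(bits), d):
            mask = 0
            for p in positions:
                mask |= 1 << p  # flip exactly the chosen bit positions
            yield fingerprint ^ mask

def query_by_enumeration(query, library, max_distance=3, bits=64):
    """Return the stored fingerprints at distance 1..max_distance from query."""
    return {n for n in neighbors_within(query, max_distance, bits) if n in library}

# Number of candidate lookups per query for 64-bit fingerprints.
print(sum(comb(64, d) for d in range(1, 4)))  # prints 43744
```

Each query therefore costs tens of thousands of set lookups, regardless of how small the library is.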

Approach 2. Precompute, for every Simhash in the library, all values at Hamming distance 1 to 3 and store them, so that each query is a single O(1) lookup. Drawback: the storage required is very large.
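A rough sketch of this second approach, under the same assumptions as before (hypothetical function names, fingerprints as Python integers). The index maps every precomputed neighbor back to the stored fingerprints it came from, so a query is one dictionary lookup. The demonstration uses 6-bit fingerprints to keep the index small.

```python
from collections import defaultdict
from itertools import combinations

def neighbors_within(fingerprint, max_distance, bits=64):
    """Yield every value at Hamming distance 1..max_distance from fingerprint."""
    for d in range(1, max_distance + 1):
        for positions in combinations(range(bits), d):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield fingerprint ^ mask

def build_neighbor_index(library, max_distance=3, bits=64):
    """Precompute neighbor -> {stored fingerprints within max_distance}."""
    index = defaultdict(set)
    for stored in library:
        for neighbor in neighbors_within(stored, max_distance, bits):
            index[neighbor].add(stored)
    return index

def query_index(index, query):
    """Single lookup: stored fingerprints at distance 1..max_distance from query."""
    return index.get(query, set())

index = build_neighbor_index({0b100101}, max_distance=3, bits=6)
print(len(index))                       # prints 41 (6 + 15 + 20 neighbors)
print(query_index(index, 0b101011))     # prints {37}, i.e. 0b100101
```

With 64-bit fingerprints the same index would hold 43744 entries per stored fingerprint, which is the storage cost the text refers to.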

Cond....
