1. Introduction
A crawler collects a large amount of text, so how do we deduplicate it? One option is to compute the MD5 of each text and compare it against the set of MD5 values already crawled. The problem is that two texts that differ only slightly produce completely different MD5 values, so this approach cannot detect similar (near-duplicate) texts. Another option, and the subject of this article, is Simhash, a locality-sensitive hashing algorithm that Google used for near-duplicate detection and that is also described in Wu Jun's "The Beauty of Mathematics". It reduces a text to a single number, which greatly reduces the amount of computation needed for deduplication. The Simhash algorithm consists of the following steps:
1. Tokenize: split the text into words and assign each word a weight that represents its importance in the text (consider using TF-IDF).
2. Hash: map each word to a hash value.
3. Weight: combine each word's hash with its weight to form a sequence of signed numbers, using the positive weight where the hash bit is 1 and the negative weight where it is 0. For example, "the United States" has the hash value "100101" and weight 4, which gives "4 -4 -4 4 -4 4"; "51" has the hash value "101011" and weight 5, which gives "5 -5 5 -5 5 5".
4. Merge: add the sequences of all words position by position to obtain a single sequence. For the two words above, this gives "9 -9 1 -1 1 9".
5. Reduce dimensionality: record each position of the merged sequence as 1 if it is greater than 0 and as 0 if it is less than 0. The final result is "1 0 1 0 1 1".
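The five steps above can be sketched in Python as follows. This is only a minimal illustration, not a reference implementation: the function names are made up for this example, MD5 is an assumed choice for the per-word hash (the text does not name one), tokenization and TF-IDF weighting are left to the caller, and a merged value of exactly 0 is mapped to 0, which the steps above leave unspecified.

```python
import hashlib
from typing import Iterable, List, Tuple


def weighted_vector(hash_bits: str, weight: int) -> List[int]:
    """Step 3: +weight where the hash bit is 1, -weight where it is 0."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


def merge(vectors: Iterable[List[int]]) -> List[int]:
    """Step 4: add the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def reduce_to_bits(merged: List[int]) -> str:
    """Step 5: positive sums become 1; others become 0 (0 -> 0 is an assumption)."""
    return "".join("1" if value > 0 else "0" for value in merged)


def token_hash(token: str, bits: int = 64) -> str:
    """Step 2 with an assumed hash: the top `bits` bits of the token's MD5."""
    digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
    return format(digest >> (128 - bits), f"0{bits}b")


def simhash(weighted_tokens: Iterable[Tuple[str, int]], bits: int = 64) -> str:
    """Steps 2 to 5 for (word, weight) pairs that the caller produced in step 1."""
    vectors = [weighted_vector(token_hash(t, bits), w) for t, w in weighted_tokens]
    return reduce_to_bits(merge(vectors))


# Worked example from the text, using the given 6-bit hashes directly.
usa = weighted_vector("100101", 4)       # [4, -4, -4, 4, -4, 4]
fifty_one = weighted_vector("101011", 5)  # [5, -5, 5, -5, 5, 5]
merged = merge([usa, fifty_one])          # [9, -9, 1, -1, 1, 9]
fingerprint = reduce_to_bits(merged)      # "101011"
print(merged, fingerprint)
```

For real text, `simhash([("word", weight), ...])` returns a 64-character bit string; `int(result, 2)` turns it into the single number mentioned in the introduction.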
2. Comparing Simhash values
Following the steps above, you can compute a Simhash value for each text. The similarity of two Simhash values is measured by the number of bit positions in which they differ, which is called the Hamming distance. For example, 10101 and 00110 differ in three positions, so their Hamming distance is 3.
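As a small sketch, the Hamming distance of two fingerprints stored as integers can be computed by XOR-ing them and counting the 1 bits. The function names and the default threshold of 3 are assumptions chosen for this illustration, and the pair of values is just a 5-bit example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two fingerprints as similar if they differ in at most max_distance bits."""
    return hamming_distance(a, b) <= max_distance


# 10101 XOR 00110 = 10011, which has three 1 bits.
print(hamming_distance(0b10101, 0b00110))  # 3
```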
3. Improving comparison efficiency
Suppose we already have a library of Simhash values and a query arrives: does the library contain any text whose Hamming distance from the query is between 1 and 3? How should we look it up?
Method 1. At query time, compute every value whose Hamming distance from the query is between 1 and 3, then look each one up in the library. Disadvantage: there may be tens of thousands of such values, so queries are very inefficient.
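To make the cost of this first method concrete, the sketch below enumerates every value within Hamming distance 1 to 3 of a query by flipping 1, 2, or 3 bits, and counts how many such values exist. The fingerprint width is not stated in the text; 64 bits is assumed here for the count, and the function names are invented for this example.

```python
from itertools import combinations
from math import comb
from typing import Iterator, List, Set


def neighbours(fingerprint: int, bits: int, max_distance: int = 3) -> Iterator[int]:
    """Yield every value at Hamming distance 1..max_distance from fingerprint."""
    for distance in range(1, max_distance + 1):
        for positions in combinations(range(bits), distance):
            mask = 0
            for position in positions:
                mask |= 1 << position
            yield fingerprint ^ mask  # flip the chosen bits


def neighbour_count(bits: int, max_distance: int = 3) -> int:
    """How many values neighbours() yields: C(bits,1) + ... + C(bits,max_distance)."""
    return sum(comb(bits, d) for d in range(1, max_distance + 1))


def query_by_enumeration(query: int, library: Set[int], bits: int) -> List[int]:
    """Method 1: generate all candidates at query time and probe the library."""
    return [c for c in neighbours(query, bits) if c in library]


print(neighbour_count(64))  # 64 + 2016 + 41664 = 43744 lookups per query
print(query_by_enumeration(0, {0b00000111, 0b11110000}, bits=8))  # [7]
```

Under the 64-bit assumption, each query costs 43,744 library lookups, which matches the "tens of thousands" figure above.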
Method 2. Precompute, for every Simhash in the library, all values whose Hamming distance from it is between 1 and 3, so that each query takes only O(1) time. Disadvantage: the storage space required is very large.
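The second method can be sketched as an index built ahead of time that maps every possible neighbour value back to the library fingerprints it came from, so a query becomes a single dictionary lookup. This is an illustrative sketch with assumed names, run on 8-bit fingerprints to keep the index small; with 64-bit fingerprints each library entry would add 43,744 index entries, which is the storage problem described above.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def build_neighbour_index(
    library: Iterable[int], bits: int, max_distance: int = 3
) -> Dict[int, Set[int]]:
    """Map each value at distance 1..max_distance from a stored fingerprint
    back to that fingerprint."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for fingerprint in library:
        for distance in range(1, max_distance + 1):
            for positions in combinations(range(bits), distance):
                mask = sum(1 << position for position in positions)
                index[fingerprint ^ mask].add(fingerprint)
    return index


library = {0b00000111, 0b11110000}
index = build_neighbour_index(library, bits=8)

# A query is now one dictionary lookup.
print(index.get(0b00000000, set()))  # {7}: 00000111 is 3 bits away
print(index.get(0b11110001, set()))  # {240}: 11110000 is 1 bit away

# Storage cost: 92 stored pairs per fingerprint at 8 bits.
print(sum(len(matches) for matches in index.values()))  # 184
```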
To be continued...