Efficient web page deduplication algorithm: simhash

Source: Internet
Author: User
Tags idf

I remember someone once asked me about algorithms for web page deduplication. My answer was cosine similarity between document vectors. But what if there are billions of pages to deduplicate? That approach breaks down, because every pair of pages requires a vector inner product, which is far too slow. My next thought was that efficient lookup calls for hashing. With an ordinary hash, however, identical strings get identical hash codes while even slightly different strings generally get unrelated hash codes, which does not meet the requirement. What is needed is an algorithm that gives similar strings identical or similar hash values. That is how I found simhash, the algorithm Google uses for web page deduplication. Before using simhash, choose the number of bits in the simhash code according to the size of the document collection; 32 or 64 bits is typical.

1 Main Concepts

Hamming distance: in information coding, the number of bit positions at which two valid codewords differ is called the code distance, also known as the Hamming distance.

Note: for example, with a codeword length of 8, the Hamming distance between 00111100 and 11110000 is 4, between 10101111 and 01101111 is 2, and between 11110000 and 11110000 is 0.
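
The three distances in the note can be checked with a few lines of Python. This is a minimal sketch; the function name hamming_distance is chosen here for illustration and is not from the original article.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where a and b differ."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(a ^ b).count("1")


print(hamming_distance(0b00111100, 0b11110000))  # 4
print(hamming_distance(0b10101111, 0b01101111))  # 2
print(hamming_distance(0b11110000, 0b11110000))  # 0
```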

2 Algorithmic Flow

1) Word segmentation

Segment the text into words, then assign a weight to each word. For example, you can compute weights with TF-IDF, but the value has to be transformed: map the TF-IDF value to an integer with a monotonically increasing function. For example: I (3) is (2) Chinese (5), I (3) Love (4) My (3) [particle] (1) Motherland (5). The number in parentheses is the weight; the higher the weight, the more important the word is in the document. Words whose TF-IDF value is too low are then removed, which filters out function words.
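
One possible way to turn TF-IDF scores into integer weights is sketched below. The article does not specify the mapping, so everything here is an assumption made for illustration: the function name tfidf_weights, the smoothed IDF formula, the scale factor, and the min_weight cutoff.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus, scale=10, min_weight=1):
    """Return {word: integer weight} for one tokenized document.

    corpus is a list of token lists. The IDF smoothing, the scale
    factor, and the cutoff are illustrative assumptions.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        score = (count / len(doc_tokens)) * idf
        weight = math.floor(score * scale)  # monotonically non-decreasing map
        if weight >= min_weight:            # drop words with low weights
            weights[word] = weight
    return weights


corpus = [
    ["i", "love", "my", "motherland"],
    ["i", "am", "here"],
    ["i", "love", "tea"],
]
weights = tfidf_weights(corpus[0], corpus, scale=10, min_weight=3)
print(weights)  # {'love': 3, 'my': 4, 'motherland': 4}
```

In this toy corpus "i" appears in every document, so it receives the lowest weight and is filtered out by the cutoff.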

2) Calculate hash

Calculate the hash value of each word, such as "I"--01001000, "is"--10110011, "Chinese"--11001100, "Love"--10101010, "Motherland"--01011000
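
The 8-bit values above are illustrative. In practice each word is hashed with a real hash function and truncated to the chosen fingerprint width. The sketch below assumes MD5 from Python's hashlib as the underlying hash; the article does not say which function to use, and the name word_hash is hypothetical. It will not reproduce the example values above.

```python
import hashlib


def word_hash(word: str, bits: int = 64) -> int:
    """Hash a word to an integer of the given width (bits <= 128)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()  # 128-bit digest
    # Keep only the top `bits` bits of the digest.
    return int.from_bytes(digest, "big") >> (128 - bits)


h = word_hash("motherland", bits=8)
print(format(h, "08b"))  # an 8-character string of 0s and 1s
```

A stable hash such as this one matters because Python's built-in hash() for strings is randomized per process, which would make fingerprints incomparable across runs.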

3) Weighting

Apply each word's weight to its hash bit by bit: a 1 bit becomes the positive weight and a 0 bit becomes the negative weight. This gives "I"-- -3 3 -3 -3 3 -3 -3 -3, "is"-- 2 -2 2 2 -2 -2 2 2, "Chinese"-- 5 5 -5 -5 5 5 -5 -5, "Love"-- 4 -4 4 -4 4 -4 4 -4, "Motherland"-- -5 5 -5 5 5 -5 -5 -5

4) Merging

Add the weighted sequences of all words position by position. The cumulative result is 3, 7, -7, -5, 15, -9, -7, -15
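
The weighting and merging steps can be reproduced with the example hashes and weights from the text. This is a minimal sketch; the helper name weighted_vector is chosen here for illustration.

```python
hashes = {
    "I": "01001000",
    "is": "10110011",
    "Chinese": "11001100",
    "Love": "10101010",
    "Motherland": "01011000",
}
weights = {"I": 3, "is": 2, "Chinese": 5, "Love": 4, "Motherland": 5}


def weighted_vector(bits: str, weight: int) -> list:
    """A 1 bit contributes +weight, a 0 bit contributes -weight."""
    return [weight if b == "1" else -weight for b in bits]


vectors = [weighted_vector(hashes[w], weights[w]) for w in hashes]
# Sum column by column across all words.
totals = [sum(column) for column in zip(*vectors)]
print(totals)  # [3, 7, -7, -5, 15, -9, -7, -15]
```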

5) Dimension Reduction

Convert the result of step 4 into a string of 0s and 1s: a value greater than 0 becomes 1 and a value less than 0 becomes 0. The result here is 11001000. In this way every document gets an ID (its fingerprint).
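
Steps 3 to 5 can be combined into one function that takes (hash, weight) pairs and returns the fingerprint as an integer. This is a sketch under stated assumptions: the function name simhash and its signature are chosen here for illustration, and a column total of exactly 0 is mapped to 0, a case the text leaves unspecified.

```python
def simhash(features, bits=8):
    """Compute a fingerprint from (hash_value, weight) pairs.

    Each hash_value must fit in `bits` bits. A column total of
    exactly 0 maps to 0 (an assumption; the text does not say).
    """
    totals = [0] * bits
    for hash_value, weight in features:
        for i in range(bits):
            # Read bits from most significant to least significant.
            bit = (hash_value >> (bits - 1 - i)) & 1
            totals[i] += weight if bit else -weight
    fingerprint = 0
    for total in totals:
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint


features = [
    (0b01001000, 3),  # "I"
    (0b10110011, 2),  # "is"
    (0b11001100, 5),  # "Chinese"
    (0b10101010, 4),  # "Love"
    (0b01011000, 5),  # "Motherland"
]
print(format(simhash(features), "08b"))  # 11001000
```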

6) Compare Hamming distance

XOR the result of step 5 with the ID of each other document, then count the 1 bits in the result to get the Hamming distance. Typically, for long documents, two documents whose Hamming distance is less than 3 are considered the same document. For short texts such as microblog (Weibo) posts, the threshold can reportedly be set larger, for example treating a distance of 10 or less as the same document.
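
The comparison can be sketched as a threshold test on the XOR of two fingerprints. The function name is_near_duplicate and the max_distance parameter are assumptions made for illustration; max_distance=2 corresponds to "less than 3" for long documents, and a value such as 10 to the looser setting for short texts.

```python
def is_near_duplicate(fp_a: int, fp_b: int, max_distance: int = 2) -> bool:
    """True if the fingerprints differ in at most max_distance bits."""
    distance = bin(fp_a ^ fp_b).count("1")  # XOR, then count the 1 bits
    return distance <= max_distance


print(is_near_duplicate(0b11001000, 0b11001010))  # True  (distance 1)
print(is_near_duplicate(0b11001000, 0b00110111))  # False (distance 8)
```

Note that this pairwise check is only the comparison itself; comparing one fingerprint against billions of others still needs an indexing scheme, which the text does not cover.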

3 Advantages and disadvantages of the algorithm

Advantages:

1) The algorithm is efficient and well suited to large-scale web page deduplication

2) The algorithm is easy to use in distributed computing frameworks such as MapReduce

3) The algorithm needs very little storage per document

Disadvantages:

1) When long and short documents have to be handled at the same time, the algorithm alone cannot fully solve the web page deduplication problem

2) Two seemingly unrelated documents may have a Hamming distance of 0, although the probability of this is very small
