Introduction and application of the Simhash algorithm for mass data deduplication


What is Simhash?

Simhash is a fingerprint-generation (fingerprint-extraction) algorithm mentioned in the paper "Detecting Near-Duplicates for Web Crawling", published by Google in 2007. Google uses it widely in its deduplication work across billions of web pages. It is a locality-sensitive hash (LSH), and its main idea is dimensionality reduction. What is dimensionality reduction? To give a popular example: a piece of text of any length, after simhash dimensionality reduction, may become just a 32- or 64-bit binary string of 0s and 1s, which is very similar to our ID card number.

Imagine you want to find one person among China's 1.3 billion+ people. If you know that person's ID number, the query is very fast: a single core dimension determines the match. If you do not have the ID number, you may have to combine several other non-core dimensions, such as name, address, height, weight, and gender, to decide whether a record refers to that person, and this kind of query is comparatively slow. From this example you can see the point: the simhash algorithm in effect gives every document an ID card, reducing something complex to a simple, low-dimensional identifier.

How Simhash works

The simhash workflow is as follows:

(1) Prepare a piece of text.
(2) Filter and clean it, and extract n feature keywords. This step generally uses word segmentation; commonly used segmenters include IK, MMSEG4J, and ANSJ.
(3) Weight the features. If you have an industry-specific corpus of your own, it can be used here; if not, term frequency will do.
(4) Hash each keyword into a signature composed of 0s and 1s (64 bits, say).
(5) Weight the vectors: for each bit of each 64-bit signature, if the bit is 1, its contribution is the feature's weight; if the bit is 0, its contribution is the weight negated. This yields a weighted vector for every feature.
(6) Merge all the feature vectors by summing them into one final vector, then reduce dimension again: for each position of the final vector, output 1 if the value is greater than 0 and 0 otherwise. The result is the document's simhash fingerprint. A code sketch of these steps appears after the Hamming distance discussion below.

Applying Simhash

Through the steps above we can use the simhash algorithm to generate a fingerprint for every web page. The question then becomes: how do we judge the similarity of two texts? This is where the Hamming distance is applied.

(1) What is the Hamming distance? The number of bit positions at which two code words differ is called the Hamming distance between the two code words. In a valid code set, the minimum Hamming distance between any two code words is called the Hamming distance of the code set. For example: 10101 and 00110 differ in the 1st, 4th, and 5th positions, so their Hamming distance is 3.
(2) Geometric meaning of the Hamming distance: an n-bit code word can be represented as a vertex of a hypercube in n-dimensional space. The Hamming distance between two code words is then the length of the shortest path along the edges of the hypercube between their two vertices.
(3) Typical application of the Hamming distance: error detection and error correction in coding.

The fingerprints extracted by the simhash algorithm (simhash is best suited to long text of 500+ words; on short text the deviation may be large, so test against your actual scenario) are finally compared by Hamming distance to judge similarity. According to the data given in Google's paper, with 64-bit signatures, two documents can be considered similar or duplicates when their Hamming distance is at most 3. Of course, this value is only a reference; the right threshold for your own application may differ and should be tested.
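To make the six steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation from Google's paper: plain whitespace splitting stands in for a real segmenter such as IK or ANSJ, term frequency serves as the weight (as suggested above when no corpus is available), and each keyword's 64-bit signature is taken from the low bits of an MD5 hash.

import hashlib
from collections import Counter

def simhash(text, hashbits=64):
    # (1)(2) Prepare and clean the text, extracting features; whitespace
    # splitting stands in for a real segmenter such as IK, MMSEG4J, or ANSJ.
    tokens = text.lower().split()
    # (3) Weight the features; with no domain corpus, term frequency will do.
    weights = Counter(tokens)
    v = [0] * hashbits
    for token, weight in weights.items():
        # (4) Hash the keyword into a 64-bit 0/1 signature (low MD5 bits here).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(hashbits):
            # (5) Bit is 1 -> add the weight; bit is 0 -> subtract it.
            v[i] += weight if (h >> i) & 1 else -weight
    # (6) Reduce the summed vector back to bits: positive -> 1, otherwise 0.
    fingerprint = 0
    for i in range(hashbits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Count the bit positions at which the two fingerprints differ.
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(a, b))  # a small distance suggests near-duplicates

Running this on two sentences that differ in a single word typically yields a small distance, while unrelated texts tend to land far apart; whether the threshold of 3 fits your data is, as noted above, something to test.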
This basically solves the similarity problem. But at a scale of tens of billions of documents, the efficiency problem remains unsolved: data is constantly being added, and it is not feasible to compare each new item against the entire library; with that approach, processing time grows linearly with the data.

Google's paper also puts forward an idea for this, based on the pigeonhole principle (also known as the drawer principle): there are ten apples on the table to be put into nine drawers; no matter how they are placed, we will find at least one drawer with at least two apples. The general statement is: if each drawer represents a set and each apple represents an element, then placing n+1 elements into n sets guarantees that at least one set contains at least two elements. The drawer principle is sometimes called the pigeonhole principle, and it is an important principle in combinatorial mathematics. The truth is very simple, but applied to real problems it can play a huge role; that is the charm of mathematics.

For efficient mass deduplication, cut the 64-bit fingerprint into 4 blocks of 16 bits each. By the drawer principle, if two documents are within Hamming distance 3, their at most 3 differing bits fall into at most 3 of the 4 blocks, so at least one 16-bit block must match exactly. Store each fingerprint four times in a K-V database or inverted index, with the key K being one 16-bit block and the value V the set of the remaining 48-bit fingerprint fragments that share it. At query time, look up each of the four 16-bit blocks of the query fingerprint by exact match and verify only the candidates returned; a sketch of such a blocked index follows below.

Now suppose the sample library holds 2^34 fingerprints (about 17.2 billion). Assuming the data is evenly distributed, each 16-bit key (one of 2^16 possible combinations of 16 binary digits) maps to at most about 2^34 / 2^16 = 2^18 = 262,144 candidate results, and with 4 16-bit block indexes the total is 4 × 262,144 = 1,048,576, roughly a million. Where the original approach would need about 17 billion comparisons, about a million now suffice, which greatly improves computational efficiency.
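Below is a minimal sketch of such a blocked index, reusing the simhash and hamming_distance functions from the sketch above. One simplification to note: the paper stores the remaining 48 bits of each fingerprint against its 16-bit key, while this sketch stores the full 64-bit fingerprint as the value, which keeps the verification step simple.

from collections import defaultdict

BLOCKS = 4        # a 64-bit fingerprint split into 4 blocks of 16 bits
BLOCK_BITS = 16
MASK = (1 << BLOCK_BITS) - 1

def blocks_of(fp):
    # Yield (block position, 16-bit block value) for a 64-bit fingerprint.
    for i in range(BLOCKS):
        yield i, (fp >> (i * BLOCK_BITS)) & MASK

class SimhashIndex:
    def __init__(self):
        # One table per block position: 16-bit key -> set of fingerprints.
        self.tables = [defaultdict(set) for _ in range(BLOCKS)]

    def add(self, fp):
        # Store the fingerprint under each of its four 16-bit blocks.
        for i, key in blocks_of(fp):
            self.tables[i][key].add(fp)

    def near_duplicates(self, fp, k=3):
        # By the drawer principle, any fingerprint within distance k = 3
        # must match this one exactly in at least one of the four blocks.
        candidates = set()
        for i, key in blocks_of(fp):
            candidates |= self.tables[i][key]
        # Verify only the candidates, not the whole library.
        return {c for c in candidates if hamming_distance(fp, c) <= k}

index = SimhashIndex()
index.add(a)                     # a, b: fingerprints from the earlier sketch
print(index.near_duplicates(b))  # returns {a} if the two texts are close enough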

References:
http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html
http://taop.marchtea.com/06.03.html
http://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
http://www.cnblogs.com/colorfulkoala/archive/2012/07/29/2614382.html
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
http://grunt1223.iteye.com/blog/964564
