Implementation of the locality-sensitive hashing algorithm

Source: Internet
Author: User

Recently, for work reasons, we needed to speed up string similarity computation. We had previously used algorithms such as longest common subsequence and edit distance, but they did not always meet the performance requirements of real-time comparison. A few days ago a colleague recommended locality-sensitive hashing (LSH). I tried it, and the speed was good. In the spirit of recording and sharing, this post briefly summarizes the implementation process and the ideas behind it.

[Shingle]

Map each string to be queried to a set. For example, the string "abcdeeeefg" maps to the set (a, b, c, d, e, f, g); note that a set contains no repeated elements. This step is called shingling: it builds the set of short substrings that occur in the document, that is, the shingle set.

This is the simplest mapping. A string can be split into single characters or mapped to a more complex set, such as (ab, bc, de, ef, fg) or (abc, bcd, def, efg).

The shingle set can then be hashed one step further, for example a hashes to 1, b hashes to 2, and c hashes to 1. Mapping shingles to buckets has one great benefit: it reduces the data volume. After mapping to buckets, we can use the bucket numbers to represent the string.
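The two steps above can be sketched in Python as follows. This is only an illustration: the function names shingles and to_buckets, the shingle length k, and the use of zlib.crc32 as the bucket hash are assumptions made here, not details from the original implementation.

```python
import zlib

def shingles(text, k=1):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def to_buckets(shingle_set, num_buckets):
    """Map every shingle to a bucket number in [0, num_buckets)."""
    # zlib.crc32 is chosen only because it is deterministic across runs;
    # it is an illustrative hash, not the one used in the original work.
    return {zlib.crc32(s.encode("utf-8")) % num_buckets for s in shingle_set}

print(sorted(shingles("abcdeeeefg")))        # single-character shingles
print(sorted(shingles("abcdeeeefg", k=2)))   # two-character shingles
print(sorted(to_buckets(shingles("abcdeeeefg", k=2), num_buckets=5)))
```

Because the result is a set, repeated shingles such as the run of "e" characters collapse to a single element, which matches the note about sets above.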

[Feature Matrix]

Assume that there are K buckets. After shingling each string and mapping its shingles to buckets, we obtain a matrix, which we call the feature matrix. Its rows are the bucket hash values and its columns are the strings. Each element is 0 or 1 and indicates whether the string has a shingle that maps to the corresponding bucket. For example:

 


          String 1   String 2   String 3   String 4
Bucket 1  0          1          1          1
Bucket 2  1          0          0          1
Bucket 3  0          0          0          0
Bucket 4  1          0          1          0
Bucket 5  0          1          0          1

 

In the matrix, 1 indicates that the shingle set of the corresponding column string has elements that can be mapped to the corresponding bucket. 0 indicates that no element is mapped to the corresponding bucket.
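As a small sketch of how such a matrix could be built, the snippet below takes one set of bucket numbers per string and produces the 0/1 rows. The function name feature_matrix and the bucket sets (read off the columns of the example table) are assumptions for illustration.

```python
def feature_matrix(bucket_sets, num_buckets):
    """Rows are buckets 1..num_buckets, columns are strings, entries are 0 or 1."""
    return [[1 if bucket in s else 0 for s in bucket_sets]
            for bucket in range(1, num_buckets + 1)]

# One set of bucket numbers per string, matching the example table.
bucket_sets = [{2, 4}, {1, 5}, {1, 4}, {1, 2, 5}]
matrix = feature_matrix(bucket_sets, num_buckets=5)
for row in matrix:
    print(row)
```

In practice the matrix is usually very sparse, so storing only the bucket sets themselves is often enough; the dense form here is just for readability.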

According to the book Big Data: Internet Large-Scale Data Mining and Distributed Processing, the similarity between strings can be measured by the similarity of the sets of buckets they map to. For example, the similarity between string 1 and string 2 can be expressed as the similarity between the columns (0 1 0 1 0) and (1 0 0 0 1).

[Permutation]

However, the shingle set is generally very large. Even if each shingle is hashed to four bytes (the bucket in the table above), the shingle sets of all strings may well not fit in memory. How can we avoid such large data volumes? It would be better to map each column of the feature matrix to a much smaller signature and use the signature in place of the column.

To compute the minimum hash (minhash) of the set represented by each column of the feature matrix, first choose a permutation of the rows. The minhash value of a column is the row label of the first row, in permuted order, in which the column has a 1. For example, shuffle the bucket order of the table above:


          String 1   String 2   String 3   String 4
Bucket 2  1          0          0          1
Bucket 3  0          0          0          0
Bucket 1  0          1          1          1
Bucket 4  1          0          1          0
Bucket 5  0          1          0          1

 

For string 1, a 1 already appears in the first row, bucket 2, so the hash signature h of string 1 is the hash value of bucket 2, that is, h(string 1) = bucket 2 hash. Similarly, h(string 2) = bucket 1 hash, h(string 3) = bucket 1 hash, and h(string 4) = bucket 2 hash.
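The lookup described above can be sketched as follows. The function name minhash_by_permutation is hypothetical; the matrix and the row order (2, 3, 1, 4, 5) are taken from the two example tables.

```python
def minhash_by_permutation(matrix, order):
    """Return, per column, the label of the first row in `order` that holds a 1.

    Rows of `matrix` are labelled 1..len(matrix); `order` lists those labels
    in permuted order. None is returned for a column that is all zeros.
    """
    signature = []
    for c in range(len(matrix[0])):
        first = next((label for label in order if matrix[label - 1][c] == 1), None)
        signature.append(first)
    return signature

# Feature matrix from the example, rows in the original order bucket 1..5.
matrix = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
print(minhash_by_permutation(matrix, order=[2, 3, 1, 4, 5]))  # [2, 1, 1, 2]
```

The printed list gives the bucket label for strings 1 to 4 in turn, agreeing with the values worked out by hand.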

[Calculation of the minimum hash signature matrix]

If we choose n permutations of the rows of the feature matrix, they define n minimum hash functions h1, h2, ..., hn. In practice each permutation is simulated by a hash function on the row number. Applying these functions to the columns of the feature matrix yields the minimum hash signature matrix.

Let SIG be the final minimum hash signature matrix, and let SIG(i, c) be the element for the i-th hash function in column c. Initially, every element is set to infinity. Then process each row r as follows:

(1) Compute h1(r), h2(r), ..., hn(r).

(2) Process each column c as follows:

(a) If column c has 0 in row r, do nothing.

(b) If column c has 1 in row r, then for each i = 1, 2, ..., n set SIG(i, c) to the minimum of the current SIG(i, c) and hi(r).
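The row-by-row procedure above can be sketched as follows. The function name signature_matrix and the two toy hash functions (r + 1) mod 5 and (3r + 1) mod 5 are assumptions chosen for illustration; rows are numbered from 0 here, so bucket 1 is row 0.

```python
import math

def signature_matrix(matrix, hash_funcs):
    """Build the minhash signature matrix with one row per hash function."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    # Every entry starts at infinity.
    sig = [[math.inf] * num_cols for _ in hash_funcs]
    for r in range(num_rows):
        hashes = [h(r) for h in hash_funcs]        # step (1)
        for c in range(num_cols):                  # step (2)
            if matrix[r][c] == 1:                  # case (b); case (a) does nothing
                for i, value in enumerate(hashes):
                    sig[i][c] = min(sig[i][c], value)
    return sig

matrix = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
# Two toy hash functions on the row number (illustrative only).
hash_funcs = [lambda r: (r + 1) % 5, lambda r: (3 * r + 1) % 5]
print(signature_matrix(matrix, hash_funcs))  # [[2, 0, 1, 0], [0, 1, 0, 1]]
```

With only two hash functions the signatures are far too short to be useful; real use would need many more, but the small case makes the result easy to check by hand.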

For easier computation, we can also work column by column. For each column of the feature matrix, apply the hash function hi to every row in which the column has a 1 and take the minimum; this is the signature of the column under hi. Applying all n hash functions gives the signature vector of the column.

Finally, we obtain a signature matrix of n rows, which is much smaller than the feature matrix.

The similarity of two strings can now be computed from the similarity of their signature vectors. How do we measure the similarity of the vectors? We use Jaccard similarity.
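A minimal sketch of this column-wise shortcut is shown below. It assumes each string is already stored as the set of row numbers (bucket numbers) where its column has a 1; the name column_signature and the two toy hash functions are hypothetical.

```python
import math

def column_signature(row_ids, hash_funcs):
    """Signature vector of one column: the minimum of each hash over its rows."""
    # default=math.inf keeps the "infinity" convention for an empty column.
    return [min((h(r) for r in row_ids), default=math.inf) for h in hash_funcs]

# Two toy hash functions on the row number (illustrative only).
hash_funcs = [lambda r: (r + 1) % 5, lambda r: (3 * r + 1) % 5]

# A column with 1s in rows 1 and 3 (rows numbered from 0).
print(column_signature({1, 3}, hash_funcs))  # [2, 0]
```

This form never touches the rows that hold a 0, which is convenient when the feature matrix is sparse and is stored as sets rather than as a dense table.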

[Jaccard similarity]

Jaccard similarity measures the similarity between two sets: it is the ratio of the size of their intersection to the size of their union.

Using Jaccard similarity, we can compute the similarity between the signature vectors of two strings. However, if the number of strings is very large, comparing all pairs of signature vectors is time-consuming. In practice we usually only need the most similar pairs, or the pairs whose similarity exceeds a certain threshold. If we can first find these candidate pairs and compute the Jaccard similarity only for them, the computing workload drops greatly.
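The definition translates directly into code. The sketch below also includes the signature comparison used in the minhash setting, where the fraction of positions in which two signature vectors agree serves as an estimate of the Jaccard similarity of the underlying sets. The function names, and the convention of returning 1.0 for two empty sets, are choices made here for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

def signature_agreement(sig_a, sig_b):
    """Fraction of positions where two equal-length signature vectors agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

print(jaccard({2, 4}, {1, 4}))              # one shared bucket out of three
print(signature_agreement([2, 0], [1, 0]))  # 0.5
```

The estimate from signatures becomes more reliable as the number of hash functions grows; with very short signatures, as here, it can be far from the exact value.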

[Banding strategy]

Clearly, if two strings are similar, their signature vectors should also be similar, and they are very likely to be identical within some local range. If that local range is hashed, the two strings map to the same bucket.

Suppose we divide the signature matrix into several bands, each consisting of a number of consecutive rows. If two strings are similar, we expect that in at least one band their signature vectors are identical, so they are mapped to the same bucket for that band.

For each band, we use a hash function with a large number of buckets and compute the hash value of each column's portion of the band. Columns with the same hash value are mapped to the same bucket, and columns that share a bucket form candidate pairs.

Finally, compute the Jaccard similarity only for the candidate pairs.
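A minimal sketch of the banding step is shown below. The function name candidate_pairs and the parameter rows_per_band are hypothetical. For simplicity the band contents are used directly as dictionary keys, so only columns that are identical within a band share a bucket; a fixed-size bucket array would additionally allow accidental collisions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, rows_per_band):
    """Return pairs of column indices that are identical in at least one band."""
    num_hashes, num_cols = len(sig), len(sig[0])
    candidates = set()
    for start in range(0, num_hashes, rows_per_band):
        end = min(start + rows_per_band, num_hashes)
        buckets = defaultdict(list)
        for c in range(num_cols):
            band = tuple(sig[i][c] for i in range(start, end))
            buckets[band].append(c)  # same band content -> same bucket
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# A tiny signature matrix: 2 hash functions (rows) by 4 strings (columns).
sig = [[2, 0, 1, 0],
       [0, 1, 0, 1]]
print(sorted(candidate_pairs(sig, rows_per_band=2)))  # [(1, 3)]
print(sorted(candidate_pairs(sig, rows_per_band=1)))  # [(0, 2), (1, 3)]
```

Fewer rows per band makes it easier for two columns to match somewhere, which yields more candidate pairs; more rows per band makes the filter stricter.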

[Notes]

1. The family of minimum hash functions is the key to performance. The hash functions must be independent of each other. No hash function should be consistently larger or smaller than another for the same input; the more scrambled and uncorrelated they are, the better.

2. The value range of the minimum hash functions should not be too small; otherwise collisions occur easily.

3. The value range of the hash function used for banding should not be too small either. If it is, collisions occur easily, which produces too many candidate pairs to check and slows things down.

4. Use large prime numbers when constructing hash functions. They are quite useful.
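One common way to follow these notes is a family of the form h(x) = (a * x + b) mod p with a large prime p and randomly drawn a and b. The sketch below is only an assumption about how such a family could look, not the construction used in the original work; the name make_hash_family, the seed parameter, and the choice of the prime 2^31 - 1 are all illustrative.

```python
import random

def make_hash_family(n, prime=2_147_483_647, seed=0):
    """Return n functions h(x) = (a * x + b) % prime with random a and b."""
    rng = random.Random(seed)  # fixed seed so the family is reproducible
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]

hash_funcs = make_hash_family(3)
print([h(42) for h in hash_funcs])
```

The large modulus keeps the value range wide, in line with note 2, and drawing a and b independently for each function is a simple way to avoid one function being a shifted copy of another.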

Note: The principles and formulations are taken from Chapter 3 of Big Data: Internet Large-Scale Data Mining and Distributed Processing.

 
