Similar data detection algorithms calculate the similarity (in [0, 1], where 1 means identical) or the distance (in [0, ∞), where 0 means identical) between a given pair of data sequences, to measure the degree of similarity between data. Similar data detection has important applications throughout information science, such as clustering and ranking of search-engine results, data clustering and classification, spam detection, paper plagiarism detection, deduplication, delta encoding, and other applications. Precisely because of this importance, it has become a research focus in recent years, and new detection methods are constantly being proposed and evaluated. Among them, Broder's shingling algorithm and Charikar's simhash algorithm are regarded as the best algorithms so far.
For similar data detection, the easiest approach is the Unix diff-style method. Unix diff compares documents pairwise to detect similar files. It is built on the classic LCS (longest common subsequence) algorithm and uses dynamic programming to compute similarity: the LCS of two strings is the longest character sequence contained in both, and its length serves as a measure of the similarity between the two strings. The diff algorithm treats an entire line as one "character" when computing the longest common subsequence, which is much faster than character-level LCS. Even so, this method is inefficient and applicable only to similarity comparison of text files; it cannot be applied directly to binary files. To address this, researchers proposed extracting a set of features from each document, converting the file similarity problem into a set similarity problem; the shingle-based calculation method is one example. Its core idea is to extract a group of feature values for each file and compute similarity over the feature sets, thereby reducing space and computational complexity to improve performance.
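For reference, the character-level LCS length can be computed with the classic dynamic program. Below is a minimal C sketch; the fixed MAXN buffer bound and the lcs_len helper name are illustrative simplifications, not part of diff itself.

#include <string.h>

/* Classic O(m*n) dynamic-programming LCS length for two strings.
 * Inputs longer than MAXN are truncated to keep the sketch simple. */
#define MAXN 1024

static int lcs_len(const char *a, const char *b)
{
    static int dp[MAXN + 1][MAXN + 1];
    int m = (int)strlen(a), n = (int)strlen(b), i, j;

    if (m > MAXN) m = MAXN;
    if (n > MAXN) n = MAXN;

    for (i = 0; i <= m; i++)
        for (j = 0; j <= n; j++) {
            if (i == 0 || j == 0)
                dp[i][j] = 0;                      /* empty prefix */
            else if (a[i - 1] == b[j - 1])
                dp[i][j] = dp[i - 1][j - 1] + 1;   /* characters match */
            else
                dp[i][j] = dp[i - 1][j] > dp[i][j - 1] ?
                           dp[i - 1][j] : dp[i][j - 1];
        }
    return dp[m][n];
}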
From an analysis of the shingle algorithm, the simhash algorithm, and the Bloom filter based algorithm, the general process of similar data detection algorithms is as follows:
(1) the data sequence is divided into a group of shingles (that is, subsequences or data blocks), using fixed-length, variable-length, word, or paragraph (for text files) chunking algorithms;
(2) to reduce space and time complexity, the shingle set can be sampled, for example with the Min-wise, MODm, and MINs methods;
(3) features are extracted from the data file based on the selected shingle set; generally, a hash value is computed for each shingle, and the resulting sequence of hash values serves as the feature value (steps (1) and (3) are sketched in code after this list);
(4) to further reduce space and time complexity, dimensionality reduction can be performed on the file features, as in simhash and the Bloom filter;
(5) the similarity between two data objects is calculated from the file features; the calculation methods include cosine, overlap, dice, jaccard, and Hamming distance.
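To make steps (1) and (3) concrete, here is a minimal C sketch of fixed-length shingling: it slides a w-byte window over the data and records a 32-bit hash per shingle as its feature value. The window size W, the FNV-1a hash, and the shingle_features helper are illustrative choices for this sketch, not a prescribed part of the algorithms discussed here.

#include <stdint.h>
#include <stddef.h>

#define W 8  /* fixed shingle length in bytes (illustrative choice) */

/* FNV-1a: a simple 32-bit hash used here as the shingle fingerprint */
static uint32_t fnv1a(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;
    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Slide a W-byte window over the data and store one hash per shingle.
 * Returns the number of shingles written to feat (at most max). */
static size_t shingle_features(const unsigned char *data, size_t len,
                               uint32_t *feat, size_t max)
{
    size_t i, n = 0;
    for (i = 0; i + W <= len && n < max; i++)
        feat[n++] = fnv1a(data + i, W);
    return n;
}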
Shingle Algorithm
The core idea of the shingle algorithm is to convert the file similarity problem into a set similarity problem. The main set similarity measures are resemblance and containment, defined as follows.
             |Shingle(F1, w) ∩ Shingle(F2, w)|
Rw(F1, F2) = ---------------------------------
             |Shingle(F1, w) ∪ Shingle(F2, w)|

             |Shingle(F1, w) ∩ Shingle(F2, w)|
Cw(F1, F2) = ---------------------------------
                     |Shingle(F1, w)|
When the number of shingles is large, computing similarity over all of them incurs heavy system overhead in both memory and CPU. In this case the shingle set can be sampled to reduce space and time complexity, although the limited sample coverage reduces similarity accuracy. There are three main shingle sampling methods: Min-wise, MODm, and MINs. All three first map each w-length shingle to an integer through a common mapping (hash) function. Min-wise obtains the sample set by random, min-wise independent permutation sampling; MODm selects all shingles whose hash values are congruent to 0 modulo m; MINs selects the s smallest elements to form the sample set. In addition, a shingle's hash value can be used in place of the shingle itself in the similarity calculation, saving some computing overhead.
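As an illustration of the MINs idea, the sketch below (assuming the 32-bit shingle hashes from the earlier sketch) sorts the hash values, keeps the s smallest distinct ones as the sample set, and estimates resemblance as the Jaccard coefficient of two sorted sample sets. The mins_sample and resemblance helpers are hypothetical names introduced for this sketch.

#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* MINs sampling: sort the shingle hashes and keep the s smallest
 * distinct values in place, reducing the file to at most s features. */
static size_t mins_sample(uint32_t *feat, size_t n, size_t s)
{
    size_t i, m = 0;
    qsort(feat, n, sizeof(uint32_t), cmp_u32);
    for (i = 0; i < n && m < s; i++)
        if (m == 0 || feat[i] != feat[m - 1])
            feat[m++] = feat[i];
    return m;  /* number of sampled features kept */
}

/* Resemblance estimate over two sorted sample sets: |A ∩ B| / |A ∪ B| */
static double resemblance(const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb)
{
    size_t i = 0, j = 0, common = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j]) { common++; i++; j++; }
        else if (a[i] < b[j]) i++;
        else j++;
    }
    return (double)common / (double)(na + nb - common);
}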
Simhash Algorithm
The shingle algorithm has high space and time complexity and is difficult to apply to the similarity join problem on large datasets. The core idea of Charikar's simhash algorithm is to represent a file's feature value with a single b-bit hash and then measure similarity by the Hamming distance between simhash values, where the Hamming distance is the number of bit positions at which two binary sequences differ. The simhash calculation proceeds as follows:
(1) initialize a b-dimensional vector V to 0, and a b-bit binary number S to 0;
(2) compute a b-bit signature h for each shingle using a hash function (such as MD5 or SHA-1); for i = 1 to b, if the i-th bit of h is 1, add the shingle's feature weight to the i-th element of V, otherwise subtract the feature weight from it;
(3) if the i-th element of V is greater than 0, set the i-th bit of S to 1, otherwise to 0;
(4) output S as the simhash (a code sketch of these steps follows).
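These steps translate almost directly into code. The following is a minimal sketch of a 32-bit simhash with unit weight for every shingle, together with the Hamming-distance measure; production implementations typically use 64 or 128 bits and per-feature weights, as discussed below.

#include <stdint.h>
#include <stddef.h>

#define B 32  /* simhash width in bits (64 or 128 in practice) */

/* Steps (1)-(4): accumulate per-bit votes, then take the sign of each. */
static uint32_t simhash(const uint32_t *feat, size_t n)
{
    int v[B] = {0};                    /* step (1): b-dim vector, all 0 */
    uint32_t s = 0;
    size_t i;
    int j;

    for (i = 0; i < n; i++)            /* step (2): one vote per shingle */
        for (j = 0; j < B; j++)
            v[j] += ((feat[i] >> j) & 1) ? 1 : -1;  /* unit weight */

    for (j = 0; j < B; j++)            /* step (3): sign of each element */
        if (v[j] > 0)
            s |= (uint32_t)1 << j;
    return s;                          /* step (4): S is the simhash */
}

/* Hamming distance: number of bit positions where the values differ */
static int hamming(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int d = 0;
    while (x) { x &= x - 1; d++; }     /* clear lowest set bit */
    return d;
}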
Compared with traditional hash functions, simhash has a notable property: the more similar two files are, the more similar their simhash values, that is, the smaller their Hamming distance. Since simhash uses only a b-bit hash value to represent file features, it saves a great deal of storage overhead, and since Hamming distance is simple and efficient to compute, using it to measure similarity greatly reduces computational complexity. In short, the simhash algorithm resolves the high space and time complexity of the shingle algorithm by reducing the dimensionality of file features. However, the accuracy of the simhash algorithm is also diminished, and it depends on the number of bits b in the simhash: the larger b is, the higher the accuracy.
Bloom Filter Algorithm
Similar to the simhash algorithm, the core idea of the Bloom filter algorithm is to reduce the dimensionality of file features, representing the feature value with a Bloom filter data structure. A Bloom filter is a space-efficient data structure consisting of a bit array and a set of hash mapping functions. It can be used to test whether an element is in a set; its advantage is space efficiency and query time far beyond ordinary algorithms, and its disadvantages are a certain false positive rate and the difficulty of deleting elements. Using a Bloom filter for similar data detection avoids the high computation and storage overhead that the shingle algorithm incurs by intersecting feature sets, striking a balance between performance and similarity matching accuracy. The Bloom filter is constructed as follows:
(1) construct an m-bit Bloom filter data structure BF, with all bits initially set to 0;
(2) select two hash functions, hash1 and hash2, as the mapping functions;
(3) apply hash1 and hash2 to each shingle and set the corresponding bit positions in BF to 1;
(4) output BF as the file's feature value (a construction sketch follows).
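A minimal construction sketch, again assuming the shingle hash values as input: the filter size M, the bf struct, and the trick of deriving the two mapping functions from the high and low halves of one 32-bit hash are all illustrative choices for this sketch, independent of the bloom/bloom_check API used in the similarity function later in this article.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define M 4096  /* filter size in bits (illustrative; sized to the data) */

struct bf {                     /* hypothetical bit-array Bloom filter */
    unsigned char bits[M / 8];
};

static void bf_set(struct bf *f, uint32_t pos)
{
    pos %= M;
    f->bits[pos / 8] |= (unsigned char)(1u << (pos % 8));
}

/* Steps (1)-(4): clear the filter, derive two mapping functions from
 * each shingle hash (high and low 16 bits, a common double-hashing
 * shortcut) and set the corresponding bits; the filled filter is the
 * file's feature value. */
static void bf_build(struct bf *f, const uint32_t *feat, size_t n)
{
    size_t i;
    memset(f->bits, 0, sizeof(f->bits));     /* step (1): all bits 0 */
    for (i = 0; i < n; i++) {
        bf_set(f, feat[i] & 0xffff);         /* hash1: low 16 bits  */
        bf_set(f, feat[i] >> 16);            /* hash2: high 16 bits */
    }
}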
In this way, the similarity calculation of two files becomes a similarity calculation over two Bloom filters: the more similar two files are, the more 1 bits their Bloom filters share. Because the Bloom filter has a bounded false positive rate, the accuracy of the similarity algorithm depends on the size of the Bloom filter: the larger the filter, the higher the accuracy, and the larger the storage consumption. Bloom filter similarity can be measured with Hamming distance, or with cosine, overlap, dice, and jaccard. Hamming distance was defined above; the formulas for the other four measures are as follows.
                       dot(x, y)
cosine_sim(x, y)  = -----------------
                    sqrt(|x| · |y|)

                       dot(x, y)
overlap_sim(x, y) = -----------------
                     min(|x|, |y|)

                     2 · dot(x, y)
dice_sim(x, y)    = -----------------
                       |x| + |y|

                          dot(x, y)
jaccard_sim(x, y) = ---------------------
                    |x| + |y| - dot(x, y)
where dot(x, y) = Σ x[i] · y[i], which equals the number of bit positions set to 1 in both Bloom filters at the same time, and |x| is the number of 1 bits in Bloom filter x. A similarity calculation function is as follows:
static double bloom_sim(bloom *bloom1, bloom *bloom2)
{
    int i, r1, r2;
    int c1 = 0, c2 = 0, comm = 0;
    double sim;

    /* count bits set in each filter (c1, c2) and in both (comm);
     * as used here, bloom_check tests whether bit i of a filter is set */
    for (i = 0; i < bloom_array_sz; i++) {
        r1 = bloom_check(bloom1, 1, i);
        r2 = bloom_check(bloom2, 1, i);
        if (r1 && r2) {
            comm++;
            c1++;
            c2++;
        } else {
            if (r1)
                c1++;
            if (r2)
                c2++;
        }
    }

    /* similarity measures */
    /* sim = comm / (sqrt(c1) * sqrt(c2)); */   /* cosine  */
    /* sim = comm * 1.0 / (c1 + c2 - comm); */  /* jaccard */
    /* sim = comm * 2.0 / (c1 + c2); */         /* dice    */
    sim = comm * 1.0 / (c1 < c2 ? c1 : c2);     /* overlap */

    return sim;
}
Comparison of the Three Algorithms
The shingle algorithm has high space and computational complexity but also high similarity accuracy, making it suitable for applications with small data volumes and high precision requirements. The simhash and Bloom filter algorithms are superior to the shingle algorithm in space consumption and computational complexity, but their accuracy is somewhat lower and depends on the length of the simhash and the size of the Bloom filter. A simhash length of 64 or 128 bits basically meets the needs of most applications, and the number of bits can be increased as needed. The Bloom filter must be longer than the simhash; its length can be estimated as twice the maximum number of shingles, and its accuracy is better than that of simhash. Due to hash collisions, the simhash and Bloom filter algorithms can misjudge, that is, dissimilar files may be identified as similar. To sum up, for storage consumption of file feature values, shingle > bloom filter > simhash; for similarity calculation accuracy, shingle > bloom filter > simhash. The Bloom filter algorithm is often used to detect similar data, but similarity calculation over massive datasets often uses the simhash algorithm, which has a great advantage in computing performance and is better suited to MapReduce computing models.