Data Similarity Detection Algorithm


1. Introduction
The article "Data Synchronization Algorithm Research" describes how to synchronize data efficiently on the network. The premise is that files A and B are very similar, that is, there is a large amount of identical data between the two. If the similarity between the two files is very low, although this method can still work normally, the Data Synchronization performance will not be improved or even decreased. This results in consumption of some metadata and network communication, which is especially evident when the two files are completely unrelated. Therefore, before data synchronization, you must calculate the similarity between the seed file and the target file. If the similarity is greater than the specified threshold (usually greater than 50%), the data synchronization algorithm is applied, otherwise, transfer the file. In this way, the data synchronization algorithm has better adaptability and can perform high-performance data synchronization when data has different similarity. In addition, based on data similarity detection, data encoding (such as Delta encoding) can be performed for highly similar data, and data compression can be performed by one file to another file, this is a deduplication Technology Based on similar data Detection and encoding.
2. Similarity Calculation
UNIX diff detects similar files by comparing documents line by line. It uses the classic LCS (longest common subsequence) algorithm, computed with dynamic programming: the LCS is the longest sequence of characters that appears, not necessarily contiguously, in both strings, and its length serves as the measure of similarity between them. For example, the LCS of "ABCBDAB" and "BDCABA" is "BCBA", of length 4. The diff algorithm treats an entire line as one "character" when computing the LCS, which is much faster than character-level LCS. Even so, this method is inefficient for large inputs, applies only to similarity comparison of text files, and is not directly applicable to binary files.

Currently, the common practice is to convert file similarity into set similarity, for example with shingle-based or bloom-filter-based calculation; both methods apply to data files of any format. The core idea is to extract a set of feature values from each file and compute similarity between the feature sets, which reduces computational complexity and improves performance. The shingle method computes similarity from the intersection of feature-value sets, which incurs high computation and space overhead. The bloom filter technique is more promising in terms of computing overhead and matching accuracy. Here, the set elements are the fingerprint values of the data blocks into which a file is split by a content-defined chunking (CDC) algorithm. Similarity is defined as follows:
              |fingerprints(F1) ∩ fingerprints(F2)|
SIM(F1, F2) = ---------------------------------------   (Formula 1)
              |fingerprints(F1) ∪ fingerprints(F2)|

Another method is to chunk the binary file, represent each data block by its fingerprint, and map each fingerprint to a "character"; the LCS algorithm is then used to find the longest common subsequence of the two fingerprint sequences and compute the similarity from it. The similarity is defined as follows:
              2 * length(LCS(fingerprints(F1), fingerprints(F2)))
SIM(F1, F2) = ----------------------------------------------------   (Formula 2)
              length(fingerprints(F1)) + length(fingerprints(F2))

Both of the preceding similarity algorithms rely on data chunking, and the data blocks may be of fixed or variable length. To improve the accuracy of the similarity calculation, each data block can be weighted by its length, so that large shared blocks contribute proportionally more to the score; the sketch below illustrates this weighting.
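To make the weighting concrete, here is a minimal sketch of a length-weighted Formula 1, assuming each fingerprint is a NUL-terminated MD5 hex string paired with its block length. All names are illustrative, and duplicate blocks within one file are ignored for simplicity:

#include <stdint.h>
#include <string.h>

typedef struct {
    char     md5[33];   /* hex MD5 fingerprint of one block */
    uint32_t len;       /* block length in bytes */
} block_t;

/* linear membership test; fine for a sketch */
static int in_set(const block_t *set, int n, const char *md5)
{
    for (int i = 0; i < n; i++)
        if (strcmp(set[i].md5, md5) == 0)
            return 1;
    return 0;
}

/* Length-weighted Formula 1: each block contributes its byte length
 * instead of a flat count of 1. */
float weighted_similarity(const block_t *f1, int n, const block_t *f2, int m)
{
    uint64_t inter = 0, uni = 0;
    for (int i = 0; i < n; i++) {
        uni += f1[i].len;
        if (in_set(f2, m, f1[i].md5))
            inter += f1[i].len;  /* block present in both files */
    }
    for (int j = 0; j < m; j++)
        uni += f2[j].len;
    uni -= inter;                /* |F1| + |F2| - |F1 ∩ F2| = |F1 ∪ F2| */
    return uni ? (float)inter / (float)uni : 0.0f;
}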

3. Bloom Filter Algorithm
The file similarity calculation process is as follows:
(1) Use the CDC algorithm to split each file into data blocks and calculate an MD5 fingerprint for each block;
(2) Compute the intersection and union of the two fingerprint sets using a hashtable;
(3) Calculate the file similarity according to Formula 1, taking duplicate data blocks and block lengths into account to improve accuracy.
For details, see the file_chunk, chunk_file_process, and similarity_detect functions in the bsim source code in the appendix; a distilled sketch of steps (2) and (3) follows.
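The core of steps (2) and (3), distilled from the appendix: each distinct fingerprint gets one hash entry recording how often the block occurs in each file and the block's length, and a single pass over all entries yields the weighted similarity. The hash_entry layout follows the appendix; the table iteration is written as a plain array walk here for brevity:

#include <stdint.h>

typedef struct {
    uint32_t nr1;   /* occurrences of this block in file 1 */
    uint32_t nr2;   /* occurrences of this block in file 2 */
    uint32_t len;   /* block length in bytes */
} hash_entry;

#define MIN(x, y) (((x) < (y)) ? (x) : (y))

/* entries[]: all distinct fingerprints collected from both files */
float block_similarity(const hash_entry *entries, int count)
{
    uint64_t uni = 0, inter = 0;
    for (int i = 0; i < count; i++) {
        uni   += (uint64_t)entries[i].len * (entries[i].nr1 + entries[i].nr2);
        inter += (uint64_t)entries[i].len * MIN(entries[i].nr1, entries[i].nr2);
    }
    /* 2 * |intersection| / (|F1| + |F2|), weighted by block length,
     * matching the appendix's similarity_detect for the non-LCS case */
    return uni ? 2.0f * (float)inter / (float)uni : 0.0f;
}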

4. LCS Algorithm
The file similarity calculation process is as follows:
(1) Use the CDC algorithm to split each file into data blocks and calculate an MD5 fingerprint for each block;
(2) Map each MD5 fingerprint string to a "character", so that each file becomes a "string" of fingerprints;
(3) Use the LCS algorithm to find the longest common subsequence and calculate its weighted length;
(4) Calculate the file similarity according to Formula 2, taking duplicate data blocks and block lengths into account to improve accuracy.
For details, see the file_chunk, chunk_file_process, LCS, and similarity_detect functions in the bsim source code in the appendix; a minimal, unweighted version of the dynamic program appears below.
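For reference, here is a compact, unweighted version of the LCS dynamic program over two fingerprint sequences, where each a[i] or b[j] is one MD5 string treated as a single "character". This is a sketch only; the appendix version additionally records a backtracking matrix and weights each matched block by its length:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

uint32_t lcs_len(char **a, int n, char **b, int m)
{
    /* (n+1) x (m+1) score table, row-major, zero-initialized */
    uint32_t *s = calloc((size_t)(n + 1) * (m + 1), sizeof(uint32_t));
    if (s == NULL)
        return 0;
#define S(i, j) s[(size_t)(i) * (m + 1) + (j)]
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            if (strcmp(a[i - 1], b[j - 1]) == 0)
                S(i, j) = S(i - 1, j - 1) + 1;   /* fingerprints match */
            else
                S(i, j) = (S(i - 1, j) >= S(i, j - 1)) ? S(i - 1, j) : S(i, j - 1);
        }
    uint32_t len = S(n, m);                      /* LCS length in blocks */
#undef S
    free(s);
    return len;
}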

5. Algorithm Analysis and Comparison
Both algorithms split the files into blocks; suppose F1 is split into m blocks and F2 into n blocks. The bloom filter algorithm ignores block ordering, so its similarity measure is less accurate than that of the LCS algorithm, but its time and space complexity are O(m + n). The LCS algorithm, by contrast, takes block order into account and measures similarity more accurately, but its O(m * n) time and space complexity greatly limits the scale at which it can be applied. In summary, the bloom filter algorithm trades some accuracy for much lower computational cost, giving it very good performance and applicability, while LCS is suited to precise similarity calculation on relatively small files, roughly up to 50 MB. For deduplication and network data synchronization, the deduplication effect and performance do not depend on block order, so the similarity computed by the bloom filter algorithm is the better fit and delivers higher performance.

Appendix: bsim.c source code
(For the complete source code, see the deduputil source code.)
/* Copyright (C) 2010 Aigui Liu
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, visit the http://fsf.org website.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include "hashtable.h"
#include "sync.h"

/* backtracking directions for the LCS matrix */
#define NEITHER      0
#define UP           1
#define LEFT         2
#define UP_AND_LEFT  3

#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))

#define MD5_LEN 17

enum {
    FILE1 = 0,
    FILE2
};

enum {
    LCS_NOT = 0,
    LCS_YES
};

typedef struct {
    uint32_t nr1;   /* occurrences of this block in file 1 */
    uint32_t nr2;   /* occurrences of this block in file 2 */
    uint32_t len;   /* block length in bytes */
} hash_entry;

typedef struct {
    char **str;     /* sequence of block fingerprints */
    uint32_t len;   /* number of blocks */
} lcs_entry;

static uint32_t sim_union = 0;
static uint32_t sim_intersect = 0;

static void usage()
{
    fprintf(stderr, "Usage: bsim FILE1 FILE2 CHUNK_ALGO LCS\n");
    fprintf(stderr, "Similarity detect between FILE1 and FILE2 based on block level.\n");
    fprintf(stderr, "CHUNK_ALGO:\n");
    fprintf(stderr, "  FSP - fixed-size partition\n");
    fprintf(stderr, "  CDC - content-defined chunking\n");
    fprintf(stderr, "  SBC - slide block chunking\n");
    fprintf(stderr, "LCS:\n");
    fprintf(stderr, "  LCS_NOT - do not use LCS (longest common subsequence) algorithms\n");
    fprintf(stderr, "  LCS_YES - use LCS algorithms\n");
    fprintf(stderr, "Report bugs to <Aigui.Liu@gmail.com>.\n");
}

static int parse_arg(char *argname)
{
    if (0 == strcmp(argname, "FSP"))
        return CHUNK_FSP;
    else if (0 == strcmp(argname, "CDC"))
        return CHUNK_CDC;
    else if (0 == strcmp(argname, "SBC"))
        return CHUNK_SBC;
    else if (0 == strcmp(argname, "LCS_NOT"))
        return LCS_NOT;
    else if (0 == strcmp(argname, "LCS_YES"))
        return LCS_YES;
    else
        return -1;
}

static char **alloc_2d_array(int row, int col)
{
    int i;
    char *p, **pp;

    p = (char *)malloc(row * col * sizeof(char));
    pp = (char **)malloc(row * sizeof(char *));
    if (p == NULL || pp == NULL)
        return NULL;

    for (i = 0; i < row; i++) {
        pp[i] = p + col * i;
    }
    return pp;
}

static void free_2d_array(char **str)
{
    free(str[0]);
    free(str);
}

/* debug helper: print an MD5 digest in hex */
static void show_md5_hex(unsigned char md5_checksum[16])
{
    int i;
    for (i = 0; i < 16; i++) {
        printf("%02x", md5_checksum[i]);
    }
    printf("\n");
}

static int chunk_file_process(char *chunk_file, hashtable *htab, int which,
                              int sim_algo, lcs_entry *le)
{
    int fd, i, ret = 0;
    ssize_t rwsize;
    chunk_file_header chunk_file_hdr;
    chunk_block_entry chunk_bentry;
    hash_entry *he = NULL;

    /* parse chunk file */
    fd = open(chunk_file, O_RDONLY);
    if (-1 == fd) {
        return -1;
    }

    rwsize = read(fd, &chunk_file_hdr, CHUNK_FILE_HEADER_SZ);
    if (rwsize != CHUNK_FILE_HEADER_SZ) {
        ret = -1;
        goto _CHUNK_FILE_PROCESS_EXIT;
    }

    if (sim_algo == LCS_YES) {
        le->str = alloc_2d_array(chunk_file_hdr.block_nr, MD5_LEN);
        if (le->str == NULL) {
            ret = -1;
            goto _CHUNK_FILE_PROCESS_EXIT;
        }
        le->len = chunk_file_hdr.block_nr;
    }

    for (i = 0; i < chunk_file_hdr.block_nr; i++) {
        rwsize = read(fd, &chunk_bentry, CHUNK_BLOCK_ENTRY_SZ);
        if (rwsize != CHUNK_BLOCK_ENTRY_SZ) {
            ret = -1;
            goto _CHUNK_FILE_PROCESS_EXIT;
        }

        he = (hash_entry *)hash_value((void *)chunk_bentry.md5, htab);
        if (he == NULL) {
            he = (hash_entry *)malloc(sizeof(hash_entry));
            he->nr1 = he->nr2 = 0;
            he->len = chunk_bentry.len;
        }
        (which == FILE1) ? he->nr1++ : he->nr2++;
        /* insert or update hash entry */
        hash_insert((void *)strdup(chunk_bentry.md5), (void *)he, htab);
        if (sim_algo == LCS_YES) {
            memcpy(le->str[i], chunk_bentry.md5, MD5_LEN);
        }
    }

_CHUNK_FILE_PROCESS_EXIT:
    close(fd);
    return ret;
}

uint32_t LCS(char **a, int n, char **b, int m, hashtable *htab)
{
    int **S;        /* score matrix */
    int **R;        /* backtracking matrix */
    int ii;
    int jj;
    int pos;
    uint32_t len = 0;
    hash_entry *he = NULL;

    /* memory allocation */
    S = (int **)malloc((n + 1) * sizeof(int *));
    R = (int **)malloc((n + 1) * sizeof(int *));
    if (S == NULL || R == NULL) {
        perror("malloc for S and R in LCS");
        exit(0);
    }
    for (ii = 0; ii <= n; ++ii) {
        S[ii] = (int *)malloc((m + 1) * sizeof(int));
        R[ii] = (int *)malloc((m + 1) * sizeof(int));
        if (S[ii] == NULL || R[ii] == NULL) {
            perror("malloc for S[ii] and R[ii] in LCS");
            exit(0);
        }
    }

    /* It is important to use <=, not <. The next two for-loops are initialization. */
    for (ii = 0; ii <= n; ++ii) {
        S[ii][0] = 0;
        R[ii][0] = UP;
    }
    for (jj = 0; jj <= m; ++jj) {
        S[0][jj] = 0;
        R[0][jj] = LEFT;
    }

    /* This is the main dynamic programming loop that computes the score and
     * backtracking arrays. */
    for (ii = 1; ii <= n; ++ii) {
        for (jj = 1; jj <= m; ++jj) {
            if (strcmp(a[ii - 1], b[jj - 1]) == 0) {
                S[ii][jj] = S[ii - 1][jj - 1] + 1;
                R[ii][jj] = UP_AND_LEFT;
            } else {
                S[ii][jj] = S[ii - 1][jj - 1] + 0;
                R[ii][jj] = NEITHER;
            }
            if (S[ii - 1][jj] >= S[ii][jj]) {
                S[ii][jj] = S[ii - 1][jj];
                R[ii][jj] = UP;
            }
            if (S[ii][jj - 1] >= S[ii][jj]) {
                S[ii][jj] = S[ii][jj - 1];
                R[ii][jj] = LEFT;
            }
        }
    }

    /* the length of the longest common subsequence is S[n][m] */
    ii = n;
    jj = m;
    pos = S[ii][jj];

    /* trace the backtracking matrix, accumulating matched block lengths */
    while (ii > 0 || jj > 0) {
        if (R[ii][jj] == UP_AND_LEFT) {
            ii--;
            jj--;
            /* lcs[pos--] = a[ii]; */
            he = (hash_entry *)hash_value((void *)a[ii], htab);
            len += ((he == NULL) ? 0 : he->len);
        } else if (R[ii][jj] == UP) {
            ii--;
        } else if (R[ii][jj] == LEFT) {
            jj--;
        }
    }

    for (ii = 0; ii <= n; ++ii) {
        free(S[ii]);
        free(R[ii]);
    }
    free(S);
    free(R);
    return len;
}

int hash_callback(void *key, void *data)
{
    hash_entry *he = (hash_entry *)data;

    sim_union += (he->len * (he->nr1 + he->nr2));
    sim_intersect += (he->len * MIN(he->nr1, he->nr2));
    return 0;
}

static float similarity_detect(hashtable *htab, char **str1, int n,
                               char **str2, int m, int sim_algo)
{
    uint32_t lcs_len = 0;

    hash_for_each_do(htab, hash_callback);
    if (sim_algo == LCS_YES) {
        lcs_len = LCS(str1, n, str2, m, htab);
        return lcs_len * 2.0 / sim_union;
    } else { /* LCS_NOT */
        return sim_intersect * 2.0 / sim_union;
    }
}

int main(int argc, char *argv[])
{
    int chunk_algo = CHUNK_CDC;
    int sim_algo = LCS_NOT;
    char *file1 = NULL;
    char *file2 = NULL;
    lcs_entry le1 = {NULL, 0}, le2 = {NULL, 0};
    char tmpname[NAME_MAX_SZ] = {0};
    char template[] = "deduputil_bsim_XXXXXX";
    hashtable *htab = NULL;
    int ret = 0;

    if (argc < 5) {
        usage();
        return -1;
    }

    /* parse chunk algorithms */
    file1 = argv[1];
    file2 = argv[2];
    chunk_algo = parse_arg(argv[3]);
    sim_algo = parse_arg(argv[4]);
    if (chunk_algo == -1 || sim_algo == -1) {
        usage();
        return -1;
    }

    htab = create_hashtable(HASHTABLE_BUCKET_SZ);
    if (htab == NULL) {
        fprintf(stderr, "create hashtable failed\n");
        return -1;
    }

    /* chunk file1 and file2 into blocks */
    sprintf(tmpname, "/tmp/%s_%d", mktemp(template), getpid());
    ret = file_chunk(file1, tmpname, chunk_algo);
    if (0 != ret) {
        fprintf(stderr, "chunk %s failed\n", file1);
        goto _BENCODE_EXIT;
    }
    ret = chunk_file_process(tmpname, htab, FILE1, sim_algo, &le1);
    if (ret != 0) {
        fprintf(stderr, "parse %s failed\n", file1);
        goto _BENCODE_EXIT;
    }

    ret = file_chunk(file2, tmpname, chunk_algo);
    if (0 != ret) {
        fprintf(stderr, "chunk %s failed\n", file2);
        goto _BENCODE_EXIT;
    }
    ret = chunk_file_process(tmpname, htab, FILE2, sim_algo, &le2);
    if (ret != 0) {
        fprintf(stderr, "parse %s failed\n", file2);
        goto _BENCODE_EXIT;
    }

    fprintf(stderr, "similarity = %.4f\n",
            similarity_detect(htab, le1.str, le1.len, le2.str, le2.len, sim_algo));

_BENCODE_EXIT:
    unlink(tmpname);
    hash_free(htab);
    if (le1.str) free_2d_array(le1.str);
    if (le2.str) free_2d_array(le2.str);
    return ret;
}