File similarity detection: super-features

Source: Internet
Author: User

Content-defined chunking (CDC) splits a file into variable-length chunks based on its content, and the chunks are then checked for duplicates. The technique is widely used in deduplication systems. Later systems also apply delta compression to similar (but not identical) chunks to save further storage, which requires an efficient similarity detection algorithm. The super-feature algorithm described in "WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression" gives good results. The main idea is this: while a window slides across a chunk, the Rabin fingerprint of each window is computed and mapped to a pseudo-random value. If that value is larger than the values of all other windows W in the chunk, it is recorded as feature value feature-i. Several features are obtained this way, each with its own mapping, and a super-feature (SF) is obtained by computing a Rabin fingerprint over a group of features. Each SF covers four feature values.
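The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation from the paper: a simple polynomial rolling hash stands in for the Rabin fingerprint, BLAKE2b stands in for the fingerprint that combines features into a super-feature, and the window size, seed, and all function names are hypothetical.

```python
import hashlib
import random

WINDOW = 48          # sliding-window size in bytes (assumed value)
FEATURES_PER_SF = 4  # from the text: each super-feature covers four features
MASK = (1 << 32) - 1
BASE = 257           # base of the stand-in rolling hash (assumed value)


def window_hashes(data: bytes, window: int = WINDOW):
    """Yield a 32-bit rolling hash for every window position in data.

    Stand-in for a rolling Rabin fingerprint; the value depends only on
    the bytes inside the window.
    """
    if len(data) < window:
        return
    top = pow(BASE, window - 1, 1 << 32)  # weight of the byte leaving the window
    h = 0
    for b in data[:window]:
        h = (h * BASE + b) & MASK
    yield h
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) & MASK  # drop the oldest byte
        h = (h * BASE + data[i]) & MASK          # append the newest byte
        yield h


def super_features(data: bytes, num_sf: int = 2, seed: int = 0) -> list:
    """Return num_sf super-features for one chunk (or a whole small file)."""
    rng = random.Random(seed)  # fixed seed so every file uses the same mappings
    n = num_sf * FEATURES_PER_SF
    # One (multiplier, adder) pair per feature; odd multipliers keep the
    # mapping a permutation of the 32-bit values.
    params = [(rng.randrange(1, 1 << 32, 2), rng.randrange(1 << 32))
              for _ in range(n)]
    features = [0] * n
    for fp in window_hashes(data):
        for i, (m, a) in enumerate(params):
            value = (m * fp + a) & MASK
            if value > features[i]:   # keep the maximum over all windows
                features[i] = value
    result = []
    for s in range(num_sf):
        group = features[s * FEATURES_PER_SF:(s + 1) * FEATURES_PER_SF]
        raw = b"".join(f.to_bytes(4, "big") for f in group)
        digest = hashlib.blake2b(raw, digest_size=8).digest()
        result.append(int.from_bytes(digest, "big"))
    return result
```

Because each feature is a maximum over all windows, a small local edit only changes a feature when it destroys the window holding the maximum or creates a window with a larger value, which is why lightly modified files tend to keep their super-features.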

The following are simple test results for several files. Each file generates two super-feature values (if two files share a super-feature, they can be considered highly similar). In these tests the results were better than simhash, although this has not been verified on a large dataset.
F1, F2, and F3 were created by adding extra bytes to the head, tail, and middle of F, respectively. Both super-feature values are the same for these files: supfeature[0] = 5465959093573163876, supfeature[1] = 7673021043978770954. F4 is a completely different file: supfeature[0] = 2682386775420212619, supfeature[1] = 3509276326591445061.
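The comparison rule can be written down as a short Python sketch. The function names are hypothetical, and matching super-features position by position is an assumption about how the values are compared; the numbers are the ones reported above.

```python
def shared_super_features(a, b):
    """Return the positions at which two super-feature lists agree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


def looks_similar(a, b):
    """Treat two files as similar if at least one super-feature matches."""
    return bool(shared_super_features(a, b))


# Values reported in the test above.
F1 = [5465959093573163876, 7673021043978770954]
F2 = [5465959093573163876, 7673021043978770954]
F4 = [2682386775420212619, 3509276326591445061]

print(looks_similar(F1, F2))  # True: both super-features match
print(looks_similar(F1, F4))  # False: no super-feature matches
```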
Reference:
1. Philip Shilane et al., "WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression".
2. "Some Applications of Rabin's Fingerprinting Method".



