[Dedup util]
Dedup util is a lightweight open-source file packaging tool. Based on block-level deduplication, it can effectively reduce data volume and save storage space. A project has been created on SourceForge, and the source code is updated continuously. The internal layout of a data packet generated by this tool is as follows:
--------------------------------------------------
| Header | unique block data | file metadata |
--------------------------------------------------
A data packet consists of three parts: the header, the unique block data, and the file metadata. The header is a struct that records metadata such as the data block size, the number of unique data blocks, the size of a data block ID, the number of files in the package, and the offset of the file metadata within the package. All unique data blocks are stored immediately after the header; their size and count are given by the header fields. After the data blocks comes the logical file metadata, which consists of multiple entries, one per file, structured as shown below. When unpacking, data blocks are extracted one by one according to each file's metadata to restore the original physical file.
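The header fields described above can be sketched as a C struct. This is only a hedged sketch: the field names and types below are illustrative assumptions, not the actual definitions from the dedup util source.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the packet header; field names and types
   are illustrative guesses, not the actual dedup util definitions. */
struct dedup_header {
    uint32_t block_size;   /* size of a fixed-length data block */
    uint32_t block_num;    /* number of unique data blocks */
    uint32_t blockid_size; /* size of one data-block ID */
    uint32_t file_num;     /* number of files in the package */
    uint64_t metadata_off; /* offset of the file metadata in the package */
};
```

Reading the package would then start by reading this fixed-size struct, from which the positions of the unique block data and the file metadata can be computed.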
Metadata representation of a logical file:
-----------------------------------------------------------------
| Entry header | pathname | entry data | last block data |
-----------------------------------------------------------------
The entry header of a logical file records the pathname length, the number of data blocks, the size of a data block ID, and the size of the last data block. It is followed by the pathname, whose length is given in the entry header. After the pathname comes a list of unique block IDs, which correspond one-to-one to data blocks in the unique block set. Finally, the last data block of the file is stored verbatim: because it is usually smaller than a normal data block and very unlikely to repeat, it is kept separately.
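A hedged sketch of this per-file entry layout as a C struct; the field names are illustrative assumptions, not the actual dedup util definitions.

```c
#include <stdint.h>

/* Hypothetical per-file entry header; names are illustrative
   assumptions, not the actual dedup util definitions. */
struct dedup_entry_header {
    uint32_t path_len;        /* length of the pathname that follows */
    uint32_t block_num;       /* number of whole data blocks in the file */
    uint32_t entry_size;      /* size of one block-ID entry */
    uint32_t last_block_size; /* size of the trailing partial block */
};

/* In the package, each entry would then be laid out as:
   entry header | pathname (path_len bytes) |
   block IDs (block_num * entry_size bytes) |
   last block data (last_block_size bytes) */
```

Unpacking a file amounts to reading this entry, then fetching each referenced block from the unique block area and appending the last block's bytes at the end.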
For more information, see http://blog.csdn.net/liuben/archive/2010/01/09/5166538.aspx
Dedup util is currently in the pre-alpha development stage. It supports packing files, unpacking, appending files, deleting files, and listing the files in a package. Preliminary tests show that the dedup technique can significantly reduce the size of data packets for data with a high repetition rate; even when it is not clear whether the data repeats much, the generated packet is smaller than the one produced by the tar tool.
[Source code]
Project URL: https://sourceforge.net/projects/deduputil
SVN code library URL: https://deduputil.svn.sourceforge.net/svnroot/deduputil
[Compile]
1. Get source code
svn co https://deduputil.svn.sourceforge.net/svnroot/deduputil deduputil
2. Install libz-Dev
apt-get install libz-dev
If apt-get is not available, install the zlib development package another way.
3. Compile and install
./gen.sh
./configure
make
make install
[Command line]
Usage: dedup [OPTION...] [FILE]...
The dedup tool packages files with the deduplication technique.
Examples:
dedup -c foobar.ded foo bar    # create foobar.ded from files foo and bar.
dedup -a foobar.ded foo1 bar1  # append files foo1 and bar1 into foobar.ded.
dedup -r foobar.ded foo1 bar1  # remove files foo1 and bar1 from foobar.ded.
dedup -t foobar.ded            # list all files in foobar.ded.
dedup -x foobar.ded            # extract all files from foobar.ded.
Options:
  -c, --creat      create a new archive
  -x, --extract    extract files from an archive
  -a, --append     append files to an archive
  -r, --remove     remove files from an archive
  -t, --list       list files in an archive
  -z, --compress   filter the archive through zlib compression
  -b, --block      block size for deduplication, default is 4096
  -H, --hashtable  hashtable bucket number, default is 10240
  -d, --directory  change to directory, default is pwd
  -v, --verbose    print verbose messages
  -h, --help       give this help list
[Operating platform]
Dedup util is currently developed and tested only on Linux; it has not been evaluated on other platforms.
[Todo]
1. Data block collision
Although the probability of an MD5 collision is extremely small, it is still possible. A technical safeguard is needed to resolve collisions, so that data safety is guaranteed and users can rest assured.
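One standard safeguard, sketched below, is to treat a digest match only as a candidate and confirm with a byte-wise compare before deduplicating. To keep the example self-contained, `toy_digest()` is a toy stand-in (djb2) for MD5, not dedup util code.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for MD5 (djb2), used only to keep this sketch
   self-contained; a real implementation would use an MD5 routine. */
unsigned long toy_digest(const unsigned char *p, size_t n)
{
    unsigned long h = 5381;
    for (size_t i = 0; i < n; i++)
        h = h * 33 + p[i];
    return h;
}

/* Collision-safe matching: a digest match is only a candidate;
   confirm with a byte-wise compare before reusing the block. */
int blocks_identical(const unsigned char *a, const unsigned char *b, size_t n)
{
    if (toy_digest(a, n) != toy_digest(b, n))
        return 0;                /* digests differ: blocks are distinct */
    return memcmp(a, b, n) == 0; /* digests match: verify the bytes */
}
```

Under this toy digest the blocks "abcd" and "abdC" actually collide, yet `blocks_identical` still reports them as distinct, which is exactly the property needed to make deduplication safe against hash collisions. The cost is one extra read and compare per candidate match.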
2. Variable-length data blocks
The current implementation uses fixed-length data blocks, which is technically simpler. Variable-length data blocks may achieve a higher compression ratio.
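As a rough illustration of the variable-length idea, content-defined chunking picks block boundaries from the data itself rather than at fixed offsets. The function and parameters below are simplified placeholders, not dedup util code; a real implementation would use a windowed rolling hash (e.g. a Rabin fingerprint) so that boundaries resynchronize after an insertion or deletion.

```c
#include <stddef.h>

/* Illustrative parameters, not dedup util defaults. */
enum { MIN_BLOCK = 64, MAX_BLOCK = 4096, MASK = 0x3F };

/* Return the length of the next content-defined chunk: cut where a
   simple running hash of the data hits a chosen bit pattern, bounded
   by minimum and maximum block sizes. NOTE: this hash is not windowed,
   so it is only a sketch of the boundary-selection idea. */
size_t next_cut(const unsigned char *data, size_t len)
{
    unsigned long h = 0;
    for (size_t i = 0; i < len; i++) {
        h = (h << 1) + data[i];
        if (i + 1 >= MIN_BLOCK && (h & MASK) == MASK)
            return i + 1; /* content-defined boundary */
        if (i + 1 >= MAX_BLOCK)
            return i + 1; /* enforce a maximum block size */
    }
    return len; /* end of data: emit the remainder */
}
```

A packer would call `next_cut` repeatedly to split a file into chunks, then deduplicate each chunk by digest as in the fixed-length case.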
3. Identification of similar files
If two files differ only slightly, for example by a few bytes inserted at some position, finding the affected data blocks and handling them separately may increase the compression ratio.
[Author]
Liu Aigui, focusing on storage technology, data mining, and distributed computing. Aigui.Liu@gmail.com
2010.06.02