Timespan: 1.22-1.23
Chun-Ho Ng, Mingcao Ma, Tsz-Yeung Wong, Patrick P. C. Lee, John C. S. Lui: Live deduplication storage of Virtual Machine images in an open-source cloud. Middleware 2011: 81-100 (GS: 3)
This paper proposes LiveDFS, a deduplication file system. LiveDFS is built by extending the Linux ext3 file system with an inline deduplication function, and it also exploits spatial locality, prefetching of metadata, and journaling to improve performance and reliability. LiveDFS can serve as the storage layer of an open-source cloud platform (such as OpenStack). Experiments in the paper show that it reduces virtual machine image storage space by at least 40%.
LiveDFS is POSIX-compliant, so it can be seamlessly integrated into a Linux-based open-source cloud platform. It is implemented as a kernel-space driver module, so there is no need to modify or recompile the Linux kernel source code. The authors position this as the "first work that addresses the practical deployment of live deduplication for VM image storage in an open-source cloud".
1. Deduplication (S1)
The definition given in the paper: redundant data blocks are eliminated by storing smaller-size pointers that reference an already stored data block with identical content.
The paper notes that the data blocks of virtual machine images of different versions of the same Linux distribution are highly redundant.
2. Introducing deduplication into virtual machine image storage must take the following requirements into account (S1):
- Impact on the performance of virtual machines, e.g., whether VM startup is slowed down.
- Support for general file system operations such as modification and deletion; existing deduplication techniques designed for backup storage do not support deletion.
- Hardware requirements (in the paper's words, "compatibility with low-cost commodity settings"): the system should run on inexpensive commodity hardware and preserve I/O performance while deduplication is enabled.
3. (S2) introduces the design of LiveDFS. The main idea is relatively simple: as in ext3, each file is assigned an inode, and the inode stores a set of block numbers (i.e., block addresses). If two blocks have identical content, LiveDFS makes both block pointers reference the same on-disk block, thereby achieving deduplication (S2.1).
To realize this design, two problems must be solved: (1) How to determine whether two blocks have the same content? (Fingerprints computed with MD5 or another hash function.) (2) How to support deletion? (Each block keeps a reference count; a block is reclaimed only when its count drops to zero.)
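To make this write/delete path concrete, here is a minimal user-space sketch of the idea in Python. It is not LiveDFS's kernel-module implementation; real block allocation, the on-disk fingerprint store, the in-memory filter, and journaling are all omitted, and the class and method names are illustrative only.

```python
# Hypothetical sketch of fingerprint-based block sharing with reference counts.
import hashlib

BLOCK_SIZE = 4096  # deduplication works at file-system block granularity

class DedupStore:
    def __init__(self):
        self.blocks = {}        # block number -> raw block data
        self.refcount = {}      # block number -> number of pointers to it
        self.fingerprints = {}  # MD5 fingerprint -> block number
        self.next_block = 0

    def write_block(self, data: bytes) -> int:
        """Return a block number for `data`, sharing an existing block if possible."""
        fp = hashlib.md5(data).digest()
        if fp in self.fingerprints:          # identical content already stored:
            blk = self.fingerprints[fp]      # reuse the block, bump its count
            self.refcount[blk] += 1
            return blk
        blk = self.next_block                # otherwise allocate a new block
        self.next_block += 1
        self.blocks[blk] = data
        self.refcount[blk] = 1
        self.fingerprints[fp] = blk
        return blk

    def delete_block(self, blk: int) -> None:
        """Drop one reference; reclaim the block only when no file points to it."""
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            data = self.blocks.pop(blk)
            del self.refcount[blk]
            del self.fingerprints[hashlib.md5(data).digest()]
```

In this sketch, writing the same 4 KB block from two VM images stores it once with a reference count of 2; deleting one image only decrements the count, so the other image remains intact.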
4. (S2.2) details the fingerprint design, which must take performance into account: it is infeasible to keep all fingerprint values in memory.
The paper proposes a fingerprint store and a fingerprint filter: the former holds all fingerprints and resides on disk; the latter is kept in memory and is used to quickly check whether a fingerprint may already exist and where to look for it in the store.
According to the authors' calculation, for 1 TB of data with a 4 KB block size (see the paper for the other parameters), the memory needed by the fingerprint filter stays within 2 GB.
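A rough back-of-the-envelope version of that estimate is sketched below. The per-entry size of the filter is an assumption here purely for illustration; the paper derives the exact parameters.

```python
# Rough estimate of the in-memory fingerprint filter's footprint for 1 TB of data.
TB = 2**40
BLOCK_SIZE = 4 * 2**10                   # 4 KB blocks
num_blocks = TB // BLOCK_SIZE            # 2^28 = 268,435,456 blocks in 1 TB
BYTES_PER_FILTER_ENTRY = 6               # assumed: partial fingerprint + bucket index
filter_bytes = num_blocks * BYTES_PER_FILTER_ENTRY
print(num_blocks, filter_bytes / 2**30)  # ~268M entries, ~1.5 GiB, i.e. within 2 GB
```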
5. (S2.3) introduces prefetching of the fingerprint store to improve performance: since VM image writes exhibit spatial locality, fingerprints of neighbouring blocks are loaded into memory together. (S2.4) introduces journaling, which improves reliability and also helps write performance.
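The prefetching idea can be sketched roughly as follows. The bucket layout, cache size, and lookup interface here are assumptions for illustration; this only captures the locality-driven prefetching, not how LiveDFS actually indexes its on-disk fingerprint store.

```python
# Hypothetical sketch of fingerprint prefetching that exploits spatial locality:
# one disk read pulls in the fingerprints of a whole group of neighbouring blocks,
# so lookups for subsequent nearby writes are served from memory.
from collections import OrderedDict
from typing import Optional

BUCKET_SIZE = 1024     # assumed number of fingerprint entries per on-disk bucket
CACHE_BUCKETS = 4096   # assumed number of buckets kept in memory (LRU)

class PrefetchingFingerprintCache:
    def __init__(self, read_bucket_from_disk):
        self.read_bucket = read_bucket_from_disk  # callback: bucket id -> {fp: blk}
        self.cache = OrderedDict()                # bucket id -> {fp: blk}, LRU order

    def lookup(self, fp: bytes, hint_block: int) -> Optional[int]:
        """Look up a fingerprint; on a cache miss, prefetch the whole bucket of
        fingerprints for blocks near `hint_block`."""
        bucket_id = hint_block // BUCKET_SIZE
        if bucket_id not in self.cache:
            self.cache[bucket_id] = self.read_bucket(bucket_id)
            if len(self.cache) > CACHE_BUCKETS:
                self.cache.popitem(last=False)    # evict the least recently used bucket
        self.cache.move_to_end(bucket_id)
        return self.cache[bucket_id].get(fp)
```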
6. Limitations or assumptions of LiveDFS
- It is deployed on a single storage partition (S2)
- Deduplication is applied only to data stored within the same partition (S2)
- It mainly targets VM image storage (S2)