Talking about Ceph Erasure code

Tags: erasure coding

Contents
Chapter 1: Introduction
1.1 Document Description
1.2 Reference Documents
Chapter 2: Concepts and Principles of Erasure Codes
2.1 Concepts
2.2 Principle
Chapter 3: Introduction to Ceph Erasure Coding
3.1 Uses of Ceph Erasure Coding
3.2 Ceph Erasure Code Library
3.3 Ceph Erasure Code Data Storage
3.3.1 Reading and Writing Encoded Chunks
3.3.2 Interrupted Full Write
3.4 Scope of Use
3.4.1 Cold Data
3.4.2 Inexpensive Multi-Datacenter Storage
Chapter 4: Ceph Erasure Code Examples
4.1 Data Reads and Writes
4.2 Functions Not Supported by Erasure-Coded Pools
4.3 Erasure Code Profiles
4.4 Cache Tier to Compensate for Erasure Code Shortcomings
Chapter 5: Ceph Erasure Coding and Cache Tiering

Chapter 1: Introduction
1.1 Document Description
This document introduces Ceph erasure coding and walks through examples.
1.2 Reference Documents
Ceph official documentation:
http://ceph.com/docs/master/architecture/#erasure-coding
http://ceph.com/docs/master/rados/operations/erasure-code/
http://ceph.com/docs/master/dev/erasure-coded-pool/
Red Hat's Inktank Ceph storage adds erasure coding and tiered deployment:
http://www.searchstorage.com.cn/showcontent_83783.htm
Erasure codes: ensuring data availability after RAID failure:
http://storage.it168.com/a2011/0816/1233/000001233286_1.shtml
Erasure codes in storage systems:
http://www.tuicool.com/articles/v6Bjuq


Chapter 2: Concepts and Principles of Erasure Codes
2.1 Concepts
According to their error-control function, codes can be divided into error-detecting codes, error-correcting codes, and erasure codes.
An error-detecting code can only identify erroneous data; it cannot correct it.
An error-correcting code can both identify and correct erroneous data.
An erasure code can identify and correct erroneous data, and when the errors exceed its correction range it can also discard (erase) the information that cannot be corrected.
2.2 Principle
Take k=3 and m=2, so k+m=5, as an example.
This means:
k is the number of original data disks, and also the number of disks required to recover the data;
m is the number of coding (check) disks, and also the number of disks that are allowed to fail.
The encoding algorithm generates k+m chunks of new data from the k chunks of original data.
The original data can be reconstructed from any k of the k+m chunks,
so up to m data disks can fail without any data being lost.


Chapter 3: Introduction to Ceph Erasure Coding
Ceph erasure coding is Ceph's implementation of erasure codes for its storage pools.
3.1 Uses of Ceph Erasure Coding
The goal is to save space: the same data can be stored using less raw capacity than with replication.
The encoding computation itself is fast, but erasure coding has two drawbacks: it is slower than replication, and it supports only part of the object operations (for example, partial writes are not supported). As described later, these shortcomings can now be compensated for with a cache tier.
3.2 Ceph Erasure Code Library
Ceph's default erasure code library is jerasure, i.e. the Jerasure library.
When an administrator creates an erasure-coded pool, the data-chunk and coding-chunk parameters can be specified.
The Jerasure library is middleware provided by a third party;
it can be found on the web.
The Jerasure library is installed by default when the Ceph environment is installed.
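As a minimal sketch of how the library choice and the chunk parameters come together (the profile name demo-jerasure and the values k=4, m=2 are arbitrary examples, not taken from the original text), a profile that explicitly selects the jerasure plugin could be created and inspected like this:
$ ceph osd erasure-code-profile set demo-jerasure \
     plugin=jerasure \
     technique=reed_sol_van \
     k=4 \
     m=2
$ ceph osd erasure-code-profile get demo-jerasure
Erasure code profiles are covered in more detail in section 4.3.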
3.3 Ceph Erasure Code Data Storage
In an erasure-coded pool, each object is stored as k+m chunks: the object is split into k data chunks and m coding chunks. The size of the erasure-coded pool is defined as k+m; each chunk is stored on one OSD, and the chunk's ordinal is saved as an attribute of the object.
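To see which OSDs hold the chunks of a given object, the object can be mapped to its placement group and acting set. A hedged illustration, using the pool and object names from the example in the next subsection:
$ ceph osd map ecpool NYAN
The output shows the placement group and the ordered list of OSDs that store the k+m chunks; the position in that list corresponds to the chunk's ordinal (shard).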
3.3.1 Reading and Writing Encoded Chunks
For example, create an erasure-coded pool over 5 OSDs (k=3, m=2) that can tolerate the loss of 2 of them (m=2).
The object NYAN has the content ABCDEFGHI.
When NYAN is written to the pool, the erasure code function splits it into 3 data chunks: the 1st is ABC, the 2nd is DEF, the 3rd is GHI. If the length of NYAN were not a multiple of k, it would be padded.
The erasure code function also creates 2 coding chunks: the 4th is YXY and the 5th is GQC.

Each chunk is stored on an OSD. The chunks belonging to the object all carry the same name (NYAN) but are stored on different OSDs. Besides the name, each chunk has an ordinal number that is saved as an attribute of the object (shard_t).
For example, chunk 1 containing ABC is stored on OSD5, and chunk 4 containing YXY is stored on OSD3.
When the object NYAN is read from the erasure-coded pool, the erasure code function reads 3 chunks: chunk 1 (ABC), chunk 3 (GHI) and chunk 4 (YXY), and reconstructs the original object content ABCDEFGHI.
The erasure code function is told that chunk 2 and chunk 5 are missing: chunk 5 cannot be read because OSD4 is down, and chunk 2 cannot be read because OSD2 is too slow.


3.3.2 Interrupted Full Write
In an erasure-coded pool, the primary OSD receives all writes. It is responsible for encoding the payload into k+m chunks and writing them to the other OSDs, and for maintaining an authoritative version of the PG log.
For example, an erasure-coded pool with k=2, m=1 spans 3 OSD nodes: 2 hold the data chunks (k) and 1 holds the coding chunk (m). The placement group is on OSD1, OSD2 and OSD3.
An object is encoded and stored on these OSDs:
chunk D1v1 (data chunk 1, version 1) on OSD1;
chunk D2v1 (data chunk 2, version 1) on OSD2;
chunk C1v1 (coding chunk 1, version 1) on OSD3.
The PG log on each OSD is identical (epoch 1, version 1).


OSD1 is the primary and receives a WRITE FULL request from a client, meaning the whole object is to be rewritten rather than partially overwritten.
Version 2 (v2) of the object is created to replace version 1 (v1).
The primary OSD1 encodes the payload and writes 3 chunks:
chunk D1v2 (data chunk 1, version 2) on OSD1;
chunk D2v2 (data chunk 2, version 2) on OSD2;
chunk C1v2 (coding chunk 1, version 2) on OSD3.
Each chunk is sent to the target OSD, including the primary, which besides storing its own chunk also maintains the authoritative version of the PG log.
When an OSD receives the instruction to write a chunk, it also creates a new PG log entry to reflect it.
For example, as soon as OSD3 stores C1v2, it adds an entry (epoch 1, version 2) to its log.
Because the OSDs work asynchronously, some chunks may still be in flight (such as D2v2) while others have already been written and acknowledged (such as C1v2 and D1v2).


If all goes well, the chunks are acknowledged on every OSD and the last_complete pointer of the log moves from (epoch 1, version 1) to (epoch 1, version 2).


Finally, the files that held the previous version of the chunks can be removed: D1v1 on OSD1, D2v1 on OSD2 and C1v1 on OSD3.


But accidents happen. If OSD1 goes down while D2v2 is still in flight, version 2 of the object is only partially written: OSD3 has one chunk, which is not enough to recover the others. Two chunks are lost, D1v2 and D2v2, and with the erasure coding parameters k=2/m=1 at least 2 chunks must be available to rebuild the third. At this point OSD4 becomes the new primary and finds that the last_complete log entry is (epoch 1, version 1), which becomes the head of the new authoritative log.


The log entry on OSD3 is divergent from the new authoritative log on OSD4: it is discarded and the file containing the C1v2 chunk is deleted. The D1v1 chunk is then rebuilt (using the erasure code decode function during scrubbing) and stored on the new primary OSD4.
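On a live cluster, the PG state and log boundaries described above can be inspected with the standard PG commands; the placement group id 2.4 below is purely hypothetical:
$ ceph pg dump | grep ^2.4
$ ceph pg 2.4 query
The query output includes the PG's info and peering state, which is where fields such as last_complete appear.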


3.4 Scope of Use
3.4.1 Cold Data
1. The pool mainly stores objects larger than 1 GB, such as images and mirrors, and roughly 10% of this data is read about once a month.
2. New objects are added daily, but objects are not modified after they are added.
3. On average the data is read 10,000 times and written once.
4. Create a replicated pool as the cache tier of the erasure-coded pool. When an object has not been accessed for a week it can be demoted (moved from the replicated pool to the erasure-coded pool); the reverse (promotion) is of course also possible. A sketch of the corresponding cache-tier settings follows this list.
5. The erasure-coded pool is designed for cold data and suits slower hardware and rarely accessed data; the replicated pool is designed for fast devices and fast access.
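A minimal sketch of cache-tier settings that would implement the "demote after one week" policy above, assuming the replicated cache pool is the hot-storage pool configured in section 4.4 (the values are illustrative, not from the original text):
$ ceph osd pool set hot-storage hit_set_type bloom
$ ceph osd pool set hot-storage hit_set_count 1
$ ceph osd pool set hot-storage hit_set_period 3600
$ ceph osd pool set hot-storage cache_min_flush_age 604800
$ ceph osd pool set hot-storage cache_min_evict_age 604800
604800 seconds is one week; objects untouched for that long become candidates to be flushed and evicted to the erasure-coded pool.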
3.4.2 Inexpensive Multi-Datacenter Storage
Suppose 10 datacenters are linked by dedicated network links, each with the same amount of storage space but with no power backup and no air cooling.
Create an erasure-coded pool with 6 data chunks (k=6) and 3 coding chunks (m=3), so that 3 nodes can fail simultaneously without losing data; the storage overhead is 50% (9 chunks are stored for every 6 chunks of data).
A replicated pool with the same durability would need 4 replicas, i.e. a cost of 400%.
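A hedged sketch of such a setup, assuming the CRUSH map actually defines datacenter buckets (the profile name multidc, the pool name dcpool and the PG counts are illustrative):
$ ceph osd erasure-code-profile set multidc \
     k=6 \
     m=3 \
     ruleset-failure-domain=datacenter
$ ceph osd pool create dcpool 128 128 erasure multidc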


Chapter 4: Ceph Erasure Code Examples
A Ceph pool ensures that data is not lost even if some OSDs fail (in general, one disk is set up as one OSD). By default the rule type chosen when a pool is created is replicated, i.e. each object is copied to and saved on multiple disks; the other pool rule type is erasure, which saves space.
The simplest erasure-coded pool is equivalent to RAID5 and requires at least 3 nodes, i.e. k=2, m=1, which is the default erasure profile:
$ ceph osd pool create ecpool 18 12 erasure
where 18 is pg_num and 12 is pgp_num (pg_num must be greater than or equal to pgp_num).
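To confirm that the pool was created as an erasure-coded pool, the OSD map can be dumped (the exact output format varies with the Ceph version):
$ ceph osd dump | grep ecpool
The matching line should show the pool marked as erasure together with its pg_num and pgp_num.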
4.1 Data Reads and Writes
Read and write the string ABCDEFGHI:
$ echo ABCDEFGHI | rados --pool ecpool put NYAN -
$ rados --pool ecpool get NYAN -
Read and write the file test.txt:
$ rados -p ecpool put test test.txt
$ rados -p ecpool get test file.txt
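To check that the objects actually landed in the pool, they can be listed and their sizes inspected (output will vary):
$ rados -p ecpool ls
$ rados -p ecpool stat test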
4.2 Functions Not Supported by Erasure-Coded Pools
For example, "partial write" is not supported.
Creating RBD images is not supported:
# rbd create xxx -p ecpool --size 1024
rbd: create error: Operation not supported
librbd: error adding image to directory: Operation not supported
# rbd import securecrt5.rar securecrt5 -p ecpool
rbd: image creation failed
Importing image: 0% complete...failed.
librbd: error adding image to directory: Operation not supported
rbd: import failed: Operation not supported
4.3 Erasure Code Profiles
1. Default profile
The default erasure code profile tolerates the loss of one OSD, which is equivalent in durability to a replicated pool with 2 copies; the erasure-coded pool, however, uses 1.5 TB instead of the 2 TB a replicated pool would need to store 1 TB of data.
$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get default
The output is shown below; this is the smallest possible erasure pool configuration:
directory=/usr/lib/ceph/erasure-code
k=2
m=1
plugin=jerasure
technique=reed_sol_van
2. Adding a profile
It is important to choose the right profile when creating a pool, because a pool's profile cannot be modified after creation; if a different profile is needed, a new pool must be created with the new profile and all objects moved from the old pool to the new one. (A way to check which profile a pool uses is sketched at the end of this section.)
The most important parameters of a profile are k, m and ruleset-failure-domain, because they define the storage overhead and the data durability.
For example, to build an architecture that can sustain the loss of 2 OSDs with a storage overhead of 40% (2 of every 5 chunks stored are coding chunks), create the following profile:
$ ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   ruleset-failure-domain=rack
Note: ruleset-failure-domain can be set to osd, host, chassis, rack, row and other CRUSH bucket types.
ruleset-failure-domain=rack means the CRUSH rule ensures that no 2 chunks are stored in the same rack.
3. Creating an erasure pool from a profile
$ ceph osd pool create ecpool 18 12 erasure myprofile
$ echo ABCDEFGHI | rados --pool ecpool put NYAN -
$ rados --pool ecpool get NYAN -
4. Deleting a profile
$ ceph osd erasure-code-profile rm myprofile
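As noted above, a pool's profile is fixed at creation time. Depending on the Ceph version, the profile a pool uses can be checked like this (a hedged sketch):
$ ceph osd pool get ecpool erasure_code_profile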
4.4 Cache Tier to Compensate for Erasure Code Shortcomings
Compared with replicated pools, erasure-coded pools require more resources and lack some functionality (such as partial writes), so it is recommended to put a cache tier in front of the erasure-coded pool to overcome these limitations. The cache tier not only makes up for the missing functionality of the erasure-coded pool, it also compensates for its lower performance.
This is the erasure coding plus storage tiering combination that Red Hat's ICE (Inktank Ceph Enterprise) is now promoting.
Suppose hot-storage is a fast replicated pool used as the cache. The commands are as follows:
$ ceph osd pool create ecpool 18 12 erasure   (this is our erasure pool, k=2 m=1)
$ ceph osd pool create hot-storage 128        (this is our cache tier, on fast storage)
$ ceph osd tier add ecpool hot-storage
$ ceph osd tier cache-mode hot-storage writeback
$ ceph osd tier set-overlay ecpool hot-storage
The hot-storage pool is used as the tier of ecpool in writeback mode, so reads and writes to ecpool actually go through the hot-storage pool and benefit from its flexibility and speed.
Because erasure-coded pools do not support partial writes, RBD images cannot be created in ecpool directly; but once the cache tier is set up in front of the erasure pool, RBD images can be created in ecpool. (Without the cache tier, image creation fails, as shown earlier.)
$ rbd --pool ecpool create --size <size> myvolume
$ rbd import 1.txt 1.txt -p ecpool
$ rbd ls -p ecpool
Note: operating on ecpool and on hot-storage has the same effect; where the data is actually stored depends on usage: objects not used for a week or more end up in ecpool, while frequently used objects stay in hot-storage.
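To push the cached objects down into the erasure-coded pool without waiting for the age-based policy (for example before maintenance), the cache pool can be flushed and evicted manually; a minimal sketch:
$ rados -p hot-storage cache-flush-evict-all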


Chapter 5: Ceph Erasure Coding and Cache Tiering
Erasure coding and cache tiering are two closely linked functions; both are features that Red Hat has focused on since acquiring Inktank, the company behind Ceph.
Erasure coding increases the usable storage capacity but reduces speed, and cache tiering solves that problem.
The principle structure is as follows:
[Figure: a cache tier (replicated pool) in front of an erasure-coded pool]
