A Ceph tutorial that does not cover CRUSH is incomplete


As we mentioned earlier, Ceph is a distributed storage service with a unified storage architecture. The previous article briefly introduced the basic concepts of Ceph and the components of its infrastructure, the most important being the underlying RADOS layer and its two types of daemons, OSDs and Monitors. We also left a hole in that article: we mentioned CRUSH but never explained it.

Yes, without CRUSH our tutorial would be an incomplete Ceph textbook. In this article we talk about CRUSH without going into its algorithm and implementation details: we walk through Ceph's overall addressing process and take a deeper look at how data flows through Ceph.

The figure shows Ceph's addressing process. You can see that it consists of four main layers: File -> Object -> PG -> OSD. Many readers will ask what these terms, object and PG, actually mean, so let's go through them first.

File: the file we want to store and access. This is what users see and operate on directly.

Object: this is what RADOS, the bottom layer of Ceph, sees; it is the basic storage unit in Ceph. The size of an object is limited by RADOS (usually 2 MB or 4 MB). Just as HDFS abstracts data into blocks, objects exist to make it easier to organize the underlying storage. When a file is too large, it is cut into objects of a uniform size before being stored.

PG (Placement Group): a PG is a logical concept whose purpose is to organize and locate the mapping of objects onto storage. Through PGs, data can be distributed and located more effectively.

OSD (Object Storage Device): as described earlier, this is the service that is actually responsible for storing and accessing data.

PGs and objects have a one-to-many relationship: a PG organizes several objects, but an object can only be mapped to a single PG.

PGs and OSDs have a many-to-many relationship: one PG is mapped to multiple OSDs (at least two, because of the replication mechanism), and each OSD in turn hosts a large number of PGs.
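Here is a tiny illustrative sketch of those two relationships in Python; the PG and OSD ids and the object names are made up for the example.

```python
# Illustrative data only (ids and names are invented):
# - PG -> objects is one-to-many: each object lives in exactly one PG.
# - PG -> OSDs is many-to-many: each PG is replicated to several OSDs,
#   and each OSD hosts many PGs.
pg_to_objects = {
    "1.2a": ["obj-001", "obj-042", "obj-317"],
    "1.2b": ["obj-007", "obj-100"],
}
pg_to_osds = {
    "1.2a": [3, 7, 11],   # three replicas -> three OSDs
    "1.2b": [7, 2, 5],    # OSD 7 also hosts PG 1.2a
}
```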

Having learned the basic concepts above, it is time to explain the addressing process. From the addressing flowchart we can see that addressing in Ceph goes through three mappings: File -> Object, Object -> PG, and PG -> OSD. The CRUSH we have been highlighting is the third mapping, PG -> OSD. Let's look at each in turn.

File -> Object

This is a very simple step: split the file into multiple objects. Each object gets a unique ID, the OID, which is produced from the file itself.

The ino in the figure is the file's unique ID (for example, derived from the file name plus a timestamp), and ono is the ordinal of each object after slicing (0, 1, 2, 3, 4, 5, and so on). Depending on the size of the file, we therefore get a series of OIDs.

Note: splitting a file into objects of the same size lets RADOS manage them efficiently, and it turns the processing of a single file into a parallel process, which improves processing efficiency.
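As a rough sketch of this step, here is what the split could look like in Python. The 4 MB object size and the oid format (ino plus a hex-encoded ono) are assumptions chosen for illustration, not Ceph's exact naming scheme.

```python
# Minimal sketch of the File -> Object split (illustrative only; the real
# slicing is handled inside Ceph's client libraries / RADOS).
OBJECT_SIZE = 4 * 1024 * 1024  # assume 4 MB objects

def split_file(ino: str, file_size: int):
    """Yield (oid, offset, length) for each object the file is cut into."""
    ono = 0
    offset = 0
    while offset < file_size:
        length = min(OBJECT_SIZE, file_size - offset)
        oid = f"{ino}.{ono:08x}"   # oid = ino + ono (hex), an assumed format
        yield oid, offset, length
        ono += 1
        offset += length

# Example: a 10 MB file becomes three objects (4 MB + 4 MB + 2 MB).
for oid, off, length in split_file("10000000abc", 10 * 1024 * 1024):
    print(oid, off, length)
```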

Object -> PG

The job here is to map each object to a PG. The implementation is simple: hash the OID and then do a bitwise AND to compute the PG ID. The mask in the figure is the number of PGs minus 1. We can treat the resulting pgid as random; as long as there are enough PGs and enough objects, the data ends up evenly distributed.
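A minimal sketch of this mapping, directly following the formula above (hash the OID, then AND it with mask = pg_num - 1). The sha1 used here is only a stand-in for Ceph's internal hash function, and pg_num is assumed to be a power of two so the mask behaves like a modulo.

```python
import hashlib

def object_to_pg(oid: str, pg_num: int) -> int:
    """pgid = hash(oid) & mask, with mask = pg_num - 1."""
    assert pg_num & (pg_num - 1) == 0, "pg_num should be a power of two"
    mask = pg_num - 1
    h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "little")
    return h & mask

# Any object id maps to some PG in [0, pg_num - 1]; with enough objects
# and PGs the distribution is close to uniform.
print(object_to_pg("10000000abc.00000000", pg_num=128))
```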

PG -> OSD

The last mapping takes the PG in which an object resides to the actual storage locations on the OSDs. This is where the CRUSH algorithm comes in: given a pgid, CRUSH computes a set of OSDs (how many depends on the configuration).

Since we are not going into the details of how CRUSH works, let's think about what CRUSH accomplishes from a different angle. What if we did not use CRUSH and used a hash instead? Could we simply apply the same formula as above, hash(pgid) & mask = osdid?

If we also used a hash to generate the osdid, then whenever the number of OSDs changed, the mask would change, and so would the osdid we compute. That would mean the locations of existing PGs change, and the data under those PGs would have to be migrated to other OSDs, which is clearly not workable. Moreover, Ceph uses a multi-replica backup mechanism, so a PG must map to multiple OSDs, while hashing yields only one. Hence the need for CRUSH: CRUSH can dynamically compute the OSD set based on the cluster's OSD state and the configured storage policy, automatically achieving high reliability and a uniform data distribution.
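To make the problem concrete, here is a small sketch (this is not CRUSH) that maps pgids to a single OSD with a plain hash and counts how many PGs would change location when one OSD is added. The modulo form is used as a stand-in for the hash(pgid) & mask formula above; the two are equivalent when the OSD count is a power of two.

```python
import hashlib

def naive_pg_to_osd(pgid: int, osd_num: int) -> int:
    """The naive alternative discussed above: osdid = hash(pgid) % osd_num."""
    h = int.from_bytes(hashlib.sha1(str(pgid).encode()).digest()[:4], "little")
    return h % osd_num

pgs = range(1024)
before = {pg: naive_pg_to_osd(pg, 10) for pg in pgs}   # cluster with 10 OSDs
after = {pg: naive_pg_to_osd(pg, 11) for pg in pgs}    # one OSD added
moved = sum(1 for pg in pgs if before[pg] != after[pg])
print(f"{moved} of {len(pgs)} PGs would have to move")  # roughly 90% of them
```

CRUSH avoids both problems: it returns an ordered set of OSDs rather than a single one, and it is designed so that a change in the cluster map moves only a small, proportional fraction of the data.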

For the detailed implementation of CRUSH, refer to Sage Weil's paper.

Now that we have a basic understanding of the three mappings, we can see that throughout the whole process we only needed the file name and file size; we never queried anything for the file's location, because everything is obtained by computation. The Monitors we mentioned in the previous article (which provide metadata storage) actually just maintain the state information of the services in the cluster, known as the cluster map. Which OSDs hold the data is computed by the CRUSH algorithm. So the metadata service here is not like HDFS's NameNode: the NameNode keeps the exact location of every block, whereas Ceph gets away without this thanks to the logical PG layer and the CRUSH algorithm. Once you understand Ceph's addressing process, Ceph's data read and write flow will feel familiar.

It's time to look at Sage Weil's paper.

References:
Ceph official documentation
Detailed process for Ceph storage data (CRUSH)

Welcome to follow me: Three King Data (updated continuously, though not on a fixed schedule ~~~)
