Ceph's Crush algorithm example


# ceph osd tree
# id    weight      type name        up/down  reweight
-1      0.05997     root default
-2      0.02998         host osd0
1       0.009995            osd.1    up       1
2       0.009995            osd.2    up       1
3       0.009995            osd.3    up       1
-3      0.02998         host osd1
5       0.009995            osd.5    up       1
6       0.009995            osd.6    up       1
7       0.009995            osd.7    up       1

Storage node

Before going any further, consider this: Ceph is a distributed storage system, but whatever its "distributed logic" looks like, the data ultimately has to be stored on some device.

There are two options: operate on the device directly, or go through a local file system. The former means facing the raw disk and organizing the on-disk data layout yourself; the latter means not touching the raw disk directly but building on an existing file system. Ceph takes the second approach; the supported local file systems are ext4, Btrfs, and XFS, and we use ext4 here.

Crush Algorithm

Imagine you are a user with an action movie to store in a Ceph cluster. You walk into the machine room with your laptop, and the Ceph cluster in front of you is several racks of servers. You wonder: where will my movie actually end up?

What you are thinking about is the data placement problem.

There are two common ways to locate data:

    1. Recording. Store a record such as "data A: location(A)"; when accessing the data, look up the record to obtain the location, then read from there.
    2. Computing. When storing data A, compute its location(A) on the fly. This feels more convenient.

The common computing approach is consistent hashing, which is what GlusterFS uses. The basic idea: given data A, take its file name and similar information as the key, compute consistent_hash(key_A) = location(A), and store the data there.
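To make the idea concrete, here is a minimal, self-contained sketch of consistent hashing. It is illustrative only: it uses std::hash and a single point per server on the ring, rather than GlusterFS's actual elastic hashing.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::hash<std::string> h;
    // Hash ring: position -> server. One point per server for brevity;
    // real implementations place many virtual nodes per server.
    std::map<uint32_t, std::string> ring;
    for (const std::string& srv : {"server-a", "server-b", "server-c"})
        ring[static_cast<uint32_t>(h(srv))] = srv;

    // consist_hash(key_A) = location(A): walk clockwise from hash(key).
    std::string key = "bigfile";
    auto it = ring.lower_bound(static_cast<uint32_t>(h(key)));
    if (it == ring.end()) it = ring.begin();   // wrap around the ring
    std::cout << key << " -> " << it->second << "\n";
    return 0;
}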

Ceph uses the CRUSH algorithm: Controlled, Scalable, Decentralized Placement of Replicated Data. CRUSH is one of Ceph's core pieces, and it is the focus of this article.

In short, CRUSH also uses hashing to compute a location, but it makes much more use of the cluster's structural information. Let's try to understand it through an example.

The three usage scenarios mentioned earlier, RBD, CephFS, and RGW, are all built on the RADOS layer. RADOS exposes the librados interface, on top of which you can implement your own tools. Ceph ships with a program called rados by default, and with it you can upload an object directly to the Ceph cluster.

mon0# rados put bigfile bigfile.data -p rbd   // upload the file bigfile.data as object "bigfile"
mon0# rados ls -p rbd
bigfile
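The same upload can also be done programmatically through librados. Below is a minimal sketch using the librados C++ API, assuming a default ceph.conf, the default client.admin identity, and the rbd pool used in this article; error handling is mostly omitted. Link with -lrados.

#include <rados/librados.hpp>
#include <string>

int main() {
    librados::Rados cluster;
    cluster.init(NULL);                  // default client identity
    cluster.conf_read_file(NULL);        // read the default ceph.conf
    if (cluster.connect() < 0) return 1;

    librados::IoCtx io;
    if (cluster.ioctx_create("rbd", io) < 0) return 1;   // the pool used in this article

    librados::bufferlist bl;
    bl.append("contents of bigfile.data");
    io.write_full("bigfile", bl);        // store as object "bigfile"

    cluster.shutdown();
    return 0;
}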

The storage location of bigfile is computed by CRUSH, and Ceph provides a command to query the location of an object.

mon0# ceph osd map rbd bigfile
osdmap e67 pool 'rbd' (1) object 'bigfile' -> pg 1.a342bdeb (1.6b) -> up ([6,3], p6) acting ([6,3], p6)

[6,3] means that the object bigfile is stored on osd.6 and osd.3 (see the Ceph cluster deployment reference), placed in PG 1.6b, i.e. under the directory 1.6b.


Verify this:

osd0# ls /var/lib/ceph/osd/ceph-6/current/1.6b_head/ -lh
total 61M
-rw-r--r-- 1 root root 61M Oct 08:12 bigfile__head_a342bdeb__1   // note the object's on-disk name: <name>__head_<hash>__<pool id>

PG (Placement Group)

The example above mentions PG, which is an intermediate layer in Ceph's CRUSH data mapping.

Discussing PG requires mentioning pools first. A pool is a logical concept in Ceph: users can create several pools on a Ceph cluster, give them different properties, and then place different data into different pools as needed. For example, suppose I have two kinds of data, one that only needs 2 replicas and another, more important one, that needs 3 replicas; I can create two pools and set size=2 and size=3 respectively.
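For example, two such pools could be created like this (the pool names are hypothetical, and 64 is just an example pg_num):

mon0# ceph osd pool create backup2 64
mon0# ceph osd pool set backup2 size 2
mon0# ceph osd pool create backup3 64
mon0# ceph osd pool set backup3 size 3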

In my lab there is only one pool, named rbd (an arbitrary name; do not confuse it with the RBD block device usage scenario), and its replica count is 2:

mon0# ceph osd dump | grep pool
pool 1 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change ... flags hashpspool stripe_width 0

Under the pool concept, Ceph provides the concept of the placement group (PG); the number of PGs is specified by the pg_num parameter. From the output above, my rbd pool has 128 PGs.

A PG corresponds to an actual directory: the rbd pool having 128 PGs means its data is spread over 128 directories. The PG layer is added to make data management easier and to reduce the amount of metadata.

The rbd pool has 128 PGs, and which OSDs each PG is placed on is also computed; that computation is part of the CRUSH algorithm. The 128 PGs are distributed across all OSDs, so on a single OSD we do not see all 128 directories. For example, the contents of the current/ directory on one OSD look like this:

1.11_head 1.1f_head 1.27_head 1.33_head 1.39_head 1.42_head 1.4c_head 1.54_head 1.60_head 1.66_head 1.74_head 1.79_head 1.7e_head 1.8_head 1.b_head 1.f_head nosnap
1.15_head 1.21_head 1.2e_head 1.36_head 1.3_head 1.44_head 1.4f_head 1.56_head 1.61_head 1.6a_head 1.75_head 1.7b_head 1.7f_head 1.9_head 1.d_head commit_op_seq omap
1.1b_head 1.26_head 1.2f_head 1.37_head 1.41_head 1.45_head 1.50_head 1.5a_head 1.63_head 1.73_head 1.78_head 1.7c_head 1.7_head 1.a_head 1.e_head meta
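Which OSDs host a particular PG can also be queried directly from the cluster; for example, for the PG that bigfile landed in:

mon0# ceph pg map 1.6b   // prints the up and acting OSD sets for PG 1.6b, e.g. [6,3] in this cluster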

Note on naming: a PG directory is named ${pool_id}.${pg_id}_${snap}, where pg_id is a hexadecimal value and the suffix is "head" for the current (non-snapshot) contents. For example, 1.a_head is PG 0xa (10) of the 128 PGs in pool 1.

So the rough logic of CRUSH-based data location is:

    • Step 1: take the object name as input and compute which PG it should be placed in, obtaining a pgid.
    • Step 2: take the pgid as input and compute which OSDs that PG lives on (see the sketch below).
    • Step 3: access the object there.
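A minimal, self-contained sketch of steps 1 and 2 follows. The rjenkins hash and the CRUSH selection itself are only stubbed out in comments; the stable-modulo step is written out because it is simple, and the concrete numbers are the ones from the "ceph osd map" output above.

#include <cstdint>
#include <cstdio>

// Stable modulo used to fold an object hash into pg_num buckets.
// With pg_num = 128 (a power of two) and bmask = 127 this reduces to x & 127.
static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
    return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
}

int main() {
    // Step 1a: hash the object name; 0xa342bdeb is the rjenkins value of "bigfile"
    // taken from the 'ceph osd map' output above.
    uint32_t obj_hash = 0xa342bdeb;
    // Step 1b: fold the hash into one of the pool's 128 PGs.
    uint32_t pgid = stable_mod(obj_hash, 128, 127);
    printf("pg = 1.%x\n", pgid);            // prints "pg = 1.6b" (pool id 1)
    // Step 2: CRUSH maps (pool, pgid, crushmap) to an ordered OSD list, [6, 3] here;
    // that part needs the crushmap and is exercised by crush-tester.cc below.
    return 0;
}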
Crush Calculation Example

We would like to build on this understanding and dig into the code details. Since Ceph does not seem to provide an interface for invoking the CRUSH algorithm directly, we made some changes to the Ceph 0.86 code: librados was modified to expose the cluster's layout information, the CRUSH code was split out into a libcrush, and a program written against the modified librados and libcrush is used to verify the steps of the CRUSH algorithm.

    • The libcrush code and the changes to Ceph 0.86: https://github.com/xanpeng/libcrush
    • The step-by-step verification code, crush-tester.cc: https://gist.github.com/xanpeng/a41a25b5810cb2c8852c#file-crush-tester-cc

main() in crush-tester.cc shows the steps of data placement; it is logically equivalent to what "ceph osd map" does. Note: the code depends heavily on the Ceph environment we deployed earlier.

int main(int argc, char **argv) {
  assert(argc == 2);
  string objname = argv[argc - 1];

  // Assume objname = "bigfile". This step computes a numeric value from objname;
  // it does not really have anything to do with PGs yet, but the official code names
  // it this way, so we keep the same naming. The hash used is ceph_str_hash_rjenkins;
  // there is no need to scrutinize it here. Its job is to map different objnames to
  // different values with a very low collision rate.
  pg_t pg = object_to_pg(object_t(objname), object_locator_t(g_pool_id));
  printf("object_to_pg: %s -> %x\n", objname.c_str(), pg.seed);

  // The value computed from "bigfile" is a342bdeb, which is the input to this step.
  // This step maps that value, and hence the object bigfile, to a concrete PG.
  // The method is in fact a modulo over pg_num: crush_stable_mod(a342bdeb, 127) = 6b.
  pg_t mpg = raw_pg_to_pg(pg);
  printf("raw_pg_to_pg: %x\n", mpg.seed);

  // This step is the core of the CRUSH algorithm.
  // Input: pgid = 6b plus the cluster layout, i.e. the crushmap.
  // Output: which two OSDs (because size=2) store PG 6b.
  // The computation lives in libcrush; note that the program links with "-lcrush -lrados",
  // so the CRUSH code in libcrush is used instead of the copy inside librados, which
  // lets us instrument libcrush to understand the algorithm. The CRUSH computation of
  // 6b's location proceeds as shown in the trace below.
  printf("pg_to_osds:\n");
  vector<int> up;
  pg_to_up_acting_osds(mpg, &up);
}

# g++ crush_tester.cc -o test_crush -lcrush -lrados --std=c++11 -g -O0
# ./test_crush bigfile
object_to_pg: bigfile -> a342bdeb
raw_pg_to_pg: 6b
pg_to_osds:
osd_weight: 0,65536,65536,65536,0,65536,65536,65536,   // this output looked odd to me, but the final CRUSH mapping is correct (the zeros probably correspond to the nonexistent osd.0 and osd.4, and 65536 = 0x10000 is the fixed-point encoding of reweight 1)
ruleno: 0
placement_ps: 1739805228

This is the output of CRUSH computing the location of PG 6b. Note that the computation is based on the crushmap of our experimental environment.

There are several ways to obtain the current crushmap from a live Ceph cluster; the crushmap of our environment is at https://gist.github.com/xanpeng/a41a25b5810cb2c8852c#file-ceph-env-txt
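The usual way is to dump the binary map and decompile it with crushtool:

mon0# ceph osd getcrushmap -o crushmap.bin
mon0# crushtool -d crushmap.bin -o crushmap.txt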

From the perspective of the CRUSH algorithm, the crushmap tells us that our environment is layered like this:

First layer: root

id=-1, alg=straw, hash=rjenkins1, containing two items [host osd0, host osd1]

Second layer: host

host osd0: id=-2, alg=straw, hash=rjenkins1, containing three items [osd.1, osd.2, osd.3]

host osd1: id=-3, alg=straw, hash=rjenkins1, containing three items [osd.5, osd.6, osd.7] (they are 5, 6, 7 rather than 4, 5, 6 because I made a small mistake during deployment, but it does not matter)

Third layer: osd

Six OSDs in total, with ids [1, 2, 3, 5, 6, 7]. Each OSD has a weight; by adjusting an OSD's weight you influence how strongly data prefers to be stored on that OSD.
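Put together, the decompiled crushmap for this layout looks roughly like the following (weights rounded; see the gist above for the exact file):

host osd0 {
        id -2
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 0.010
        item osd.2 weight 0.010
        item osd.3 weight 0.010
}
host osd1 {
        id -3
        alg straw
        hash 0  # rjenkins1
        item osd.5 weight 0.010
        item osd.6 weight 0.010
        item osd.7 weight 0.010
}
root default {
        id -1
        alg straw
        hash 0  # rjenkins1
        item osd0 weight 0.030
        item osd1 weight 0.030
}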

---start crush_do_rule---

CHOOSE_LEAF bucket -1 x 1739805228 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0

Step 1 of the algorithm: starting from the first layer, to place the first replica, decide which host in the second layer to descend into.

Because the first layer's alg is straw, the straw algorithm is used to choose a host in the second layer; straw in turn uses the rjenkins1 hash. (A simplified sketch of straw-style selection appears after the trace.)

The first replica lands on item -3, i.e. host osd1.

crush_bucket_choose -1 x=1739805228 r=0

item -3 type 1

Step 2 of the algorithm: within host osd1, choose one OSD from the third layer.

The algorithm used is the one configured on the osd1 bucket: straw + rjenkins1.

This is the second hash in the CRUSH computation. Our experimental environment is simple, with only three layers; in an environment with a deeper hierarchy, CRUSH would need to hash more times.

The first replica is therefore stored on osd.6 of host osd1.

CHOOSE bucket -3 x 1739805228 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0

crush_bucket_choose -3 x=1739805228 r=0

item 6 type 0

CHOOSE got 6

CHOOSE returns 1

CHOOSE got -3

Step 3 of the algorithm: compute the location of the second replica, which resolves to host osd0.

crush_bucket_choose -1 x=1739805228 r=1

item -2 type 1

Step 4 of the algorithm: in the same way, the second replica's location within host osd0 is osd.3.

CHOOSE bucket -2 x 1739805228 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0

crush_bucket_choose -2 x=1739805228 r=1

item 3 type 0

CHOOSE got 3

CHOOSE returns 2

CHOOSE got -2

CHOOSE returns 2

---finish crush_do_rule---


numrep: 2, raw_osds: [6, 3]

Running crush-tester on several objects and comparing its output with "ceph osd map" confirms that the understanding above is correct.
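To make each per-bucket choice in the trace more concrete, here is a simplified, self-contained illustration of the straw idea. This is not Ceph's straw implementation, and std::hash stands in for rjenkins1; it only shows the "longest weighted straw wins" selection applied once per layer and once per replica.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Simplified straw-style selection (illustration only, not Ceph's actual code).
// Every item in a bucket draws a pseudo-random "straw" scaled by its weight,
// and the item with the longest straw is chosen.
struct Item { int id; double weight; };

int straw_choose(const std::vector<Item>& bucket, uint32_t x, uint32_t r) {
    std::hash<uint64_t> h;                   // stand-in for rjenkins1
    int best_id = 0;
    double best_draw = -1.0;
    for (const Item& it : bucket) {
        uint64_t mixed = ((uint64_t)x << 32) | ((uint32_t)it.id * 2654435761u + r);
        double draw = (h(mixed) % 65536) / 65536.0 * it.weight;
        if (draw > best_draw) { best_draw = draw; best_id = it.id; }
    }
    return best_id;
}

int main() {
    uint32_t x = 1739805228;   // the placement value (placement_ps) from the trace above
    // Layer 1: the root bucket contains the two host buckets (ids -2 and -3).
    std::vector<Item> root = { {-2, 0.030}, {-3, 0.030} };
    // Layer 2: each host bucket contains its OSDs.
    std::vector<Item> hosts[2] = {
        { {1, 0.010}, {2, 0.010}, {3, 0.010} },   // host osd0 (id -2)
        { {5, 0.010}, {6, 0.010}, {7, 0.010} },   // host osd1 (id -3)
    };
    for (uint32_t r = 0; r < 2; ++r) {            // one pass per replica
        int host = straw_choose(root, x, r);      // pick a host under root
        int osd  = straw_choose(hosts[host == -2 ? 0 : 1], x, r);  // then an OSD inside it
        printf("replica %u -> host bucket %d, osd.%d\n", r, host, osd);
        // Real CRUSH additionally retries so that two replicas never land on the same
        // host; that collision handling is omitted here. In the trace above, CRUSH
        // picked host -3 / osd.6 for r=0 and host -2 / osd.3 for r=1.
    }
    return 0;
}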

