Ceph's CRUSH algorithm (Controlled Replication Under Scalable Hashing) is a pseudo-random, controlled algorithm for data distribution and replication.
Basic principle:
Storage devices typically support striping to increase storage system throughput and improve performance, and the most common way to stripe is RAID, such as RAID 0.
Striping distributes data across the disks in the array, that is, the data is stored across all the drives in the array. A file's data is segmented into small chunks that are written to the disks in the array in sequence (round-robin); this smallest chunk is called a stripe unit.
Three factors determine how Ceph stripes data (a simplified sketch follows below):
Object size, stripe width, stripe count
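As a rough illustration of striping, the sketch below splits a byte buffer into stripe units and distributes them round-robin. The function name and parameters (stripe_unit, stripe_count) are illustrative stand-ins, not Ceph's implementation.

```python
# Minimal striping sketch (illustrative only, not Ceph's implementation).
def stripe(data: bytes, stripe_unit: int, stripe_count: int):
    """Split data into stripe units and distribute them round-robin
    across `stripe_count` stripes (e.g. disks or objects)."""
    stripes = [bytearray() for _ in range(stripe_count)]
    for i in range(0, len(data), stripe_unit):
        unit = data[i:i + stripe_unit]
        stripes[(i // stripe_unit) % stripe_count].extend(unit)
    return [bytes(s) for s in stripes]

if __name__ == "__main__":
    chunks = stripe(b"abcdefghijklmnop", stripe_unit=4, stripe_count=2)
    print(chunks)   # [b'abcdijkl', b'efghmnop']
```

In Ceph, the stripe units end up in RADOS objects bounded by the configured object size, rather than directly on raw disks as in RAID.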
Object, PG and OSD:
After striping, Ceph obtains n objects, each with a unique OID (object ID). The object ID is generated by a linear mapping: it is formed by concatenating the ID of the file's metadata with the ordinal number of the object produced by striping. Each object then has to be mapped to a PG, and this mapping consists of two parts (a small sketch follows the two steps below):
1) Compute the hash value of the object's OID using the static hash function specified by the Ceph cluster.
2) AND the hash value with a mask to obtain the PG ID.
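A minimal sketch of these two steps, under simplifying assumptions: the hash function (MD5) and the power-of-two mask below are stand-ins for illustration, since Ceph actually uses its own stable hash (rjenkins) and the pool's pg_num.

```python
import hashlib

def oid_to_pg(oid: str, pg_num: int) -> int:
    """Map an object ID to a PG ID: hash the OID, then mask it into the
    pool's PG range. (Illustrative; Ceph uses rjenkins, not MD5.)"""
    h = int.from_bytes(hashlib.md5(oid.encode()).digest()[:4], "little")
    mask = pg_num - 1           # works directly when pg_num is a power of two
    return h & mask

print(oid_to_pg("rbd_data.1234.0000000000000001", pg_num=128))
```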
From the PG, the data is mapped to the actual storage units, the OSDs. This mapping is determined by the CRUSH algorithm: taking the PG ID as input, it produces a set of n OSDs, where the first OSD serves as the primary OSD and the remaining ones serve as replica (secondary) OSDs.
Note: the result of the CRUSH algorithm is not absolutely fixed; it is affected by (1) the current system state and (2) the storage policy configuration.
In practice the policy configuration rarely changes, and a change in system state usually means a device failure; Ceph provides automated support for handling this situation.
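To give a feel for this PG-to-OSD step, the toy sketch below does a straw2-style weighted, hash-driven draw: for each replica slot every candidate OSD gets a deterministic pseudo-random draw scaled by its weight, and the best draw wins. This is a hedged simplification with made-up function names; it omits CRUSH's bucket hierarchy, failure-domain rules and retry logic.

```python
import hashlib
import math

def _draw(pg_id: int, osd_id: int, attempt: int) -> float:
    """Deterministic pseudo-random value in (0, 1] for this (PG, OSD, attempt)."""
    key = f"{pg_id}:{osd_id}:{attempt}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return (h + 1) / 2 ** 64

def crush_like_map(pg_id: int, osd_weights: dict, replicas: int) -> list:
    """Pick `replicas` distinct OSDs for a PG with a straw2-style weighted draw."""
    chosen = []
    attempt = 0
    while len(chosen) < replicas:
        best = max(
            (osd for osd in osd_weights if osd not in chosen),
            key=lambda osd: math.log(_draw(pg_id, osd, attempt)) / osd_weights[osd],
        )
        chosen.append(best)
        attempt += 1
    return chosen

# The first OSD returned acts as the primary, the rest as replicas.
print(crush_like_map(pg_id=42, osd_weights={0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}, replicas=3))
```

Because the result depends only on the PG ID, the candidate OSDs and their weights, any client holding the cluster map can recompute the same placement without consulting a central lookup table.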
In Ceph, the full name of PG is Placement Group. As the name implies, the purpose of a PG is to group certain things together logically so that they can be managed uniformly and more efficiently.
In fact, Ceph maps a number of objects onto each PG (via the hashing described above), forming a logical collection of objects; the PG then serves as the middle layer between objects and OSDs, and CRUSH replicates the PG onto multiple OSDs according to the pool's replica count. The following diagram depicts the mapping between objects, PGs and OSD daemons:
1) PGP (Placement Group for Placement purposes) determines the placement of PGs; it is the logical carrier of PGs.
2) The value of PGP should be kept equal to the value of PG; when the PG count is increased, the PGP value should be increased as well so that the two stay the same.
3) When a pool's PG count is increased, Ceph does not start rebalancing (data migration) right away; only when the PGP value is also increased do the PGs begin migrating to other OSDs and rebalancing starts (see the sketch after this list).
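A rough way to see why PGP gates rebalancing, as a heavily simplified model (not Ceph's exact logic): assume the placement seed handed to CRUSH is the PG ID folded modulo pgp_num, so a newly split PG keeps its parent's seed, and therefore its parent's OSDs, until pgp_num is raised. The helper names below (pick_osds, pg_to_osds) are hypothetical.

```python
import hashlib

def pick_osds(seed: int, osd_ids: list, replicas: int) -> list:
    """Toy placement: rank OSDs by a hash of (seed, osd) and take the top ones."""
    ranked = sorted(
        osd_ids,
        key=lambda osd: hashlib.sha256(f"{seed}:{osd}".encode()).digest(),
    )
    return ranked[:replicas]

def pg_to_osds(pg_id: int, pgp_num: int, osd_ids: list, replicas: int) -> list:
    # Simplified model: the placement seed is folded by pgp_num,
    # so pgp_num (not pg_num) is what actually drives placement.
    return pick_osds(pg_id % pgp_num, osd_ids, replicas)

osds = [0, 1, 2, 3]

# pg_num raised from 8 to 16, pgp_num still 8:
# PG 12 (split from PG 4) shares PG 4's seed, so it lands on the same OSDs.
print(pg_to_osds(4, pgp_num=8, osd_ids=osds, replicas=2))
print(pg_to_osds(12, pgp_num=8, osd_ids=osds, replicas=2))

# Raise pgp_num to 16 as well: PG 12 now gets its own seed and may migrate.
print(pg_to_osds(12, pgp_num=16, osd_ids=osds, replicas=2))
```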
In a Ceph cluster, two mappings are required when a data object is to be written to the cluster:
first from object to PG, and then from PG to OSD set.
Each mapping is independent of all other objects, which reflects CRUSH's independence (full dispersion of data) and determinism (a deterministic storage location).
CRUSH relationship analysis:
CRUSH calculates the placement of data objects according to the weights of the storage devices. During this calculation, the final location of a data object is determined by the cluster map, the data distribution policy, and a given random number (seed).
Cluster map: records all available storage resources and the spatial hierarchy between them (how many racks are in the cluster, how many servers are in each rack, and how many disks are in each server).
The cluster map is composed of devices and buckets, each of which has its own ID and weight value; together they form a tree structure with devices as the leaf nodes and buckets as the internal (trunk) nodes.
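As a small sketch of this structure, the snippet below models a toy cluster map as a tree of buckets and devices, where a bucket's weight is the sum of its children's weights. The class names are illustrative; the negative bucket IDs loosely follow Ceph's convention, but this is not Ceph's actual map encoding.

```python
from dataclasses import dataclass, field

# Toy cluster map: devices (OSDs) are leaves, buckets (hosts, racks, root) are
# internal nodes. Each node has an ID and a weight; a bucket's weight is the
# sum of its children's weights.

@dataclass
class Device:
    id: int           # e.g. osd.0
    weight: float     # usually proportional to capacity

@dataclass
class Bucket:
    id: int           # buckets use negative IDs in Ceph
    name: str
    children: list = field(default_factory=list)

    @property
    def weight(self) -> float:
        return sum(child.weight for child in self.children)

host1 = Bucket(-2, "host1", [Device(0, 1.0), Device(1, 1.0)])
host2 = Bucket(-3, "host2", [Device(2, 2.0), Device(3, 1.0)])
rack1 = Bucket(-4, "rack1", [host1, host2])
root  = Bucket(-1, "default", [rack1])

print(root.weight)   # 5.0 -- aggregated from the leaf devices
```

CRUSH walks this tree from the root downward, choosing child buckets and finally devices according to their weights and the placement rule.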