How Ceph Works and How to Install It


I. Overview

Ceph is a distributed storage system that originated in 2004 as a project aimed at building a next-generation high-performance distributed file system. With the rise of cloud computing, and riding on the momentum of OpenStack, Ceph has become one of the most closely watched projects in the open source community.
Ceph has the following advantages:

1. CRUSH algorithm

The CRUSH algorithm is one of Ceph's two key innovations. In short, Ceph abandons the traditional centralized metadata addressing scheme and instead uses the CRUSH algorithm to compute where data is placed. On top of consistent hashing, CRUSH adds the isolation of failure (disaster-recovery) domains, so that replica placement rules can be defined for all kinds of workloads, such as cross-datacenter or rack-aware placement. The CRUSH algorithm is highly scalable and in theory supports thousands of storage nodes.

2. High Availability

The number of data replicas in Ceph can be defined by the administrator, and the CRUSH algorithm can be used to place the replicas in separate failure domains, which supports strong data consistency. Ceph can tolerate many kinds of failure scenarios and automatically attempts parallel repair.

3. High scalability

Ceph differs from Swift, where all client read and write operations must pass through proxy nodes: once cluster concurrency rises, those proxy nodes easily become a single-point bottleneck. Ceph itself has no master node, so it is easier to scale out, and in theory its performance grows linearly with the number of disks.

4. Rich Features

Ceph supports three access interfaces: object storage, block storage, and file system mounting, and the three can be used together. In the cloud environments of some domestic companies, Ceph is typically used as the sole back-end storage for OpenStack to improve the efficiency of data forwarding.

II. The basic structure of Ceph

The basic composition of Ceph is shown in the figure below:

At the bottom of Ceph is RADOS, which is itself a distributed storage system; all of Ceph's storage functionality is built on top of RADOS. RADOS is written in C++ and provides a native librados API with C and C++ bindings. Ceph's upper-layer applications call the native librados API, which communicates with the other nodes in the RADOS cluster over sockets to complete the various operations.
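
As a small illustration (not part of the original text), the rados command-line tool that ships with Ceph talks to RADOS through librados. A minimal sketch, assuming the cluster is already running and a pool named testpool exists (the pool, object, and file names here are made up):

    # store a local file as an object in the pool "testpool"
    rados -p testpool put my-object ./local-file.txt
    # list the objects in the pool
    rados -p testpool ls
    # read the object back into a new local file
    rados -p testpool get my-object ./copy-of-file.txt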

RADOS Gateway (RGW) and RBD provide a higher level of abstraction and easier-to-use client interfaces on top of the librados library. RGW is a gateway that exposes RESTful APIs compatible with Amazon S3 and Swift, for use in developing object storage applications. RBD provides a standard block-device interface and is often used to create volumes for virtual machines in virtualization scenarios; Red Hat has integrated the RBD driver into KVM/QEMU to improve virtual machine access performance. Both of these interfaces are now widely used in cloud computing.
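
To get a feel for the block interface, the rbd tool can create and map an image. This is only a sketch, not taken from the original article; the pool and image names are made up, and the size is given in MB:

    # create a 1 GB block image in the pool "testpool"
    rbd create testpool/vm-disk1 --size 1024
    # list the images in the pool
    rbd ls testpool
    # map the image as a local block device (prints a device name such as /dev/rbd0)
    rbd map testpool/vm-disk1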

CephFS provides a POSIX interface that users can mount directly from the client. It runs as a kernel-space program, so it does not need to call the user-space librados library; instead it interacts with RADOS through the net module in the kernel.
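
A minimal mount sketch for the kernel client (not from the original; it assumes a monitor running on a host named node1 and the admin key exported to a local secret file):

    # mount CephFS with the kernel client; node1:6789 is a monitor address
    mount -t ceph node1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret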

III. Basic components of Ceph

As shown in the figure, Ceph has three main types of daemon:
    • OSD

      Stores all of the cluster's data as objects and handles data replication, recovery, backfilling, and rebalancing. Each OSD daemon also sends heartbeats to the other OSD daemons and reports monitoring information to the Mon nodes.
      When the Ceph storage cluster keeps two copies of the data (two replicas), at least two OSD daemons (i.e. two OSD nodes) are required for the cluster to reach the active+clean state.

    • MDS (optional)

      Provides metadata computation, caching, and synchronization for the Ceph file system. In Ceph, the metadata is ultimately stored on the OSD nodes as well, so the MDS acts more like a metadata proxy/cache server. The MDS process is not required; you only need to configure MDS nodes if you intend to use CephFS.

    • Monitor

      Monitors the state of the entire cluster, maintains the cluster's binary cluster map, and ensures the consistency of the cluster data. The cluster map describes the physical location of the object storage devices and the list of buckets that aggregate those devices into physical locations. (The command sketch after this list shows how to inspect the status of these daemons.)
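
The state reported by these daemons can be inspected with the standard Ceph CLI. The following is a minimal sketch (not part of the original text), assuming a running cluster and an admin keyring on the node where the commands are run:

    ceph -s           # overall cluster health (e.g. HEALTH_OK) and PG states such as active+clean
    ceph osd tree     # the OSD daemons, the hosts they run on, and their up/down status
    ceph mon stat     # the monitors and their quorum
    ceph mds stat     # the MDS state (only meaningful once CephFS is configured)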

IV. OSD

Let us first describe how Ceph stores data, as shown in the figure below:

Regardless of which interface is used (object, block, or file system mount), the stored data is split into objects. The object size can be adjusted by the administrator and is typically 2 MB or 4 MB. Each object has a unique OID, generated from ino and ono; although these terms look complicated, they are actually quite simple. Ino is the file ID, which uniquely identifies each file globally, and ono is the number of the shard. For example, if a file with file ID A is split into two objects, one numbered 0 and the other numbered 1, the OIDs of the two objects are A0 and A1. The benefit of the OID is that every object can be uniquely identified while the relationship between objects and their file is preserved. Because all of Ceph's data is virtualized into uniform objects, reads and writes are more efficient.

However, objects are not stored directly on the OSDs, because objects are small and a large cluster may contain up to tens of millions of them. Simply traversing that many objects to address data would be very slow, and if objects were mapped to OSDs directly through a fixed hash function, an object could not be automatically migrated to another OSD when its OSD fails (because the fixed mapping does not allow it). To address these issues, Ceph introduces the concept of the placement group (PG).

A PG is a logical concept: we can see objects directly on a Linux system, but we cannot see PGs directly. For data addressing it is similar to an index in a database: each object is mapped to a fixed PG, so when we look for an object we only need to find the PG it belongs to and then traverse that PG, rather than traversing all objects. Moreover, during data migration the PG is the basic unit of migration; Ceph does not manipulate objects directly.

How is an object mapped into a PG? Remember the OID? First, a static hash function is applied to the OID to compute a hash value, which is then taken modulo the number of PGs; the result is the pgid. Because of this design, the number of PGs directly determines how evenly the data is distributed, so a reasonable number of PGs improves the performance of the Ceph cluster and keeps the data evenly distributed.

Finally, the PG is replicated according to the number of copies set by the administrator and stored on different OSD nodes via the CRUSH algorithm (in fact, it is all the objects in the PG that are stored on those nodes). The first OSD node is the primary node and the rest are replica nodes.

The following pseudo-code briefly describes Ceph's data addressing and storage process:
    locator = object_name
    obj_hash = hash(locator)
    pg = obj_hash % num_pg
    osds_for_pg = crush(pg)    # returns a list of osds
    primary = osds_for_pg[0]
    replicas = osds_for_pg[1:]
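
On a running cluster, the same addressing chain can be observed with the CLI. A sketch with made-up pool and object names:

    # show which PG a given object name hashes into and which OSDs CRUSH selects for it;
    # the output lists the pgid and the ordered set of OSDs, the first being the primary
    ceph osd map testpool my-object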

This gives a clearer picture of how Ceph stores data: no matter which of the three interfaces the data is written through, it is eventually split into objects and stored in the underlying RADOS. Logically, the data is first mapped to a PG by the algorithm above and finally stored on the appropriate OSD nodes. In addition to the concepts described above, there is also the concept of pools.

A pool is an administrator-defined namespace which, like any other namespace, isolates objects and PGs. When we call the object storage API to store an object, we need to specify which pool the object should be stored in. Besides isolating data, we can also set different optimization strategies for different pools, such as the number of replicas, the scrubbing settings, and the data chunk and object sizes.
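
A short sketch of per-pool settings (the pool name, PG count, and replica values below are only examples, not taken from the original):

    # create a pool with 128 placement groups
    ceph osd pool create test-pool 128
    # set the replica count and the minimum replica count for this pool
    ceph osd pool set test-pool size 3
    ceph osd pool set test-pool min_size 2
    # inspect the pool's settings
    ceph osd pool get test-pool size
    ceph osd dump | grep pool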

The OSD layer is a strongly consistent distributed store; its read and write flow is shown in the figure below.


Ceph's read and write operations use a primary-replica model: when reading or writing data, a client can only send requests to the primary OSD node for the object. When the primary node receives a write request, it writes the data synchronously to the replica OSDs; only after all OSD nodes have completed the write does the primary node report write completion to the client, which guarantees strong consistency between the primary and replica copies. When reading, the client likewise sends the read request only to the primary OSD node; there is no read/write splitting as in a database, again for strong-consistency reasons. Because all writes must be handled by the primary OSD node, performance may suffer when the data volume is large. To mitigate this, and to allow Ceph to support transactions, each OSD node contains a journal file, which is described later.

That concludes the data flow; now back to the main topic: the OSD process. In Ceph, each OSD process can be called an OSD node, i.e. each storage server may host many OSD nodes, each listening on different ports, much like running multiple MySQL or Redis instances on the same server. Each OSD node can use a directory, a partition, or a whole disk as its actual storage area. For example, the current machine runs two OSD processes, and each OSD listens on four ports, used respectively for receiving client requests, transferring data, sending heartbeats, and synchronizing data.

As shown, OSD nodes listen on TCP ports 6800 to 6803 by default, and if there are multiple OSD nodes on the same server, the ports are allocated sequentially.
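
On an OSD host this can be verified directly; a small sketch, assuming the OSD daemons run as ceph-osd processes:

    # list the TCP ports the local OSD daemons are listening on
    netstat -ltnp | grep ceph-osd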

A production environment may have hundreds of OSDs or more, so each OSD has a global number, such as OSD0, OSD1, OSD2, and so on. The number is assigned in the order the OSDs were created and is globally unique. OSD nodes that store the same PG send heartbeats not only to the Mon nodes but also to each other, to verify that the copies of the PG's data are correct.

Before going further into the data flow, note that each OSD node contains a journal file, as shown below:


The default journal size is 5 GB; that is, every time an OSD node is created, 5 GB of space is taken up by the journal even before it is used. This value can be adjusted, depending on the total size of the OSD.

The journal's function is similar to the transaction log in the MySQL InnoDB engine. When a burst of bulk write operations arrives, Ceph can hold the scattered, random I/O requests in a cache, merge them, and then issue unified I/O requests to the kernel. This is more efficient, but if the OSD node crashes, the data in the cache is lost. Therefore the data is recorded in the journal before it has been written to disk, and when the OSD recovers from a crash it automatically tries to replay the cached data from the journal. As a result, journal I/O is very intensive, and because each piece of data is written twice it also costs a good deal of hardware I/O performance, so in production environments the journal files are usually stored separately on SSDs to improve Ceph's read and write performance.
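
If the journal size needs to be changed, it is set in ceph.conf before the OSDs are created. A sketch in the same echo style as the installation section below (the size is only an example):

    # journal size in MB (10 GB here); applies to OSDs created after this setting
    echo "osd journal size = 10240" >> ceph.conf
    # the journal location can also be overridden with the "osd journal" option,
    # e.g. pointing it at a file or partition on an SSD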

  

V. Monitor node

The Mon nodes monitor the status of the entire Ceph cluster and listen on TCP port 6789. Every Ceph cluster must have at least one Mon node, and the official recommendation is to deploy at least three per cluster. The Mon nodes hold the master copy of the latest version of the cluster data distribution map (the cluster map). When a client is used, it connects to port 6789 on a Mon node, downloads the latest cluster map, obtains the addresses of the relevant OSDs in the cluster via the CRUSH algorithm, and then establishes connections directly with the OSD nodes to transfer data. Ceph therefore needs no centralized master node for computation and addressing; the clients share that part of the work. And because the clients can communicate directly with the OSDs, the extra overhead of an intermediate proxy server is eliminated.
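
The cluster maps held by the monitors can be dumped for inspection. A sketch (not in the original), assuming an admin keyring; the crushtool step decompiles the binary CRUSH map into readable text:

    ceph mon dump      # the monitor map
    ceph osd dump      # the OSD map, including pools and replica counts
    # export the CRUSH map and decompile it
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt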

Mon nodes use the Paxos algorithm to keep the cluster map consistent across nodes. The Mon nodes all perform essentially the same function, and their relationship can be understood as primary and standby. If the primary Mon node fails, the cluster can still operate correctly as long as more than half of the Mon nodes survive. When a failed Mon node recovers, it proactively pulls the latest cluster map from the other Mon nodes.

The Mon nodes do not actively poll the current state of every OSD. Instead, an OSD reports its own information only in certain special cases; otherwise it simply sends heartbeats. The special cases are: (1) a new OSD joins the cluster; (2) an OSD detects an anomaly in itself or in another OSD. When the Mon nodes receive these reports, they update the cluster map and propagate it.

  

Cluster map updates are propagated asynchronously and lazily. The Monitor does not broadcast a new version to all OSDs after every cluster map update; instead, it sends the updated map back to an OSD when that OSD reports information to it. Similarly, when OSDs communicate with each other, if one finds that the other holds an older version of the cluster map, it sends its own newer version to the other side.

  

It is recommended to use the following architecture:

Here, in addition to the management network, Ceph is given two network segments: one for clients to read and write data, and the other for synchronizing data between OSD nodes and sending heartbeats. The benefit is that the I/O pressure on the network cards is shared; otherwise, client read and write speeds would become extremely slow while data is being cleaned (scrubbed).

VI. MDS

The MDS is the metadata server in a Ceph cluster. It is usually not required, because it is only needed when CephFS is used, and the other two storage interfaces are used far more widely in cloud computing.

Although the MDS is a metadata server, it is not responsible for storing the metadata: the metadata is also split into objects and stored on the OSD nodes, as shown below:

When creating CephFS, at least two pools must be created: one to store the data and the other to store the metadata. The MDS is only responsible for accepting users' metadata query requests, fetching the corresponding data from the OSDs, and mapping it into its own memory for clients to access. So the MDS works like a proxy cache server, offloading user access pressure from the OSDs, as shown below:
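
Once CephFS has been set up (see the installation section below), the two pools and the MDS state can be checked with commands such as the following (a sketch, assuming an admin keyring):

    ceph df            # per-pool usage, including the metadata pool and the data pool
    ceph mds stat      # whether an MDS daemon is up and active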

VII. Simple installation of CephFS

Before installing Ceph, it is recommended to set up passwordless SSH access between all Ceph nodes, configure hosts so that nodes can be reached by host name, synchronize time, and disable iptables and SELinux.

1. Experimental environment:

The current lab environment uses four hosts, node1 to node4, with node1 as the management node.

2. Deployment Tools:

Ceph officially provides a Python-based tool, ceph-deploy, which greatly simplifies the process of configuring a Ceph cluster and is recommended. Its yum repository address is as follows:

http://download.ceph.com/rpm-firefly/el6/noarch/
3. Installation Steps
  • Install the tool on the management host (usually a jump server)
    yum install -y ceph-deploy
  • Create a working directory to hold information such as the generated configuration file and secret key
    mkdir /ceph; cd /ceph
  • Download the Yum source as follows
    http://download.ceph.com/rpm-firefly/el6/noarch/
  • Configure the above URL as a yum repository on node1~node4 and install Ceph
    yum install -y ceph
  • On the management host, working in the /ceph directory, create a new cluster and set node1 as the Mon node
    ceph-deploy new node1
    After this command completes, you will see that three files have been generated in the /ceph directory. One of them is a configuration file in which all kinds of parameters can be tuned; Ceph reportedly has close to 1,000 tunable parameters. (Note: once the OSD processes have been created and mounted, configuration changes must be made through the command-line tools; editing the configuration file has no effect, so plan the tuning parameters in advance.)
  • Add four basic settings to ceph.conf
      echo "osd pool default size = 4" >> ceph.conf
      echo "osd pool default min size = 3" >> ceph.conf
      echo "public network = 192.168.120.0/24" >> ceph.conf
      echo "cluster network = 10.0.0.0/8" >> ceph.conf
    These set the default number of replicas per pool to 4 (every object is stored as four copies; if this is not set, the default is 3 replicas); set the minimum number of replicas to 3, so that if one of the 4 copies is damaged the remaining OSDs can still serve client reads and writes as normal; set the public network segment, used for client reads and writes; and set the cluster network segment, used by the cluster to synchronize data and send heartbeats.
  • Activate the monitor node
    ceph-deploy mon create-initial
  • Next, create the OSD nodes; in this example a whole partition is used as the physical storage area of each OSD node
    ceph-deploy osd prepare node2:/dev/sdb1 node3:/dev/sdb1 node4:/dev/sdb1
  • Synchronize the configuration files from the management node to the other nodes
    ceph-deploy --overwrite-conf admin node{1..4}
  • Create the metadata server
    ceph-deploy mds create node1
  • Create two pools; the last number is the number of PGs
    ceph osd pool create test1 256
    ceph osd pool create test2 256
  • Create a CephFS file system; note that one Ceph cluster can only have one CephFS
    ceph fs new cephfs test2 test1
    By default, the first pool listed (here test2) stores the metadata.

At this point a simple CephFS cluster has been created. You can check it with ceph -s; if the status is HEALTH_OK, the configuration succeeded.

VIII. Deleting CephFS
  • Remove Ceph from all nodes
    ceph-deploy purge node{1..4}
  • Clear all data
    ceph-deploy purgedata node{1..4}
  • Clear all secret key files
    ceph-deploy forgetkeys
IX. Conclusion

For now, Ceph is still a hot project in the open source community, but it is used mostly as back-end storage for cloud computing. The official recommendation is to use Ceph's object storage interface, which is faster and more efficient, while CephFS is not recommended for direct use in production. The above is only the tip of the iceberg: Ceph is far more complex than described here and supports many more features, such as erasure coding. For this reason, most companies that use Ceph in production have a dedicated team doing secondary development on Ceph, and operating Ceph is relatively difficult. But with reasonable tuning, Ceph's performance and stability are worth looking forward to.


