Ceph Translations: RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters

Last Update: 2016-05-30 Source: Internet

Author: User

RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters

Paper translation

Abstract

Brick and object-based storage architectures have emerged as a way to improve the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, even though those nodes are capable of considerable intelligence and autonomy. We present the design and implementation of RADOS, a reliable object storage service that scales to many thousands of devices by leveraging the intelligence of each individual storage node. RADOS preserves consistent data access and strong safety semantics while allowing nodes to act semi-autonomously, using a cluster map to self-manage replication, failure detection, and failure recovery. Our implementation offers excellent performance, reliability, and scalability while presenting clients with a single logical object store.

1 Introduction

Providing reliable, high-performance storage that scales with demand poses many challenges for system designers. High throughput and low latency for file systems, databases, and related storage abstractions are critical to a wide range of applications. Emerging clustered storage architectures built from storage bricks or object storage devices (OSDs) seek to distribute low-level block allocation decisions and security enforcement to intelligent storage devices, simplifying data layout and eliminating I/O bottlenecks by giving clients direct access to data. OSDs built from commodity components combine a CPU, a network interface, and a local cache with an underlying disk or RAID, and replace the conventional block-level storage interface with one based on named, variable-length objects.

However, systems that adopt this architecture at large scale largely fail to exploit device intelligence. As in conventional storage systems built on local or storage-area-network (SAN) attached devices, or on devices compliant with the T10 OSD protocol, devices passively respond to read and write commands despite their potential to encapsulate significant intelligence. As a storage cluster grows to thousands of nodes or more, consistent management of data placement, failure detection, and failure recovery places an increasingly heavy burden on clients, controllers, or metadata directory nodes, which limits scalability.

We have designed and implemented RADOS, a Reliable, Autonomic Distributed Object Store that seeks to leverage device intelligence to distribute the complexity of consistent data access, redundant storage, failure detection, and failure recovery across clusters of thousands of storage devices. Built as part of the Ceph distributed file system, RADOS facilitates an evolving, balanced distribution of data and workload across a dynamic, heterogeneous storage cluster, while providing applications with a single logical object store with well-defined safety semantics and strong consistency guarantees.

At the petabyte scale, storage systems are necessarily dynamic: they are built incrementally, they grow and contract as new storage is deployed and old devices are decommissioned, devices fail and recover on a continuous basis, and large amounts of data are created and destroyed. RADOS ensures a consistent view of the data distribution and consistent read and write access to objects through the cluster map. The map is replicated by all parties (storage and client nodes alike) and is updated by lazily propagating small incremental updates.

By giving storage nodes complete knowledge of the distribution of data in the system, devices can act semi-autonomously, using peer-to-peer-like protocols to self-manage data replication, apply updates consistently and safely, participate in failure detection, and respond to device failures and the resulting changes in data distribution by re-replicating or migrating data objects. This eases the burden on the small monitor cluster that manages the master copy of the cluster map, and through it the rest of the storage cluster, enabling the system to scale seamlessly from a few dozen to many thousands of devices.

Our prototype implementation exposes an object interface in which byte extents can be read or written (much like a file), as that was our initial requirement for Ceph. Data objects are replicated n ways across multiple OSDs to protect against node failures. However, the scalability of RADOS in no way depends on a specific object interface or redundancy strategy; objects that store key/value pairs and parity-based (RAID) redundancy are both planned.

2 Scalable Cluster Management

A RADOS system consists of a large collection of OSDs plus a small group of monitors responsible for managing OSD cluster membership. Each OSD includes a CPU, some volatile memory, a network interface, and a locally attached disk or RAID. Monitors are stand-alone processes and require a small amount of local storage.

2.1 Cluster Map

The storage cluster is managed exclusively through manipulation of the cluster map by the monitor cluster. The cluster map specifies which OSDs are included in the cluster and compactly specifies the distribution of all data in the system across those devices. It is replicated by every storage node as well as by clients interacting with the RADOS cluster. Because the cluster map completely specifies the data distribution, clients expose a simple interface that treats the entire storage cluster (potentially tens of thousands of nodes) as a single logical object store.

Each time the cluster map changes because of an OSD status change (for example, a device failure) or another event that affects data layout, the map epoch is incremented. Map epochs allow communicating parties to agree on what the current distribution of data is and to determine when their information is out of date. Because cluster map changes may be frequent, as in a very large system where OSD failures and recoveries are the norm, updates are distributed as incremental maps: small messages describing the differences between two successive epochs. In most cases, such updates simply state that one or more OSDs have failed or recovered, although in general they may include status changes for many devices, and multiple updates may be bundled together to describe the difference between distant map epochs.
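The epoch-and-incremental scheme can be sketched as follows. This is a minimal illustration rather than the Ceph implementation: the `ClusterMap` and `Incremental` classes, their fields, and the string states are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Incremental:
    """Difference between two successive map epochs (hypothetical structure)."""
    epoch: int                  # epoch that results from applying this increment
    osd_state: Dict[int, str]   # changed OSDs only, e.g. {3: "down"}


@dataclass
class ClusterMap:
    """Toy cluster map: an epoch plus the state of each OSD."""
    epoch: int = 0
    osd_state: Dict[int, str] = field(default_factory=dict)

    def apply(self, inc: Incremental) -> None:
        # Increments must be applied in order, one epoch at a time.
        if inc.epoch != self.epoch + 1:
            raise ValueError(f"expected epoch {self.epoch + 1}, got {inc.epoch}")
        self.osd_state.update(inc.osd_state)
        self.epoch = inc.epoch


cluster_map = ClusterMap()
cluster_map.apply(Incremental(1, {0: "up", 1: "up", 2: "up"}))
cluster_map.apply(Incremental(2, {1: "down"}))   # a single OSD failure
print(cluster_map.epoch, cluster_map.osd_state)
```

A node that is several epochs behind would apply the bundled increments one after another in the same way.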

2.2 Data Placement

RADOS employs a data distribution policy in which objects are pseudo-randomly assigned to devices. When new storage is added, a random subsample of existing data is migrated to the new devices to restore balance. This strategy is robust in that it maintains a probabilistically balanced distribution that, on average, keeps all devices similarly loaded, allowing the system to perform well under any potential workload. Most importantly, data placement is a two-stage process that calculates the proper location of each object; no large or cumbersome centralized allocation table is needed.

Each object stored by the system is first mapped to a placement group (PG), a logical collection of objects that are replicated by the same set of devices. Each object's PG is determined by a hash of the object name o, the desired level of replication r, and a bit mask m that controls the total number of placement groups in the system. That is, pgid = (r, hash(o) & m), where & is a bit-wise AND and the mask m = 2^k - 1, which constrains the number of PGs to a power of two. As the cluster scales, it is periodically necessary to adjust the total number of placement groups by changing m; this is done gradually to throttle the resulting migration of PGs between devices.
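The object-to-PG step can be written out directly. The sketch below assumes SHA-1 truncated to 32 bits as the hash, which is a stand-in chosen for this example and not necessarily the hash RADOS uses; the function names are likewise hypothetical.

```python
import hashlib
from typing import Tuple


def object_hash(name: str) -> int:
    """Deterministic 32-bit hash of an object name (SHA-1 is a stand-in)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def pg_mask(k: int) -> int:
    """Bit mask m = 2^k - 1, which limits the number of PGs to 2^k."""
    return (1 << k) - 1


def pgid(name: str, r: int, m: int) -> Tuple[int, int]:
    """pgid = (r, hash(o) & m)."""
    return (r, object_hash(name) & m)


m = pg_mask(10)                    # 1024 placement groups
print(pgid("volume1/object42", r=3, m=m))
```

Because the mask keeps only the low-order bits of the hash, raising k by one leaves an object's PG number either unchanged or increased by exactly 2^k, so each existing PG splits into at most two.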

Placement groups are assigned to OSDs based on the cluster map, which maps each PG to an ordered list of r OSDs on which to store object replicas. Our implementation uses CRUSH, a robust replica distribution algorithm that calculates a stable, pseudo-random mapping. (Other placement strategies are possible; even for very large clusters, an explicit table mapping each PG to a set of devices is still relatively small, on the order of megabytes.) From a high level, CRUSH behaves like a hash function: placement groups are deterministically but pseudo-randomly distributed. Unlike a hash function, however, CRUSH is stable: when one or more devices join or leave the cluster, most PGs remain where they are, and CRUSH shifts just enough data to maintain a balanced distribution. In contrast, hashing approaches typically force a reshuffle of most prior mappings. CRUSH also controls the relative amount of data assigned to each device based on its capacity or performance.
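The stability property can be illustrated with rendezvous (highest-random-weight) hashing. This is a deliberately simpler technique substituted for CRUSH, which is not reproduced here; the function names and OSD labels are hypothetical, and device weights are ignored.

```python
import hashlib
from typing import List


def score(pg: int, osd: str) -> int:
    """Pseudo-random but deterministic score for a (PG, OSD) pair."""
    digest = hashlib.sha1(f"{pg}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def map_pg(pg: int, osds: List[str], r: int) -> List[str]:
    """Ordered list of r OSDs for a PG: the r highest-scoring devices."""
    return sorted(osds, key=lambda osd: score(pg, osd), reverse=True)[:r]


osds = [f"osd{i}" for i in range(10)]
before = {pg: map_pg(pg, osds, 3) for pg in range(64)}

# Remove one device and recompute every mapping.
remaining = [osd for osd in osds if osd != "osd3"]
after = {pg: map_pg(pg, remaining, 3) for pg in range(64)}

moved = [pg for pg in before if before[pg] != after[pg]]
print(f"{len(moved)} of {len(before)} PGs changed their OSD list")
```

In this stand-in, only the PGs that had a replica on the removed device change their OSD list; every other mapping is untouched, which is the behavior the paragraph above contrasts with a plain modulo-style hash.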

Placement groups provide a means of controlling the level of replication declustering. That is, an OSD neither shares all of its replicas with one or a few other devices (mirroring) nor shares each object with a different device (complete declustering); instead, the number of replication peers is related to the number of PGs μ it stores, typically on the order of 100 PGs per OSD. Because the distribution is stochastic, μ also affects the variance in device utilization: more PGs per OSD yield a more balanced distribution. More importantly, declustering facilitates distributed, parallel failure recovery by allowing each PG to be independently re-replicated from and to different OSDs. At the same time, the system can limit its exposure to coincident device failures by restricting the number of OSDs with which each device shares common data.

2.3 Device State

The cluster map includes a description and the current state of the devices over which data is distributed. It contains the current network address of every OSD that is online and reachable (up) and indicates which devices are currently down. RADOS also takes a second dimension of OSD liveness into account: in devices are included in the mapping and assigned placement groups, whereas out devices are not.

For each PG, CRUSH produces a list of r OSDs from the mapping. RADOS then filters out devices that are down to produce the list of active OSDs for the PG. If the active list is empty, the PG's data is temporarily unavailable and pending I/O is blocked.
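The filtering step is small enough to show in full. The sketch below assumes OSDs are identified by integers and that up/down state is available as a dictionary; both are choices made for this example only.

```python
from typing import Dict, List


def active_osds(mapped: List[int], up: Dict[int, bool]) -> List[int]:
    """Keep the mapped order but drop OSDs that are marked down."""
    return [osd for osd in mapped if up.get(osd, False)]


def io_blocked(mapped: List[int], up: Dict[int, bool]) -> bool:
    """I/O to the PG blocks when no mapped OSD is up."""
    return not active_osds(mapped, up)


up = {0: True, 1: False, 2: True}
print(active_osds([1, 2, 0], up))   # OSD 1 is down, so it is filtered out
print(io_blocked([1], up))          # the only mapped OSD is down
```

Preserving the order of the mapped list matters because the first entry of the active list acts as the primary for the PG.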

OSDs are normally both up and in the mapping, in which case they actively service I/O, or both down and out if they have failed, producing an active list of exactly r OSDs. OSDs may also be down but still in the mapping, meaning that they are currently unreachable but PG data has not yet been remapped to another OSD (similar to the degraded mode in RAID systems). Conversely, they may be up and out, meaning that they are online but idle. This facilitates a variety of scenarios, including tolerance of intermittent periods of unavailability (for example, an OSD reboot or a network hiccup) without initiating any data migration, the ability to bring newly deployed storage online without using it immediately (for example, for testing), and the ability to safely migrate data off old devices before they are decommissioned.

2.4 Map Propagation

Because a RADOS cluster may include many thousands of devices, it is not practical to simply broadcast map updates to every node. Fortunately, differences in map epochs matter only when they vary between two communicating OSDs (or between a client and an OSD), which must then agree on their proper roles with respect to the data being accessed. This property allows RADOS to distribute map updates lazily by combining them with existing inter-OSD messages, effectively shifting the distribution burden to the OSDs.

Each OSD maintains a history of past map incrementals, tags every message with its latest epoch, and keeps track of the most recent epoch observed from each peer. If an OSD receives a message from a peer with an older map, it shares the necessary incrementals to bring that peer in sync. Similarly, when contacting a peer believed to have an older map, incremental updates are shared preemptively. The heartbeat messages exchanged periodically for failure detection ensure that updates spread quickly, in O(log n) time for a cluster of n OSDs.
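A minimal sketch of this piggybacked map sharing follows. The `Osd` class, its fields, and the `send` function are hypothetical names for this example; real increments are structured records rather than strings, and history trimming is ignored.

```python
from typing import Dict


class Osd:
    """Toy OSD that stores map increments keyed by the epoch they produce."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.epoch = 0
        self.history: Dict[int, str] = {}      # epoch -> increment payload
        self.peer_epoch: Dict[str, int] = {}   # newest epoch seen per peer

    def learn(self, increments: Dict[int, str]) -> None:
        # Apply increments in epoch order, skipping ones already known.
        for epoch in sorted(increments):
            if epoch == self.epoch + 1:
                self.history[epoch] = increments[epoch]
                self.epoch = epoch

    def increments_after(self, epoch: int) -> Dict[int, str]:
        return {e: inc for e, inc in self.history.items() if e > epoch}


def send(sender: Osd, receiver: Osd) -> None:
    """Deliver one epoch-tagged message and let the stale side catch up."""
    # Preemptively attach what the receiver is believed to lack.
    believed = sender.peer_epoch.get(receiver.name, 0)
    receiver.learn(sender.increments_after(believed))
    # If the sender turned out to be stale, the receiver replies with increments.
    if sender.epoch < receiver.epoch:
        sender.learn(receiver.increments_after(sender.epoch))
    sender.peer_epoch[receiver.name] = receiver.epoch
    receiver.peer_epoch[sender.name] = sender.epoch


a, b, c = Osd("a"), Osd("b"), Osd("c")
a.learn({1: "osd7 down", 2: "osd7 up"})   # a heard two updates from a monitor
send(a, b)                                 # b catches up from a's message
send(c, b)                                 # stale c is corrected by b's reply
print(a.epoch, b.epoch, c.epoch)
```

No monitor is involved in the second and third steps: the update reaches `c` purely through ordinary OSD-to-OSD traffic.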

For example, when an OSD first boots, it informs a monitor that it has come online with an OSDBoot message that includes its most recent map epoch. The monitor cluster changes the OSD's status to up and replies with the incremental updates needed to bring the OSD fully up to date. When the new OSD begins contacting the OSDs with which it shares data, those OSDs learn about the appropriate map updates. Because a booting OSD does not yet know exactly which epochs its peers have, it shares a safe recent history of incremental updates.

This preemptive map-sharing strategy is conservative: an OSD always shares an update when contacting a peer unless it is certain the peer has already seen it, so an OSD may receive duplicates of the same update. However, the number of duplicates an OSD receives is bounded by the number of peers it has, which is in turn determined by the number of PGs it manages. In practice, we find that the actual level of update duplication is lower than this bound.

3 Smart Storage Devices

The knowledge of data distribution encapsulated in the cluster map allows RADOS to distribute management of data redundancy, failure detection, and failure recovery to the OSDs that make up the storage cluster. By adopting peer-to-peer-like protocols, the intelligence of the OSD nodes is fully exploited in a high-performance cluster environment.

RADOS implements n-way replication combined with per-object versions and short-term logs. Replication is performed by the OSDs themselves: clients submit a single write operation to the primary OSD, which is then responsible for consistently and safely updating all replicas. This shifts replication-related bandwidth to the storage cluster's internal network and simplifies client design. Object versions and short-term logs facilitate fast recovery when a node fails intermittently.

We describe how the RADOS cluster architecture, and in particular the cluster map, enables distributed replication and recovery, and how these capabilities can be generalized to include other redundancy mechanisms (such as parity-based RAID codes).

Figure 2

3.1 Replication

RADOS implements three replication schemes: primary-copy, chain, and a hybrid called splay replication. The messages exchanged during an update operation are shown in Figure 2. In all cases, clients send I/O operations to a single OSD, and the cluster ensures that replicas are safely updated and that consistent read/write semantics are preserved. Once all replicas are updated, a single acknowledgement is returned to the client.

Primary-copy replication updates all replicas in parallel and processes both reads and writes at the primary OSD. Chain replication instead updates replicas in series: writes are sent to the primary (the head) and reads to the tail, ensuring that reads always reflect fully replicated updates. Splay replication combines the parallel updates of primary-copy replication with the read/write role separation of chain replication. Its main advantage is a lower number of message hops for 2-way mirroring.

3.2 Strong consistency

All RADOS messages, both those originating from clients and those from other OSDs, are tagged with the sender's map epoch to ensure that all update operations are applied consistently with respect to the latest map. If a client sends an I/O to the wrong OSD because its map is out of date, the OSD responds with the appropriate incrementals so that the client can resend the request to the correct OSD. This avoids the need to proactively share map updates with clients: they learn about them as they interact with the storage cluster. Most of the time, they learn about updates that do not affect the current operation but allow future I/Os to be directed accurately.

If the master copy of the cluster map has been updated to change a particular PG's membership, updates may still be processed by the old members, provided they have not yet heard of the change. If the change is learned by a PG replica first, it is discovered when the primary OSD forwards updates to the replicas and they respond with the new incremental map updates. This is completely safe because any set of OSDs newly responsible for a PG must contact the OSDs previously responsible for it in order to determine the PG's correct contents; this ensures that the prior OSDs learn of the change and stop performing I/O before the newly responsible OSDs start.

Achieving similar consistency for read operations is slightly less natural than for updates. In the event of a network failure that leaves an OSD only partially unreachable, the OSD servicing reads for a PG could be declared failed yet still be reachable by clients with an old map. Meanwhile, the updated map may specify a new OSD in its place. To prevent read operations from being processed by the old OSD after new updates have been processed by the new one, we require timely heartbeat messages between the OSDs in each PG in order for the PG to remain readable. That is, if the OSD servicing reads has not heard from the other replicas in H seconds, reads block. Before another OSD takes over the primary role for a PG, it must either obtain positive acknowledgement from the old OSD (ensuring that it is aware of its role change) or delay for the same time interval. In the current implementation, we use a relatively short heartbeat interval of two seconds. This ensures timely failure detection and a short interval of PG data unavailability when a primary OSD fails.
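The read-blocking rule and the takeover condition can be sketched as two small checks. The `ReadGate` class and `may_take_over` function are hypothetical names, time is passed in explicitly to keep the example deterministic, and the value H = 2.0 used below is only an illustrative assumption.

```python
class ReadGate:
    """Tracks whether an OSD may keep serving reads for one PG."""

    def __init__(self, h_seconds: float) -> None:
        self.h_seconds = h_seconds
        self.last_heard = None   # time of the latest replica heartbeat

    def heartbeat(self, now: float) -> None:
        self.last_heard = now

    def may_serve_reads(self, now: float) -> bool:
        # Reads block once the replicas have been silent for H seconds.
        if self.last_heard is None:
            return False
        return now - self.last_heard < self.h_seconds


def may_take_over(acked_by_old_primary: bool, waited: float, h_seconds: float) -> bool:
    """A new primary needs an acknowledgement or must wait out the interval."""
    return acked_by_old_primary or waited >= h_seconds


gate = ReadGate(h_seconds=2.0)
gate.heartbeat(now=10.0)
print(gate.may_serve_reads(now=11.5))   # replicas heard from recently
print(gate.may_serve_reads(now=12.0))   # silent for H seconds: reads block
```

The two checks use the same interval, so by the time a new primary is allowed to proceed without an acknowledgement, the old one has already stopped serving reads.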

3.3 Fault Detection

RADOS employs an asynchronous, ordered, point-to-point message-passing library for communication. A failure on a TCP socket results in a limited number of reconnect attempts before the failure is reported to the monitor cluster. Storage nodes exchange periodic heartbeat messages with their peers (those OSDs with which they share PG data) to ensure that device failures are detected in a timely manner. OSDs that discover that they have been marked down simply sync to disk and kill themselves to ensure consistent behavior.

3.4 Data migration and failure recovery

RADOS data migration and failure recovery are driven entirely by cluster map updates and the resulting changes in the mapping of PGs to OSDs. Such changes may be due to device failures, recoveries, cluster expansion or contraction, or even a complete reshuffling of data under a new CRUSH policy. Device failure is simply one of many possible causes of a new distribution of data being established.

RADOS makes no continuity assumptions about data distribution between one map and the next. In all cases, RADOS employs a robust peering algorithm to establish a consistent view of PG contents and to restore the proper distribution and replication of data. This strategy relies on the basic design premise that OSDs aggressively replicate a PG log, which records what the current contents of a PG should be (that is, which object versions it contains), even when object replicas themselves may be missing locally. Thus, even if recovery is slow and object safety is degraded for some time, PG metadata is carefully guarded, which simplifies the recovery algorithm and allows the system to reliably detect data loss.

3.4.1 Peering

When an OSD receives a cluster map update, it walks through all new map incrementals, up to the most recent, to examine and possibly adjust PG state values. Any locally stored PG whose active list of OSDs changes must re-peer. Considering all map epochs (not just the most recent) ensures that intermediate data distributions are taken into account: if an OSD is removed from a PG and then added again, it is important to realize that the PG's contents may have been updated elsewhere in the interim. As with replication, peering (and any subsequent recovery) proceeds independently for every PG in the system.

Peering is driven by the first OSD in the PG (the primary). For each PG that an OSD stores but for which it is not the current primary, it sends a Notify message to the current primary. This message includes basic state information about the locally stored PG, including the most recent update, the bounds of the PG log, and the most recent known epoch during which the PG successfully peered. Notify messages ensure that a new primary for a PG discovers its new role without having to consider all possible PGs (of which there may be millions) for every map change. Once aware, the primary generates a prior set, which includes all OSDs that may have participated in the PG since it last successfully peered. The prior set is explicitly queried in order to reach a stable state, which avoids waiting indefinitely for a prior OSD that does not actually store the PG.

Armed with PG metadata for the entire prior set, the primary can determine the most recent update applied on any replica and request whatever log fragments are necessary from prior OSDs in order to bring the PG logs up to date on the active replicas. If the available PG logs are insufficient (for example, if one or more OSDs has no data for the PG), a list of the complete PG contents is generated. For node reboots or other short outages, however, this is not necessary: the recent PG logs are sufficient to resynchronize replicas quickly.
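The primary's bookkeeping in this step can be sketched as follows. The example assumes that PG log versions are plain integers and that each OSD's log is contiguous up to its newest entry; both are simplifications for illustration, and `plan_log_sync` is a hypothetical name.

```python
from typing import Dict, Tuple


def plan_log_sync(last_update: Dict[str, int]) -> Tuple[int, Dict[str, Tuple[int, int]]]:
    """Return the newest version and the log range each OSD is missing.

    last_update maps an OSD name to the newest PG log version it holds.
    """
    newest = max(last_update.values())
    missing = {
        osd: (version + 1, newest)          # inclusive range of log entries
        for osd, version in last_update.items()
        if version < newest
    }
    return newest, missing


newest, missing = plan_log_sync({"osd1": 42, "osd4": 40, "osd7": 42})
print(newest, missing)
```

Here only `osd4` is behind, so it is the only OSD that needs a log fragment, covering versions 41 through 42.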

Finally, the primary OSD shares the missing log fragments with the other replica OSDs, so that all replicas know which objects the PG should contain (even if they are still missing locally), and the recovery process can then proceed in the background.

3.4.2 Recovery

An important advantage of declustered replication is the ability to parallelize failure recovery. The replicas shared with any single failed device are spread across many other OSDs, and each PG independently chooses a replacement, allowing re-replication to just as many OSDs. On average, in a large system, any OSD involved in recovery from a single failure pushes or pulls content for only a single PG, which makes recovery very fast. Recovery is motivated by the observation that I/O is most often limited by read (rather than write) throughput. Although each individual OSD, armed with all PG metadata, could independently fetch any missing objects, this strategy has two limitations. First, multiple OSDs independently recovering objects in the same PG will probably not pull the same objects from the same OSDs at the same time, which duplicates the most expensive part of recovery: seeking and reading. Second, the replica update protocol becomes complex if replicas are missing the objects being modified.

For these reasons, PG recovery in RADOS is coordinated by the primary. As before, operations on missing objects are delayed until the primary has a local copy. Because the primary already knows from the peering process which objects every replica is missing, it can preemptively push any missing object to the replica OSDs, which simplifies the replication logic and also ensures that the surviving copy of the object is read only once. If the primary is pushing an object, or if it has just pulled an object for itself, it always pushes it to all replicas that need a copy. Thus, in the aggregate, every re-replicated object is read only once.
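The read-once property follows from grouping pushes by object rather than by replica. The sketch below assumes the primary already holds every object in question; `plan_pushes` and the OSD and object names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_pushes(missing: Dict[str, Set[str]]) -> Tuple[int, List[Tuple[str, str]]]:
    """Group missing objects so that each one is read from disk only once.

    missing maps a replica name to the object names it lacks.
    Returns (number of reads, list of (object, replica) pushes).
    """
    wanted_by: Dict[str, List[str]] = {}
    for replica in sorted(missing):
        for obj in sorted(missing[replica]):
            wanted_by.setdefault(obj, []).append(replica)

    reads = 0
    pushes: List[Tuple[str, str]] = []
    for obj in sorted(wanted_by):
        reads += 1                           # one read per distinct object
        for replica in wanted_by[obj]:
            pushes.append((obj, replica))    # push to every replica that needs it
    return reads, pushes


reads, pushes = plan_pushes({"osd2": {"a", "b"}, "osd5": {"b"}})
print(reads, pushes)
```

Object `b` is missing on two replicas but is counted as a single read, whereas two replicas pulling independently would each have triggered their own read.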

4 Monitor

Monitors form a small cluster that manages the entire storage system by storing the master copy of the cluster map and making periodic updates in response to configuration changes or changes in OSD state. The cluster is built in part on the Paxos part-time parliament algorithm and is designed to favor consistency and durability over availability and update latency. Notably, a majority of monitors must be available in order to read or update the cluster map, and changes to the map are guaranteed to be durable.

4.1 Paxos Service

The cluster is based on a distributed state machine service, built on Paxos, in which the cluster map is the current machine state and each successful update results in a new map epoch. The implementation simplifies standard Paxos slightly by allowing only a single map mutation at a time, and combines the basic algorithm with a lease mechanism that allows requests to be directed at any monitor while ensuring a consistent ordering of read and update operations.

The cluster initially elects a leader to serialize map updates and manage consistency. Once elected, the leader begins by requesting the map epochs stored by each monitor. Monitors have a fixed amount of time to respond to the probe and join the quorum. If a majority of the monitors are active, the first phase of the Paxos algorithm ensures that each monitor has the most recently committed map epoch, and the leader then begins distributing short-term leases to the active monitors.

Each lease grants an active monitor permission to distribute copies of the cluster map to the OSDs and clients that request it. If the lease term T expires without being renewed, the leader is assumed to have failed and a new election is called. Each lease is acknowledged to the leader upon receipt. If the leader does not receive timely acknowledgements, it assumes that an active monitor has crashed and calls a new election to establish a new quorum. A new election is also triggered when a monitor first boots, or when a previously called election does not complete within a certain interval.

When an active monitor receives an update request (for example, a failure report), it first checks whether the request is new. If, for example, the OSD in question has already been marked down, the monitor simply responds with the necessary incremental map updates to bring the reporting OSD up to date. New failures are forwarded to the leader, which initiates a map update by incrementing the map epoch and using the Paxos update protocol to distribute the update proposal to the other monitors, simultaneously revoking leases. If the update is acknowledged by a majority of monitors, a final commit message issues a new lease.
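The majority condition for committing a map update is a one-line check. This sketch shows only that check, not Paxos itself; it assumes the acknowledgement count includes the leader's own vote, and both function names are hypothetical.

```python
def has_majority(acks: int, monitors: int) -> bool:
    """True when strictly more than half of the monitors acknowledged."""
    return acks > monitors // 2


def commit_update(epoch: int, acks: int, monitors: int) -> int:
    """Return the map epoch after an update attempt."""
    return epoch + 1 if has_majority(acks, monitors) else epoch


print(commit_update(epoch=7, acks=3, monitors=5))   # majority reached
print(commit_update(epoch=7, acks=2, monitors=5))   # no majority, map unchanged
```

With an even number of monitors the requirement is still strictly more than half, so 2 of 4 acknowledgements is not enough.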

The combination of a two-phase commit and periodic probing ensures that if the active set of monitors changes, all prior leases are guaranteed to have expired before any subsequent map update takes place. Consequently, any sequence of map queries and updates results in a consistent progression of map versions. Significantly, map versions never go backwards, regardless of which monitor messages are sent to and despite any intervening monitor failures, provided that a majority of monitors are available.

4.2 Workloads and scalability

In the general case, monitors do very little work: most map distribution is handled by the storage nodes, and device state changes are normally infrequent.

The lease mechanism used within the monitor cluster allows any monitor to service requests from OSDs or clients for a copy of the cluster map. Such requests rarely originate from OSDs because of the preemptive map sharing, and clients typically request updates only when an OSD operation times out and a failure is suspected. The monitor cluster can be expanded to distribute this workload for larger clusters.

Requests that require a map update are forwarded to the current leader. The leader aggregates multiple changes into a single map update, so that the frequency of map updates is tunable and independent of the size of the cluster. Nonetheless, the worst-case load occurs when large numbers of OSDs appear to fail within a short period. If each OSD stores μ PGs and f OSDs fail, then at most on the order of μf failure reports are generated, which can be very large when the OSD cluster is large. To prevent such a deluge of messages, OSDs send heartbeats at pseudo-random intervals to stagger the detection of failures, and then throttle and batch failure reports. Monitors that are not the leader forward reports of any given failure only once, so that the request load on the leader is proportional to fm, where m is the number of monitors in the cluster.
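As a worked example of these two bounds, the snippet below multiplies out the number of PGs per OSD, the number of failed OSDs, and the number of monitors. The specific values (100 PGs per OSD, 50 failures, 5 monitors) are illustrative assumptions, not measurements from the paper, and the function names are hypothetical.

```python
def max_failure_reports(pgs_per_osd: int, failed_osds: int) -> int:
    """Upper bound on failure reports: PGs per OSD times failed OSDs."""
    return pgs_per_osd * failed_osds


def leader_requests(failed_osds: int, monitors: int) -> int:
    """Forwarded reports reaching the leader: failed OSDs times monitors."""
    return failed_osds * monitors


print(max_failure_reports(100, 50))   # 100 PGs per OSD, 50 failed OSDs
print(leader_requests(50, 5))         # 50 failed OSDs, 5 monitors
```

Under these assumptions the OSDs could generate up to 5000 reports, while the leader sees on the order of 250 forwarded requests, a load that depends on the number of monitors rather than on the number of PGs.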

5 Partial Evaluation

The performance of the object storage layer (EBOFS) that Ceph uses on each OSD has been measured previously. Similarly, the data distribution properties of CRUSH and their effect on aggregate cluster throughput have been evaluated elsewhere. In this short paper we focus only on map distribution, as it directly affects the cluster's ability to scale. We have not yet experimentally evaluated monitor cluster performance, although we are confident that the architecture is scalable.

5.1 Map Propagation

The RADOS map distribution algorithm (Section 2.4) ensures that an update reaches all OSDs after only log n hops. However, as the cluster grows, device failures become more frequent and the number of updates increases. Because map updates are exchanged only between OSDs that share PGs, the number of copies of a single update that an OSD can receive is bounded above by a quantity proportional to the number of PGs μ it stores.
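The logarithmic spreading time can be explored with a toy push-gossip model. This is not the simulation used in the paper: it assumes that in every round each informed node contacts one uniformly random node, whereas RADOS shares updates only with PG peers. The function name and the seed are arbitrary choices for this example.

```python
import math
import random
from typing import Set


def rounds_to_spread(n: int, seed: int = 0) -> int:
    """Rounds until all n nodes hold an update under push gossip."""
    rng = random.Random(seed)
    informed: Set[int] = {0}          # node 0 receives the update first
    rounds = 0
    while len(informed) < n:
        # Every informed node tells one uniformly random node per round.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


for n in (64, 1024, 16384):
    print(n, rounds_to_spread(n), math.ceil(math.log2(n)))
```

Since the informed set can at most double in a round, log2(n) is a hard lower bound on the number of rounds; the printed values show how close this simple model comes to it as n grows by a factor of 16 each time.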

In simulations under near-worst-case propagation circumstances with regular map updates, we found that update duplicates approach a steady state even with exponential cluster scaling. In this experiment, the monitors share each map update with a single random OSD, which then shares it with its peers. In Figure 3 we vary the cluster size x and the number of PGs on each OSD (which corresponds to the number of peers it has) and measure the number of duplicate map updates received for every new one (y). Update duplication approaches a constant level, less than 20% of μ, even as the cluster size scales exponentially, implying a fixed map distribution overhead.

We consider a worst-case scenario in which the only OSD chatter is pings for failure detection, which means that, generally speaking, OSDs learn about map updates (and the changes known by their peers) as slowly as possible. Limiting map distribution overhead thus relies only on throttling the map update frequency, which the monitor cluster already does as a matter of course.

6 Future Work

6.1 Key-value Storage

6.2 Scalable FIFO Queue

6.3 Granular Snapshot

6.4 Quality of Service

6.5 Redundancy Based on Parity

7 References

[1] Ceph Learning: RADOS Paper
http://www.yidianzixun.com/news_10373a5b3e9591ba337dc85059854d90
[2] RADOS paper
http://blog.csdn.net/agony000/article/details/22697283
[3] RADOS: A scalable, highly available, petabyte storage cluster (Ceph)
http://blog.csdn.net/user_friendly/article/details/9768577
[4] Weil S A, Leung A W, Brandt S A, et al. RADOS: A scalable, reliable storage service for petabyte-scale storage clusters[C]. International Workshop on Petascale Data Storage: held in conjunction with Supercomputing. ACM, 2007: 35-44.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.