[Distributed File System] An Introduction to Ceph Principles

Ceph began as a PhD research project on storage systems by Sage Weil at the University of California, Santa Cruz (UCSC). Since late March 2010, Ceph has been part of the mainline Linux kernel (starting with version 2.6.34). Although Ceph may not yet be ready for production environments, it is useful for testing. This article explores the Ceph file system and the unique features that make it an attractive alternative for scalable distributed storage.

Ceph Goals

Why choose "Ceph"?

"Ceph" is a strange name for a filesystem, breaking the classic acronym trend that most people follow. This name is related to the mascot of UCSC (Ceph's birthplace), the mascot is "Sammy", a banana-colored slug, a mollusk with no shell in its head. These are many tentacles of the head-foot animals, providing a distributed file system of the most figurative metaphor.

Developing a distributed file system requires a great deal of effort, but it is invaluable if the problem is solved well. Ceph's goals can be stated simply:

Easy scalability to multi-petabyte capacity

High performance across varying workloads (input/output operations per second [IOPS] and bandwidth)

High reliability

Unfortunately, these goals compete with one another (for example, scalability can reduce or inhibit performance, or affect reliability). Ceph has developed some very interesting concepts (for example, dynamic metadata partitioning, data distribution, and replication), which this article discusses briefly. Ceph's design also incorporates fault tolerance to protect against single points of failure, on the assumption that at large scale (petabytes of storage) failures are the norm rather than the exception. Finally, its design does not assume any particular workload but includes the ability to adapt to changing workloads and deliver the best possible performance. Ceph does all of this while remaining POSIX-compatible, so it can be deployed transparently for applications that currently rely on POSIX semantics (through Ceph-targeted improvements). Finally, Ceph is open source distributed storage and part of the mainline Linux kernel (2.6.34).

Ceph Architecture

Now let's explore Ceph's architecture and its core elements at a high level. Then I'll go down a level to examine some of the key aspects of Ceph in more detail.

The Ceph ecosystem can be roughly divided into four parts (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and performs other key functions), and finally the cluster monitors (which perform monitoring functions).

Figure 1. Conceptual view of the Ceph ecosystem

As Figure 1 shows, clients use the metadata servers to perform metadata operations (to determine where data is located). The metadata servers manage the location of data and also where new data should be stored. Note that the metadata itself is stored in the object storage cluster (labeled "Metadata I/O"). Actual file I/O happens directly between the client and the object storage cluster. Thus, higher-level POSIX functions (such as open, close, and rename) are handled by the metadata servers, while POSIX functions such as read and write are served directly by the object storage cluster.
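
To make the split between the control path and the data path concrete, here is a minimal Python sketch of the idea. It is not Ceph's real protocol or API; the MetadataService, ObjectStore, and Client classes and their methods are hypothetical stand-ins used purely for illustration.

```python
# Conceptual sketch (not Ceph's real protocol): a client performs metadata
# operations against a metadata service, then does data I/O directly
# against the object store.

class MetadataService:
    """Stands in for the Ceph metadata server (MDS) cluster."""
    def open(self, path):
        # Return an inode number and a striping layout for the file.
        return {"ino": hash(path) & 0xFFFFFFFF, "object_size": 4 * 2**20}

class ObjectStore:
    """Stands in for the object storage cluster (OSDs)."""
    def __init__(self):
        self.objects = {}
    def write(self, oid, offset, data):
        buf = bytearray(self.objects.get(oid, b""))
        buf[offset:offset + len(data)] = data
        self.objects[oid] = bytes(buf)

class Client:
    """Opens files through the MDS, then sends data straight to the OSDs."""
    def __init__(self, mds, osds):
        self.mds, self.osds = mds, osds
    def write(self, path, offset, data):
        meta = self.mds.open(path)                   # metadata path
        ono = offset // meta["object_size"]          # which object of the file
        oid = (meta["ino"], ono)
        self.osds.write(oid, offset % meta["object_size"], data)  # data path

client = Client(MetadataService(), ObjectStore())
client.write("/docs/report.txt", 0, b"hello ceph")
```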

Figure 2 offers another view of the architecture. A set of servers access the Ceph ecosystem through a client interface that understands the relationship between the metadata servers and the object-level storage. The distributed storage system can be viewed as several layers, including a storage device format (the Extent and B-tree-based Object File System [EBOFS] or an alternative) and an overlying management layer that handles data replication, failure detection, recovery, and subsequent data migration, called the Reliable Autonomic Distributed Object Store (RADOS). Finally, monitors identify component failures and issue the corresponding notifications.

Figure 2. A simplified layered view of the Ceph ecosystem

Ceph components

Once you understand Ceph's conceptual architecture, you can dig down a level to the major components implemented in Ceph. One important difference between Ceph and traditional file systems is that the intelligence is spread across the ecosystem rather than concentrated in the file system itself.

Figure 3 shows a simple Ceph ecosystem. The Ceph client is the user of the Ceph file system. The Ceph metadata daemon provides the metadata service, while the Ceph object storage daemon provides the actual storage (of both data and metadata). Finally, the Ceph monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, and numerous metadata servers (depending on the capacity of the file system), plus at least a redundant pair of monitors. So how is this file system distributed?

Figure 3. A simple Ceph ecosystem

Ceph Client

Kernel or user space

Early versions of Ceph used Filesystems in Userspace (FUSE), which pushes the file system into user space and greatly simplifies development. Today, however, Ceph has been integrated into the mainline kernel, which makes it faster because file system I/O no longer requires user-space context switches.

Because Linux presents a common interface to file systems (through the virtual file system switch [VFS]), Ceph is transparent from the user's perspective. The administrator's perspective is certainly different, given that many servers may make up the storage system (see the Resources section for information on building a Ceph cluster). From the user's point of view, they have access to a large storage system without needing to know about the metadata servers, monitors, and individual object storage devices aggregated into a massive storage pool underneath. Users simply see a mount point at which standard file I/O can be performed.
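
Because the cluster is hidden behind an ordinary mount point, applications use nothing but standard POSIX I/O. A minimal Python sketch, assuming an administrator has already mounted Ceph at the hypothetical mount point /mnt/ceph:

```python
# Ordinary file I/O against a mounted Ceph file system; nothing Ceph-specific
# is visible to the application. The mount point below is an assumption.
import os

mount_point = "/mnt/ceph"                      # assumed Ceph mount point
path = os.path.join(mount_point, "example.txt")

with open(path, "w") as f:
    f.write("stored transparently on the Ceph cluster\n")

with open(path) as f:
    print(f.read())

print(os.stat(path).st_size)                   # standard POSIX metadata call
```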

The Ceph file system, or at least the client interface, is implemented in the Linux kernel. Note that in most file systems, all of the control and intelligence lives in the kernel's file system source itself. With Ceph, however, the file system's intelligence is distributed across the nodes, which both simplifies the client interface and gives Ceph the ability to scale massively (even dynamically).

Instead of relying on allocation lists (metadata that maps the blocks on a disk to a given file), Ceph uses an interesting alternative. A file, from the Linux perspective, is assigned an inode number (INO) by the metadata server, which is a unique identifier for the file. The file is then carved into a number of objects (depending on its size). Using the INO and an object number (ONO), each object is assigned an object ID (OID). A simple hash over the OID assigns each object to a placement group. The placement group (identified by a PGID) is a conceptual container for objects. Finally, the mapping of placement groups to object storage devices is a pseudo-random mapping that uses an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, the mapping of placement groups (and their replicas) to storage devices does not rely on any metadata, only on a pseudo-random mapping function. This behavior is ideal because it minimizes storage overhead and simplifies the distribution and lookup of data.

The final component of the allocation scheme is the cluster map. The cluster map is an efficient representation of the devices that make up the storage cluster. With a PGID and the cluster map, you can locate any object.
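
The addressing chain described above (file to objects, objects to a placement group, placement group to storage devices) can be sketched in a few lines of Python. The hash and the CRUSH stand-in below are illustrative only; Ceph's real CRUSH algorithm walks a weighted, hierarchical cluster map and is considerably more sophisticated.

```python
# Toy illustration of the file -> object -> placement group -> OSD chain.
import hashlib

NUM_PLACEMENT_GROUPS = 128
CLUSTER_MAP = ["osd0", "osd1", "osd2", "osd3", "osd4", "osd5"]  # hypothetical OSDs
REPLICAS = 3

def object_id(ino: int, ono: int) -> str:
    """OID is derived from the file's inode number and the object number."""
    return f"{ino:x}.{ono:08x}"

def placement_group(oid: str) -> int:
    """A simple hash of the OID selects a placement group (PGID)."""
    digest = hashlib.sha1(oid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PLACEMENT_GROUPS

def crush_like_mapping(pgid: int) -> list:
    """Deterministic, pseudo-random choice of OSDs for a PGID
    (a toy stand-in for CRUSH, not the real algorithm)."""
    start = pgid % len(CLUSTER_MAP)
    return [CLUSTER_MAP[(start + i) % len(CLUSTER_MAP)] for i in range(REPLICAS)]

oid = object_id(ino=0x10000000000, ono=0)
pgid = placement_group(oid)
print(oid, pgid, crush_like_mapping(pgid))
```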

Ceph Metadata Server

The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are stored in the object storage cluster, they are managed separately to support scalability. In fact, metadata is further split across a cluster of metadata servers, which can adaptively replicate and distribute the namespace to avoid hot spots. As shown in Figure 4, metadata servers manage portions of the namespace, and those portions can overlap (for redundancy and also for performance). The mapping of metadata servers to the namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespace between metadata servers) while preserving locality for performance.

Figure 4. Partitioning of the Ceph namespace across metadata servers

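To illustrate the idea of dynamic subtree partitioning, here is a toy Python sketch in which a hot subtree is delegated to the least loaded metadata server. Ceph's real MDS balancer uses popularity counters and online subtree migration; the classes, thresholds, and data structures below are invented for illustration.

```python
# Toy dynamic subtree partitioning: hot subtrees move to less loaded MDS ranks.
from collections import defaultdict

class MdsCluster:
    def __init__(self, num_mds):
        self.assignment = {"/": 0}          # subtree -> MDS rank
        self.load = defaultdict(int)        # MDS rank -> request count
        self.num_mds = num_mds

    def authoritative_mds(self, path):
        """Find the deepest assigned subtree that contains this path."""
        best = "/"
        for subtree in self.assignment:
            if path.startswith(subtree) and len(subtree) > len(best):
                best = subtree
        return self.assignment[best]

    def record_request(self, path, hot_threshold=1000):
        rank = self.authoritative_mds(path)
        self.load[rank] += 1
        # If one MDS becomes hot, delegate a subtree to the least loaded MDS.
        if self.load[rank] > hot_threshold:
            target = min(range(self.num_mds), key=lambda r: self.load[r])
            subtree = "/" + path.strip("/").split("/")[0]
            self.assignment[subtree] = target
            self.load[rank] = 0

cluster = MdsCluster(num_mds=3)
for _ in range(1500):
    cluster.record_request("/home/alice/data")
print(cluster.assignment)   # the /home subtree has been delegated to another rank
```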

But because the metadata servers simply manage the namespace for the client population, their primary role is that of an intelligent metadata cache (the actual metadata is ultimately stored in the object storage cluster). Metadata to be written is cached in a short-term journal, which is eventually pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if a metadata server fails, its journal can be replayed to ensure that the metadata is safely stored on disk.
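
A toy sketch of the journaling behavior just described: metadata updates are appended to a short-term log so they can be acknowledged quickly, flushed to long-term storage later, and replayed after a failure. This is not Ceph's journal format; the names and structures below are invented for illustration.

```python
# Toy metadata journal: append-only log, flush to a backing store, replay on recovery.
import json

class MetadataJournal:
    def __init__(self):
        self.entries = []      # stand-in for a short-term log segment
        self.store = {}        # stand-in for metadata kept in the object store

    def record(self, op, path, **attrs):
        entry = {"op": op, "path": path, **attrs}
        self.entries.append(json.dumps(entry))   # cheap, append-only write
        return entry

    def flush(self):
        """Push journaled updates into long-term storage, then trim the log."""
        for raw in self.entries:
            e = json.loads(raw)
            if e["op"] == "setattr":
                self.store.setdefault(e["path"], {}).update(
                    {k: v for k, v in e.items() if k not in ("op", "path")})
        self.entries.clear()

    def replay(self):
        """On MDS failure, a recovering MDS replays the surviving journal."""
        self.flush()

journal = MetadataJournal()
journal.record("setattr", "/home/alice/report.txt", size=4096)
journal.replay()
print(journal.store)
```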

The metadata servers manage the inode space, converting file names into metadata. A metadata server turns a file name into an inode, a file size, and the striping layout that the Ceph client uses for file I/O.

Ceph Monitor

Ceph includes monitors that manage the cluster map, but some elements of fault management are carried out in the object store itself. When object storage devices fail or new devices are added, the monitors detect this and maintain a valid cluster map. This function is performed in a distributed fashion in which map updates are communicated along with existing traffic. Ceph uses Paxos, a family of distributed consensus algorithms.
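
The essential idea, stripped of the actual Paxos machinery, is that a cluster-map update only takes effect once a majority of the monitors agree on it. A toy majority check (not an implementation of Paxos, which Ceph's monitors actually use) might look like this:

```python
# Toy quorum rule for cluster-map updates; illustrative only, not Paxos.

def map_update_committed(acks: int, num_monitors: int) -> bool:
    """An update is durable once a strict majority of monitors acknowledge it."""
    return acks > num_monitors // 2

print(map_update_committed(acks=2, num_monitors=3))   # True: quorum reached
print(map_update_committed(acks=1, num_monitors=3))   # False: no quorum
```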

Ceph Object Storage

Like traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. Object storage devices, by contrast, are intelligent devices that act as both targets and initiators, able to communicate and collaborate with other object storage devices.

From a storage standpoint, a Ceph object storage device performs the mapping of objects to blocks (a task traditionally done at the client's file system layer). This lets the local entity decide how best to store an object. Early versions of Ceph implemented a custom low-level file system on the local storage called EBOFS. It provided a non-standard interface to the underlying storage, tuned for object semantics and other features such as asynchronous notification of commits to disk. Today, the B-tree file system (Btrfs) can be used at the storage nodes; it already implements some of the necessary features (such as embedded integrity).
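
A minimal sketch of the local object-to-block mapping described above. Real OSD back ends (EBOFS historically, Btrfs here) use extents and B-trees; this toy version uses a flat free-block list and is purely illustrative.

```python
# Toy local object store: allocate blocks for an object and remember the mapping.
BLOCK_SIZE = 4096

class LocalObjectStore:
    def __init__(self, num_blocks=1024):
        self.free = list(range(num_blocks))   # free block numbers
        self.extent_map = {}                  # oid -> list of block numbers

    def write_object(self, oid: str, data: bytes):
        """Allocate enough local blocks for the object and record the mapping."""
        needed = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        blocks = [self.free.pop(0) for _ in range(needed)]
        self.extent_map[oid] = blocks
        return blocks

store = LocalObjectStore()
print(store.write_object("10000000000.00000000", b"x" * 10000))  # -> 3 blocks
```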

Because Ceph clients implement CRUSH and know nothing about the block mapping of files on disk, the underlying storage devices can safely manage the object-to-block mapping. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing failure recovery also allows the storage system to scale, because failure detection and recovery are distributed across the ecosystem. Ceph calls this RADOS.

Other interesting features

As if the dynamic and adaptive nature of the file system were not enough, Ceph also implements some interesting features that are visible to the user. For example, users can create a snapshot of any subdirectory in Ceph (including all of its contents). File and capacity accounting can also be performed at the subdirectory level, reporting the storage size and number of files for a given subdirectory (and everything it contains).
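
On a mounted Ceph file system these features surface through ordinary file system calls. CephFS exposes snapshots through a hidden .snap directory and recursive statistics through virtual extended attributes such as ceph.dir.rbytes and ceph.dir.rfiles; exact availability depends on the cluster and client version, and the mount point /mnt/ceph/projects below is an assumption, so treat this as illustrative.

```python
# Snapshot a subdirectory and query recursive size/file counts on a mounted
# CephFS. Paths are assumptions; attribute support depends on the cluster.
import os

subdir = "/mnt/ceph/projects"                      # assumed existing directory

# Create a snapshot of the subdirectory (and everything beneath it).
os.mkdir(os.path.join(subdir, ".snap", "before-cleanup"))

# Ask the file system for recursive size and file counts of the subtree.
rbytes = int(os.getxattr(subdir, "ceph.dir.rbytes"))
rfiles = int(os.getxattr(subdir, "ceph.dir.rfiles"))
print(f"{subdir}: {rbytes} bytes in {rfiles} files (recursive)")
```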

The status and future of Ceph

Although Ceph is now integrated into the mainline Linux kernel, it is marked there as experimental. File systems in this state are useful for testing but are not ready for production environments. But given that Ceph has joined the Linux kernel and that its founders intend to continue developing it, it should soon be ready to serve your massive storage needs.

Other distributed file systems

Ceph is not unique in the distributed file system space, but it is unique in the way it manages a large-capacity storage ecosystem. Other examples of distributed file systems include the Google File System (GFS), the General Parallel File System (GPFS), and Lustre, to name just a few. The ideas behind Ceph offer an interesting future for distributed file systems, because massive scale introduces unique challenges to the storage problem.
