Ceph: A Linux Petabyte-Scale Distributed File System



As an architect in the storage industry, I have an affinity for file systems. These systems are the user interfaces to storage systems, and although they all tend to offer a similar set of features, they can also provide notably different ones. Ceph is no exception, and it offers some of the most interesting features you will find in a file system.



Ceph began as a PhD research project in storage systems by Sage Weil at the University of California, Santa Cruz (UCSC). As of late March 2010, you can find Ceph in the mainline Linux kernel (starting with 2.6.34). Although Ceph may not yet be ready for production environments, it is still useful for evaluation purposes. This article explores the Ceph file system and the unique features that make it an attractive alternative for scalable distributed storage.



Ceph goals





Why "Ceph"?


"CEpH" is a strange name for a file system, breaking the typical abbreviated trend that most people follow. This name is related to the mascot of UCSC (the birthplace of CEpH), which is "Sammy", a banana-colored animal that contains no shells in the head and foot. These toutiao animals with multiple tentacles provide the most vivid metaphor for a distributed file system.



Developing a distributed file system takes considerable effort, but it is immensely valuable if the problem is solved correctly. Ceph's goals can be simply defined as:


    • Easy scalability to petabytes of capacity
    • High performance across varying workloads (input/output operations per second [IOPS] and bandwidth)
    • Strong reliability


Unfortunately, these goals can compete with one another (for example, scalability can reduce or inhibit performance, or affect reliability). Ceph has developed some very interesting concepts (such as dynamic metadata partitioning and data distribution and replication), which are discussed only briefly in this article. Ceph's design also incorporates fault tolerance to protect against single points of failure, with the assumption that at large scale (petabyte-scale storage) failures are the norm rather than the exception. Finally, its design does not assume particular workloads but includes the ability to adapt to changing workloads and provide the best performance. It accomplishes all of this with POSIX compatibility, allowing it to be transparently deployed for existing applications that rely on POSIX semantics (through Ceph-targeted enhancements). Finally, Ceph is open source distributed storage and part of the mainline Linux kernel (2.6.34).






Ceph Architecture



Now, let's explore the Ceph architecture and its core elements at a high level. Then I'll dig down another level to identify some of the key aspects of Ceph and provide a more detailed discussion.



The Ceph ecosystem can be broadly divided into four parts (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and implements other key responsibilities), and finally the cluster monitors (which implement monitoring).




Figure 1. Ceph ecosystem conceptual architecture



As shown in Figure 1, clients perform metadata operations (to identify the location of data) using the metadata servers. The metadata servers manage the location of data and also where to store new data. Note that metadata is itself stored in the storage cluster (as indicated by "metadata I/O"). Actual file I/O occurs between the client and the object storage cluster. In this way, higher-level POSIX functions (such as open, close, and rename) are managed through the metadata servers, whereas POSIX functions (such as read and write) are managed directly through the object storage cluster.
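To make this division of labor concrete, the following minimal Python sketch mimics the flow just described. The MetadataServer and ObjectStore classes and their methods are hypothetical stand-ins for illustration only, not the actual Ceph interfaces: path-level operations (open, rename) go to the metadata service, while reads and writes go straight to object storage.

# Hypothetical sketch of the client-side division of labor described above:
# metadata operations go to the metadata server, file I/O goes directly to
# the object storage cluster. Names and interfaces are illustrative only.

class MetadataServer:
    """Stands in for the metadata service: resolves names to file identity."""
    def __init__(self):
        self.namespace = {}          # path -> (inode number, size)
        self.next_ino = 1

    def open(self, path):
        if path not in self.namespace:
            self.namespace[path] = (self.next_ino, 0)
            self.next_ino += 1
        return self.namespace[path]  # (ino, size): enough to reach the objects

    def rename(self, old, new):
        self.namespace[new] = self.namespace.pop(old)


class ObjectStore:
    """Stands in for the object storage cluster: stores opaque objects by ID."""
    def __init__(self):
        self.objects = {}

    def write(self, oid, data):
        self.objects[oid] = data

    def read(self, oid):
        return self.objects[oid]


mds, osd_cluster = MetadataServer(), ObjectStore()

# POSIX-style open/rename are metadata operations (handled by the metadata server).
ino, size = mds.open("/home/user/report.txt")

# Reads and writes bypass the metadata server and talk to object storage directly.
osd_cluster.write(oid=(ino, 0), data=b"hello ceph")
print(osd_cluster.read(oid=(ino, 0)))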



Another view of the architecture is shown in Figure 2. A set of servers accesses the Ceph ecosystem through a client interface, which understands the relationship between the metadata servers and object-level storage. The distributed storage system can be viewed as a few layers, including a format for the storage devices (the Extent and B-tree-based Object File System [EBOFS] or an alternative) and an overriding management layer, called Reliable Autonomic Distributed Object Storage (RADOS), designed to manage data replication, failure detection, recovery, and subsequent data migration. Finally, monitors are used to identify component failures, including subsequent notification.




Figure 2. Simplified hierarchical view of the Ceph ecosystem






Ceph components



With the Ceph conceptual architecture in hand, you can dig down another layer to see the major components implemented within Ceph. One of the key differences between Ceph and traditional file systems is that, rather than concentrating intelligence in the file system itself, it distributes the intelligence across the ecosystem.



Figure 3 shows a simple Ceph ecosystem. The Ceph client is the user of the Ceph file system. The Ceph metadata daemon provides the metadata service, while the Ceph object storage daemon provides the actual storage (for both data and metadata). Finally, the Ceph monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, and numerous metadata servers (depending on the capacity of the file system), plus at least a redundant pair of monitors. So, how is this file system distributed?




Figure 3. Simple Ceph ecosystem



Ceph Client





Kernel or user space


Earlier versions of Ceph used Filesystems in Userspace (FUSE), which pushed the file system into user space and greatly simplified development. Today, however, Ceph has been integrated into the mainline kernel, making it faster, because user-space context switches are no longer necessary for file system I/O.



Because Linux presents a common interface to file systems (through the virtual file system switch [VFS]), the user's view of Ceph is transparent. The administrator's view is certainly different, considering that many servers may make up the storage system (see the References section for information on creating a Ceph cluster). From the user's perspective, they access a large-capacity storage system without knowing about the metadata servers, monitors, and individual object storage devices that aggregate into a massive storage pool. Users simply see a mount point, at which standard file I/O can be performed.



The Ceph file system, or at least the client interface, is implemented in the Linux kernel. Note that in the vast majority of file systems, all of the control and intelligence resides in the kernel's file system code itself. With Ceph, however, the file system's intelligence is distributed across the nodes, which simplifies the client interface and provides Ceph with massive (even dynamic) scaling capability.



Rather than relying on allocation lists (metadata that maps blocks on a disk to a given file), Ceph uses an interesting alternative. From the Linux perspective, a file is assigned an inode number (INO) from the metadata server, which is a unique identifier for the file. The file is then carved into some number of objects (based on the size of the file). Using the INO and an object number (ONO), each object is assigned an object ID (OID). Using a simple hash over the OID, each object is assigned to a placement group. The placement group (identified by a PGID) is a conceptual container for objects. Finally, the mapping of placement groups to object storage devices is a pseudo-random mapping using an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, the mapping of placement groups (and replicas) to storage devices does not rely on any metadata but instead on a pseudo-random mapping function. This behavior is ideal, because it minimizes the overhead of storage and simplifies the distribution and lookup of data.
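The pipeline just described (INO plus object number yields an OID, a hash over the OID yields a placement group, and the placement group maps pseudo-randomly onto devices) can be sketched as follows. This is an illustrative Python approximation under assumed parameters (object size, placement-group count, device names); in particular, the device-selection step is a simple stand-in hash, not the real CRUSH algorithm, which walks a weighted, hierarchical cluster map.

# Illustrative sketch of the placement pipeline described above:
# (INO, object number) -> OID -> hash -> placement group -> devices.
# The device-selection step is a simple stand-in hash, NOT the real CRUSH
# algorithm, which walks a hierarchical, weighted cluster map.
import hashlib

NUM_PLACEMENT_GROUPS = 128
CLUSTER_MAP = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
REPLICAS = 3
OBJECT_SIZE = 4 * 1024 * 1024      # carve files into 4 MiB objects (assumed size)

def objects_for_file(ino, file_size):
    """A file is carved into objects; each gets an OID from (INO, ONO)."""
    count = max(1, -(-file_size // OBJECT_SIZE))       # ceiling division
    return [f"{ino:x}.{ono:08x}" for ono in range(count)]

def pgid_for_object(oid):
    """A simple hash over the OID assigns the object to a placement group."""
    h = int(hashlib.sha1(oid.encode()).hexdigest(), 16)
    return h % NUM_PLACEMENT_GROUPS

def devices_for_pg(pgid, cluster_map, replicas=REPLICAS):
    """Pseudo-random, metadata-free mapping of a placement group to devices.
    Deterministic given (pgid, cluster_map), so any client can compute it."""
    ranked = sorted(
        cluster_map,
        key=lambda dev: hashlib.sha1(f"{pgid}:{dev}".encode()).hexdigest(),
    )
    return ranked[:replicas]

# Locate every object of a 10 MiB file with inode number 0x1234.
for oid in objects_for_file(ino=0x1234, file_size=10 * 1024 * 1024):
    pgid = pgid_for_object(oid)
    print(oid, "-> pg", pgid, "->", devices_for_pg(pgid, CLUSTER_MAP))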



The final component of the allocation is the cluster map. The cluster map is an efficient representation of the devices that make up the storage cluster. With a PGID and the cluster map, you can locate any object.



Ceph metadata server



The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are stored in the object storage cluster, the two are managed separately to support scalability. In fact, metadata is further split across a cluster of metadata servers, which can adaptively replicate and distribute the namespace to avoid hot spots. As shown in Figure 4, the metadata servers manage portions of the namespace, and these portions can overlap (for redundancy and also for performance). The mapping of metadata servers to namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespaces between metadata servers) while preserving locality for performance; a toy sketch of this idea follows Figure 4.




Figure 4. Partitioning of the Ceph namespace across metadata servers
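As a conceptual illustration of subtree partitioning (and not Ceph's actual algorithm), the toy Python sketch below delegates a directory subtree to the least-loaded metadata server once it becomes hot; all names and thresholds are invented for the example.

# Toy illustration of load-driven subtree delegation across metadata servers.
# This is a conceptual sketch, not Ceph's dynamic subtree partitioning code.
from collections import defaultdict

class MDSCluster:
    def __init__(self, num_servers, hot_threshold=1000):
        self.num_servers = num_servers
        self.hot_threshold = hot_threshold
        self.authority = {"/": 0}                 # subtree prefix -> owning MDS
        self.load = defaultdict(int)              # subtree prefix -> request count

    def owner(self, path):
        """The MDS responsible for a path owns its longest registered prefix."""
        best = max((p for p in self.authority if path.startswith(p)), key=len)
        return best, self.authority[best]

    def record_request(self, path):
        subtree, mds = self.owner(path)
        self.load[subtree] += 1
        # If a subtree becomes hot, delegate the requested child directory
        # to the least-loaded server so the namespace rebalances dynamically.
        if self.load[subtree] > self.hot_threshold:
            child = "/".join(path.split("/")[:3]) or "/"
            if child not in self.authority:
                per_mds = defaultdict(int)
                for p, m in self.authority.items():
                    per_mds[m] += self.load[p]
                target = min(range(self.num_servers), key=lambda m: per_mds[m])
                self.authority[child] = target
                self.load[subtree] = 0
        return mds

cluster = MDSCluster(num_servers=4)
for i in range(5000):
    cluster.record_request(f"/home/alice/project/file{i}")
print(cluster.authority)   # the hot /home/alice subtree is now owned by another MDS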



However, because each metadata server simply manages the namespace for a population of clients, its primary application is as an intelligent metadata cache (because the actual metadata is ultimately stored within the object storage cluster). Metadata for write operations is cached in a short-term journal, which is eventually pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if a metadata server fails, its journal can be replayed to ensure that metadata is safely stored on disk.
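The journaling idea can be sketched briefly: append each metadata update to a durable log before applying it, and replay the log on restart. The following Python sketch is purely illustrative and makes no attempt to mirror Ceph's actual journal format.

# Minimal sketch of the journaling idea described above: metadata updates are
# appended to a log before being applied, so a restarted metadata server can
# replay the log and recover its state. Purely illustrative, not Ceph's journal.
import json, os, tempfile

class JournaledMetadata:
    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.table = {}                      # in-memory metadata (name -> attrs)
        self._replay()                       # recover anything a crash left behind

    def _replay(self):
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as f:
            for line in f:
                self._apply(json.loads(line))

    def _apply(self, op):
        if op["type"] == "set":
            self.table[op["name"]] = op["attrs"]
        elif op["type"] == "unlink":
            self.table.pop(op["name"], None)

    def update(self, op):
        # 1) journal the operation durably, 2) apply it to the in-memory cache.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(op)

journal = os.path.join(tempfile.gettempdir(), "mds_journal.log")
mds = JournaledMetadata(journal)
mds.update({"type": "set", "name": "/docs/a.txt", "attrs": {"ino": 42, "size": 0}})

# Simulate a restart: a new instance replays the journal and sees the same state.
recovered = JournaledMetadata(journal)
print(recovered.table)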



The metadata servers manage the inode space, converting file names into metadata. A metadata server transforms a file name into an inode, a file size, and the striped data layout that the Ceph client uses for file I/O.



Ceph Monitor



Ceph includes monitors that implement management of the cluster map, but some elements of fault management are implemented in the object storage service itself. When object storage devices fail or new devices are added, the monitors detect the change and maintain a valid cluster map. This function is performed in a distributed fashion, in which map updates are communicated along with existing traffic. Ceph uses Paxos, a family of distributed consensus algorithms, for this purpose.
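To illustrate the kind of agreement Paxos provides, here is a compact Python sketch of single-decree Paxos: a value (think of it as the next cluster map) is chosen once a majority of acceptors accepts it. This is a teaching sketch of the core promise/accept rule only; the monitors' actual Paxos implementation is considerably more involved (leases, multiple decrees, membership changes, and so on).

# Compact sketch of the majority-based agreement at the heart of Paxos
# (single decree): a value -- think "the next cluster map" -- is chosen once
# a majority of acceptors accepts it. Illustrative only.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted = None        # (proposal number, value) accepted so far

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted       # promise, and report any prior accept
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Classic two-phase proposal: prepare with a majority, then accept."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises; adopt any value already accepted at the
    # highest proposal number, so an earlier choice is never overturned.
    promises = [a.prepare(n) for a in acceptors]
    granted = [prior for ok, prior in promises if ok]
    if len(granted) < majority:
        return None
    prior_accepts = [p for p in granted if p is not None]
    if prior_accepts:
        value = max(prior_accepts, key=lambda p: p[0])[1]

    # Phase 2: ask acceptors to accept; the value is chosen with a majority.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


monitors = [Acceptor() for _ in range(5)]
chosen = propose(monitors, n=1, value={"epoch": 42, "osds": ["osd.0", "osd.1"]})
print("agreed on cluster map:", chosen)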



Ceph Object Storage



Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. Object storage devices, however, are intelligent devices that act as both targets and initiators, supporting communication and collaboration with other object storage devices.



From a storage perspective, Ceph object storage devices perform the mapping of objects to blocks (a task traditionally done at the file system layer in the client). This behavior allows the local entity to decide how best to store an object. Earlier versions of Ceph implemented a custom low-level file system on the storage nodes called EBOFS. This system implemented a nonstandard interface to the underlying storage, tuned for object semantics and other features (such as asynchronous notification of commits to disk). Today, the B-tree file system (Btrfs) can be used at the storage nodes, as it already implements some of the necessary features (such as embedded integrity).



Because the Ceph clients implement CRUSH and have no knowledge of the block mapping of files on the disks, the underlying storage devices can safely manage the mapping of objects to blocks. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing failure recovery also allows the storage system to scale, because failure detection and recovery are distributed across the ecosystem. Ceph calls this RADOS (see Figure 3).
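Because placement is a deterministic function of the placement group and the cluster map, recovery can be sketched as a pure recomputation: publish a map without the failed device, recompute the target devices for each placement group, and copy whatever moved. The Python sketch below reuses a stand-in hash (again, not CRUSH) to show which placement groups are remapped when a device is marked out.

# Sketch of map-driven recovery: placement is a pure function of (pgid, map),
# so when the monitors publish a map without the failed device, every node can
# recompute target devices and copy objects accordingly. Stand-in hash, not CRUSH.
import hashlib

def devices_for_pg(pgid, cluster_map, replicas=3):
    ranked = sorted(cluster_map,
                    key=lambda d: hashlib.sha1(f"{pgid}:{d}".encode()).hexdigest())
    return ranked[:replicas]

old_map = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
new_map = [d for d in old_map if d != "osd.2"]        # osd.2 marked out

for pgid in range(8):
    before, after = devices_for_pg(pgid, old_map), devices_for_pg(pgid, new_map)
    if before != after:
        # Replicas that lived on the failed device are re-created on the newly
        # selected devices; unaffected placement groups do not move.
        print(f"pg {pgid}: {before} -> {after}")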






Other interesting features



As if the dynamic and adaptive nature of the file system weren't enough, Ceph also implements some interesting features visible to the user. For example, users can create snapshots on any subdirectory in Ceph (including all of its contents). File and capacity accounting are also possible at the subdirectory level, reporting the storage size and number of files for a given subdirectory (and all of its nested contents).
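For example, on a mounted CephFS file system, snapshots are commonly created by making a directory under the hidden .snap directory, and recursive file and byte counts are exposed as ceph.dir.* extended attributes. The Python sketch below assumes such a mount at the hypothetical path /mnt/ceph and that these conventions apply to your Ceph version; check your cluster's documentation before relying on them.

# Sketch of the per-subdirectory features described above, assuming a CephFS
# mount at the hypothetical path /mnt/ceph and a Ceph version that exposes
# snapshots via the hidden .snap directory and recursive statistics via
# ceph.dir.* extended attributes (verify against your cluster's documentation).
import os

subdir = "/mnt/ceph/projects/alpha"          # hypothetical subdirectory

# Create a snapshot of the subdirectory (and everything beneath it).
os.mkdir(os.path.join(subdir, ".snap", "before-cleanup"))

# Ask the file system how many files and bytes live under the subdirectory.
rfiles = os.getxattr(subdir, "ceph.dir.rfiles").decode()
rbytes = os.getxattr(subdir, "ceph.dir.rbytes").decode()
print(f"{subdir}: {rfiles} files, {rbytes} bytes (recursive)")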






Ceph's status and future



Although Ceph is now integrated into the mainline Linux kernel, it is properly marked as experimental. File systems in this state are useful for testing but are not ready for production environments. But given Ceph's addition to the Linux kernel and the motivation of its creators to continue its development, it should soon be ready to serve your massive storage needs.






Other Distributed File Systems



Ceph is not unique in the distributed file system space, but it is unique in the way it manages a large-capacity storage ecosystem. Other examples of distributed file systems include the Google File System (GFS), the General Parallel File System (GPFS), and Lustre. The ideas behind Ceph suggest an interesting future for distributed file systems, as massive scale introduces unique challenges to the massive storage problem.






Future Prospects



Ceph is not just a file system but an object storage ecosystem with enterprise-class features. In the References section, you will find information on how to set up a simple Ceph cluster (including metadata servers, object storage servers, and monitors). Ceph fills a gap in distributed storage, and it will be interesting to see how this open source offering evolves in the future.





References



Learning


  • The Ceph creators' paper, "Ceph: A Scalable, High-Performance Distributed File System" (PDF), and Sage Weil's PhD dissertation, "Ceph: Reliable, Scalable, and High-Performance Distributed Storage" (PDF), reveal the original ideas behind Ceph.


  • The petabyte-scale storage website of the Storage Systems Research Center provides additional technical information about Ceph.
  • Visit the Ceph home page for the latest information.
  • "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data" (PDF) and "RADOS: A Scalable, Reliable Storage Service for Petabyte-Scale Storage Clusters" (PDF) discuss two of the most interesting aspects of the Ceph file system.
  • "The Ceph filesystem" on LWN.net provides a review of the Ceph file system (including a set of interesting comments).
  • "Building a Small Ceph Cluster" describes how to build a Ceph cluster, along with tips for distributing its assets. The article helps you understand how to obtain Ceph, build a new kernel, and then deploy the various elements of the Ceph ecosystem.
  • On the Paxos Wikipedia page, learn more about how Ceph uses Paxos as a consensus protocol among multiple distributed entities.
  • Learn more about the VFS, the flexible mechanism in Linux that allows multiple file systems to coexist, in "Anatomy of the Linux virtual file system switch" (developerWorks, September 2009).
  • Learn more about exofs, another Linux file system that uses object storage, in "Next-generation Linux file systems: NiLFS(2) and exofs" (developerWorks, October 2009). Exofs maps object storage device-based storage onto a traditional Linux file system.
  • On the Btrfs kernel wiki site and in "Linux kernel development" (developerWorks, March 2008), you can learn how Btrfs can be used on an individual object storage node.
  • In the developerWorks Linux zone, find more resources for Linux developers (including those new to Linux), and check out our most popular articles and tutorials.
  • Read all Linux tips and Linux tutorials on developerWorks.
