Explore the Ceph file system and ecosystem
M. Tim Jones, freelance writer
Introduction: Linux® continues to expand into the scalable computing space, and scalable storage in particular. Ceph, a recent addition to the impressive lineup of file system alternatives in Linux, is a distributed file system that incorporates replication and fault tolerance while maintaining POSIX compatibility. Explore Ceph's architecture and learn how it provides fault tolerance and simplifies the management of massive amounts of data.
Release Date: June 12, 2010
Level: Intermediate
As an architect in the storage industry, I have an affinity for file systems. These systems are the user interface to a storage system, and although they all tend to offer a similar set of features, they can also differ notably in the capabilities they provide. Ceph is no exception, and it offers some of the most interesting features you will find in a file system.
Ceph began as a PhD research project in storage systems by Sage Weil at the University of California, Santa Cruz (UCSC). But as of late March 2010, you can find Ceph in the mainline Linux kernel (starting with version 2.6.34). Although Ceph may not yet be ready for production environments, it is already useful for evaluation purposes. This article explores the Ceph file system and the unique features that make it one of the most attractive alternatives for scalable distributed storage.
Ceph goals
Why choose "Ceph"?
"Ceph" is a strange name for a filesystem that breaks the typical abbreviation trend that most people follow. The name is related to the mascot of the UCSC (the birthplace of Ceph), the mascot is "Sammy", a banana-colored slug, a shell-free mollusk in the head-and-foot category. These multi-tentacles head-footed animals provide a most figurative metaphor for a distributed file system.
Developing a Distributed File system requires multiple efforts, but it can be invaluable if the problem is solved accurately. The objectives of Ceph are simply defined as:
- Easily scalable to petabytes of capacity
- High performance for multiple workloads (input/output operations per second [IOPS] and bandwidth)
- High reliability
Unfortunately, these goals can compete with one another (for example, scalability can reduce or inhibit performance, or affect reliability). Ceph has developed some very interesting concepts (such as dynamic metadata partitioning and data distribution and replication), which this article explores only briefly. Ceph's design also incorporates fault tolerance to protect against single points of failure, on the assumption that at large scale (petabytes of storage) failures are the norm rather than the exception. Its design does not assume any particular workload but includes the ability to adapt to changing workloads and deliver the best performance. It accomplishes all of this with POSIX compatibility, allowing transparent deployment of applications that currently rely on POSIX semantics (through Ceph-proposed enhancements). Finally, Ceph is open source distributed storage and part of the mainline Linux kernel (2.6.34).
Ceph Architecture
Now, let's explore Ceph's architecture and its core elements at a high level. Then I'll take it down another level to identify some of the key aspects of Ceph and provide a more detailed discussion.
The Ceph ecosystem can be broadly divided into four parts (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and performs other key functions), and, finally, the cluster monitors (which perform monitoring functions).
Figure 1. The conceptual architecture of the Ceph ecosystem
As Figure 1 shows, clients perform metadata operations (to determine where data is located) using the metadata servers. The metadata servers manage the location of data and also where to store new data. Note that the metadata itself is stored in the storage cluster (labeled "Metadata I/O"). Actual file I/O occurs between the client and the object storage cluster. In this way, higher-level POSIX functions (such as open, close, and rename) are managed through the metadata servers, whereas POSIX functions such as read and write are handled directly by the object storage cluster.
Figure 2 provides another view of the architecture. A set of servers access the Ceph ecosystem through a client interface, which understands the relationship between the metadata servers and the object-level storage. The distributed storage system can be viewed in a few layers, including a format for the storage devices (the Extent and B-tree-based Object File System [EBOFS] or an alternative) and an overarching management layer designed to handle data replication, failure detection, recovery, and subsequent data migration, called Reliable Autonomic Distributed Object Storage (RADOS). Finally, monitors are used to identify component failures and send out the corresponding notifications.
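To make the separation of the two I/O paths concrete, the short Python sketch below models a toy client that sends namespace operations to a metadata server and sends byte-level reads and writes straight to the object storage cluster. The class and method names (MetadataServer, ObjectStoreCluster, CephLikeClient) are invented for illustration and are not Ceph's actual interfaces; the point is only the routing of calls.
Listing 1. A toy model of the metadata and data paths
class MetadataServer:
    """Handles namespace operations and hands back file layout information."""
    def open(self, path):
        # In Ceph, the MDS resolves the name and returns inode/layout data;
        # here we simply fabricate a small descriptor.
        return {"ino": hash(path) & 0xFFFFFFFF, "object_size": 4 * 1024 * 1024}
    def rename(self, old, new):
        print(f"MDS: rename {old} -> {new}")

class ObjectStoreCluster:
    """Stores object data; reads and writes bypass the metadata server."""
    def __init__(self):
        self.objects = {}
    def write(self, oid, data):
        self.objects[oid] = data
    def read(self, oid):
        return self.objects.get(oid, b"")

class CephLikeClient:
    def __init__(self, mds, osc):
        self.mds, self.osc = mds, osc
    def write_file(self, path, data):
        layout = self.mds.open(path)         # metadata path (open/close/rename)
        oid = f"{layout['ino']:x}.00000000"  # first object of the file
        self.osc.write(oid, data)            # data path (read/write)
    def read_file(self, path):
        layout = self.mds.open(path)
        return self.osc.read(f"{layout['ino']:x}.00000000")

client = CephLikeClient(MetadataServer(), ObjectStoreCluster())
client.write_file("/home/alice/notes.txt", b"hello ceph")
print(client.read_file("/home/alice/notes.txt"))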
Figure 2. A streamlined layered view of the Ceph ecosystem
Ceph components
Once you understand the conceptual architecture of Ceph, you can dig down another level to see the major components implemented in Ceph. One of the important differences between Ceph and traditional file systems is that, rather than concentrating intelligence in the file system itself, the intelligence is distributed around the ecosystem.
Figure 3 shows a simple Ceph ecosystem. The Ceph client is the user of the Ceph file system. The Ceph metadata daemon provides the metadata service, while the Ceph object storage daemon provides the actual storage (for both data and metadata). Finally, the Ceph monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, and numerous metadata servers (depending on the capacity of the file system), along with at least a redundant pair of monitors. So, how is this file system distributed?
Figure 3. A simple Ceph ecosystem
Ceph Client
Kernel or user space
Earlier versions of Ceph relied on Filesystems in Userspace (FUSE), which pushes the file system into user space and can greatly simplify its development. Today, however, Ceph has been integrated into the mainline kernel, making it faster, because user-space context switches are no longer needed for file system I/O.
Because Linux presents a common interface to file systems (via the virtual file system switch [VFS]), Ceph is transparent from the user's perspective. The administrator's perspective is certainly different, given that many servers potentially make up the storage system (see the Resources section for more information on creating a Ceph cluster). From the user's point of view, they access a large-capacity storage system without knowing about the metadata servers, monitors, and individual object storage devices that are aggregated beneath it into a massive storage pool. Users simply see a mount point at which standard file I/O can be performed.
The Ceph file system, or at least the client interface, is implemented in the Linux kernel. Note that in the vast majority of file systems, all of the control and intelligence resides in the kernel's file system source itself. With Ceph, however, the file system's intelligence is distributed across the nodes, which simplifies the client interface and gives Ceph the ability to scale massively (even dynamically).
Instead of relying on allocation lists (metadata that maps the blocks on a disk to a given file), Ceph uses an interesting alternative. A file, from the Linux perspective, is assigned an inode number (INO) by the metadata server, which is a unique identifier for the file. The file is then carved into some number of objects (depending on the size of the file). Using the INO and an object number (ONO), each object is assigned an object ID (OID). A simple hash over the OID assigns each object to a placement group. The placement group (identified by a PGID) is a conceptual container for objects. Finally, the mapping of placement groups to object storage devices is a pseudo-random mapping using an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, the mapping of placement groups (and their replicas) to storage devices does not rely on any metadata but instead on a pseudo-random mapping function. This is ideal, because it minimizes the overhead of storage and simplifies the distribution and lookup of data.
The final component of the assignment is the cluster map. The cluster map is an efficient representation of the devices that make up the storage cluster. With a PGID and the cluster map, you can locate any object.
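The whole assignment pipeline fits in a few lines of code. The sketch below (a minimal illustration, not Ceph's implementation) follows the same steps: an INO and an ONO form an OID, a simple hash of the OID modulo the number of placement groups yields a PGID, and a deterministic pseudo-random function stands in for CRUSH to select devices from the cluster map. The hash choice, the replica count, and the helper names are assumptions; the real CRUSH algorithm is hierarchical, weight-aware, and far more sophisticated.
Listing 2. From file to objects to placement groups to devices (illustrative sketch)
import hashlib
import random

NUM_PGS = 128          # placement groups in the pool (illustrative value)
NUM_OSDS = 12          # object storage devices in the cluster map
REPLICAS = 3           # copies kept of each placement group

def object_id(ino, ono):
    # INO + ONO -> OID: the file's inode number plus the object's index within the file.
    return f"{ino:x}.{ono:08x}"

def placement_group(oid, num_pgs=NUM_PGS):
    # A simple hash of the OID selects the placement group (PGID).
    digest = hashlib.sha1(oid.encode()).hexdigest()
    return int(digest, 16) % num_pgs

def crush_like_map(pgid, cluster_map, replicas=REPLICAS):
    # Stand-in for CRUSH: a pseudo-random but repeatable mapping of a PGID
    # to a set of devices, driven only by the PGID and the cluster map.
    rng = random.Random(pgid)          # deterministic: same PGID -> same devices
    return rng.sample(cluster_map, replicas)

cluster_map = [f"osd.{i}" for i in range(NUM_OSDS)]
ino = 0x1000043                        # inode number handed out by the MDS
for ono in range(3):                   # the file was striped into three objects
    oid = object_id(ino, ono)
    pgid = placement_group(oid)
    print(oid, "-> pg", pgid, "->", crush_like_map(pgid, cluster_map))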
Ceph metadata server
The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are stored in the object storage cluster, they are managed separately to support scalability. In fact, the metadata is further split among a cluster of metadata servers, which can adaptively replicate and distribute the namespace to avoid hot spots. As shown in Figure 4, the metadata servers manage portions of the namespace, and these portions can overlap (for redundancy and also for performance). The mapping of metadata servers to the namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespaces between metadata servers) while preserving locality for performance.
Figure 4. Partitioning of the Ceph namespace across metadata servers
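As a rough illustration of the idea, the sketch below maps directory subtrees to metadata servers with a longest-prefix table and hands a subtree to the least-loaded server when one becomes hot. The table, the load counters, and the threshold are invented for this example; Ceph's real dynamic subtree partitioning tracks load with decaying counters and migrates authority between metadata servers far more carefully.
Listing 3. A toy sketch of subtree-to-metadata-server partitioning
# Toy subtree-to-MDS table (illustrative; not Ceph's real data structures).
subtree_to_mds = {
    "/": 0,
    "/home": 1,
    "/home/build": 2,
}
load = {0: 0, 1: 0, 2: 0}     # requests seen per MDS (simplified: never decays)
HOT_THRESHOLD = 1000

def mds_for(path):
    # Longest-prefix match: the deepest subtree that owns this path wins.
    best = max((p for p in subtree_to_mds if path.startswith(p)), key=len)
    return subtree_to_mds[best]

def record_request(path):
    mds = mds_for(path)
    load[mds] += 1
    if load[mds] > HOT_THRESHOLD:
        migrate_hot_subtree(mds)
    return mds

def migrate_hot_subtree(busy_mds):
    # Hand one of the busy server's subtrees to the least-loaded MDS.
    target = min(load, key=load.get)
    victim = next(p for p, m in subtree_to_mds.items() if m == busy_mds)
    subtree_to_mds[victim] = target
    print(f"migrating {victim} from mds{busy_mds} to mds{target}")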
But because each metadata server simply manages the namespace for a population of clients, its primary application is as an intelligent metadata cache (because the actual metadata is ultimately stored within the object storage cluster). Metadata for writes is cached in a short-term journal, which is eventually pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if a metadata server fails, its journal can be replayed to ensure that the metadata is safely stored on disk.
Metadata servers manage the inode space, converting file names into metadata. A metadata server transforms a file name into an inode, a file size, and the striping data (layout) that the Ceph client uses for file I/O.
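In other words, the journal behaves like a write-ahead log: updates are appended and acknowledged quickly, flushed lazily, and replayed after a crash. The few lines below capture only that idea; the record format and the function names are assumptions and bear no relation to Ceph's on-disk layout.
Listing 4. The write-ahead behavior of the metadata journal (conceptual sketch)
journal = []        # short-term log of metadata updates
committed = {}      # metadata that has reached long-term storage

def update_metadata(key, value):
    journal.append((key, value))   # 1. append to the journal and acknowledge the client
    return "ack"

def flush_journal():
    # 2. periodically push journaled entries to long-term storage and trim the log
    while journal:
        key, value = journal.pop(0)
        committed[key] = value

def recover_after_failure(saved_journal):
    # 3. on MDS failure, replay the surviving journal so no acknowledged update is lost
    for key, value in saved_journal:
        committed[key] = value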
Ceph Monitor
Ceph includes monitors that manage the cluster map, but some elements of fault management are carried out in the object store itself. When object storage devices fail or new devices are added, the monitors detect the change and maintain a valid cluster map. This function is performed in a distributed fashion, in which map updates are communicated along with existing traffic. Ceph uses Paxos, a family of algorithms for distributed consensus.
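The sketch below is not Paxos; it is a deliberately simplified stand-in that shows the shape of the problem the monitors solve: a new cluster-map epoch is adopted only when a majority of monitors accept it, so a single failed monitor cannot block progress or fork the map. The class, the quorum rule, and the epoch numbering are assumptions for illustration only.
Listing 5. Majority agreement on cluster-map epochs (simplified; not real Paxos)
class Monitor:
    def __init__(self, name):
        self.name = name
        self.map_epoch = 0
        self.alive = True
    def accept(self, epoch):
        # A live monitor accepts a map strictly newer than what it has seen.
        return self.alive and epoch > self.map_epoch

def propose_map_update(monitors, new_epoch):
    votes = [m for m in monitors if m.accept(new_epoch)]
    if len(votes) > len(monitors) // 2:        # majority required
        for m in votes:
            m.map_epoch = new_epoch            # commit on the accepting majority
        return True
    return False

mons = [Monitor(f"mon.{c}") for c in "abc"]
mons[2].alive = False                          # one failed monitor does not block progress
print(propose_map_update(mons, new_epoch=1))   # True: 2 of 3 accepted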
Ceph Object Storage
Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. Object storage devices, in contrast, are intelligent devices that act as both targets and initiators, able to communicate and collaborate with other object storage devices.
From a storage perspective, Ceph object storage devices perform the mapping of objects to blocks (a task traditionally done at the file system layer in the client). This behavior allows the local entity to decide how best to store an object. Earlier versions of Ceph implemented a custom low-level file system on the local storage called EBOFS. This system implemented a nonstandard interface to the underlying storage, tuned for object semantics and other features (such as asynchronous notification of commits to disk). Today, the B-tree file system (BTRFS) can be used at the storage nodes, as it already implements some of the necessary features (such as embedded integrity).
Because the Ceph clients implement CRUSH and know nothing about the block mapping of files on disk, the underlying storage devices can safely manage the mapping of objects to blocks. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing failure recovery also allows the storage system to scale, because failure detection and recovery are distributed across the ecosystem. Ceph calls this RADOS (see Figure 3).
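A small sketch can show why distributing recovery in this way works. When a device drops out of the cluster map, every surviving node can recompute the placement for the placement groups it knows about and discover which replicas it now needs to copy, with no central metadata involved. The placement function below is the same naive stand-in used in Listing 2, which (unlike CRUSH) reshuffles more data than strictly necessary when the map changes; the function names and replica count are assumptions.
Listing 6. Recomputing placement after a device failure (illustrative sketch)
import random

def crush_like_map(pgid, cluster_map, replicas=2):
    # Self-contained copy of the stand-in placement function from Listing 2.
    rng = random.Random(pgid)
    return rng.sample(sorted(cluster_map), min(replicas, len(cluster_map)))

def recover_after_osd_failure(pgids, cluster_map, failed_osd):
    # Removing the failed device changes the cluster map; recomputing the
    # placement tells each surviving node which placement groups it must
    # now copy from a remaining replica -- no central metadata needed.
    new_map = [osd for osd in cluster_map if osd != failed_osd]
    moves = {}
    for pgid in pgids:
        before = set(crush_like_map(pgid, cluster_map))
        after = set(crush_like_map(pgid, new_map))
        if failed_osd in before:
            moves[pgid] = sorted(after - before)   # new home(s) for the lost replica
    return moves

cluster = [f"osd.{i}" for i in range(6)]
print(recover_after_osd_failure(range(8), cluster, failed_osd="osd.3"))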
Other interesting features
As if the dynamic and adaptive nature of the file system weren't enough, Ceph also implements some interesting features that are visible to the user. Users can create snapshots, for example, on any of Ceph's subdirectories (including all of their contents). File and capacity accounting can also be performed at the subdirectory level, reporting the storage size and number of files for a given subdirectory (and everything it contains).
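For the curious, the client-side interface to these features is refreshingly ordinary. In CephFS clients, a snapshot of a subdirectory is taken by creating an entry under its hidden .snap directory, and recursive size and file counts are exposed as virtual extended attributes; whether these interfaces are available depends on your Ceph and kernel versions. The sketch below assumes a CephFS file system mounted at /mnt/ceph/projects.
Listing 7. Snapshots and recursive accounting from the client side (assumes a mounted CephFS)
import os

MOUNT = "/mnt/ceph/projects"          # assumed CephFS mount point for this example

# Snapshot a subdirectory (and everything below it) by creating an entry
# in its hidden .snap directory.
os.mkdir(os.path.join(MOUNT, ".snap", "before-refactor"))

# Recursive accounting: CephFS exposes per-directory totals as virtual
# extended attributes (availability depends on client and version).
rbytes = os.getxattr(MOUNT, "ceph.dir.rbytes")   # total bytes under this subtree
rfiles = os.getxattr(MOUNT, "ceph.dir.rfiles")   # total files under this subtree
print(int(rbytes), "bytes in", int(rfiles), "files")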
Ceph's Status and future
Although Ceph is now integrated into the mainline Linux kernel, it is properly marked there as experimental. File systems in this state are useful for evaluation but are not yet ready for production environments. But given Ceph's inclusion in the Linux kernel and the motivation of its creators to continue its development, it should be ready to address your massive storage needs before long.
Other distributed file systems
Ceph is not unique in the distributed file system space, but it is unique in the way it manages a high-capacity storage ecosystem. Other examples of distributed file systems include the Google File System (GFS), the General Parallel File System (GPFS), and Lustre, to mention just a few. The ideas behind Ceph offer an interesting future for distributed file systems, as massive scale introduces unique challenges to the massive storage problem.
Looking to the future
Ceph is not just a file system but an object storage ecosystem with enterprise-class features. In the Resources section, you will find information on setting up a simple Ceph cluster, including a metadata server, an object storage server, and a monitor. Ceph fills a gap in distributed storage, and it will be interesting to see how this open source offering evolves in the future.
Resources
Learn
- The paper by Ceph's creators, "Ceph: A Scalable, High-Performance Distributed File System" (PDF), and Sage Weil's PhD dissertation, "Ceph: Reliable, Scalable, and High-Performance Distributed Storage" (PDF), reveal the original ideas behind Ceph.
- Storage Systems Research Center's Petabyte Storage website provides additional technical information about Ceph.
- Visit the Ceph home page to get the latest information.
- "Crush:controlled, scalable, decentralized Placement of replicated Data" (PDF) and "rados:a Scalable, Reliable Storage Serv Ice for Petabyte-scale Storage Clusters "(PDF) discusses the two most interesting aspects of the Ceph file system.
- The Ceph file system on the lwn.net provides a trial of the Ceph file system (including a series of interesting comments).
- "Building a Small ceph Cluster" describes how to build a ceph cluster, as well as asset allocation techniques. This article gives you an idea of how to acquire Ceph resources, build a new kernel, and then deploy the various elements of the ceph ecosystem.
- On the Paxos Wikipedia page, learn more about the family of consensus algorithms that Ceph's monitors use to agree among multiple distributed entities.
- In "Anatomy of the Linux virtual file system switch" (developerWorks, September 2009), learn more about the VFS, a flexible mechanism in Linux that allows multiple file systems to coexist.
- In "Next-generation Linux file systems: NiLFS(2) and exofs" (developerWorks, October 2009), learn more about exofs, another Linux file system that uses object storage. Exofs maps object-based storage devices into a traditional Linux file system.
- At the Btrfs kernel wiki site and in "Development of the Linux kernel" (developerWorks, March 2008), you can learn how to use BTRFS on an individual object storage node.
- In the developerWorks Linux zone, find more resources for Linux developers (including those new to Linux), and browse our most popular articles and tutorials.
- Check out all of the Linux tips and Linux tutorials on developerWorks.
- Stay current with developerWorks technical events and webcasts.
- Watch the developerWorks demo center, which includes product installation and setup demos for beginners, as well as advanced functionality for experienced developers.
Access to products and technologies
- Evaluate IBM products in the way that suits you best: download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the IBM SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
Discuss
Join the My DeveloperWorks community. View developer-driven blogs, forums, groups, and wikis, and communicate with other DeveloperWorks users.