1. Ceph File System Overview
Ceph began as a PhD research project on storage systems by Sage Weil at the University of California, Santa Cruz (UCSC).
Ceph is an open-source distributed storage system and has been part of the mainline Linux kernel since version 2.6.34.
1) Ceph Architecture
The Ceph ecosystem can be roughly divided into four parts (see Figure 1): clients (the data users), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and implements other key functions), and finally the cluster monitors (which perform monitoring).
Figure 1 Ceph Ecosystem
The client uses the metadata server to perform metadata operations (to determine the location of data). The metadata server manages the location of data and where new data should be stored. Note that the metadata itself is stored in the object storage cluster (marked as "metadata I/O"). Actual file I/O occurs between the client and the object storage cluster. In this way, higher-level POSIX functions (such as open, close, and rename) are managed through the metadata server, while POSIX functions such as read and write are handled directly by the object storage cluster.
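To make the split between the two paths concrete, the following toy Python sketch models a client talking to a stand-in metadata server for open() and to a stand-in object store for reads and writes. All class and method names here are hypothetical illustrations, not Ceph's actual interfaces.

class MetadataServer:
    """Stands in for the metadata server: resolves names to locations, never touches file data."""
    def __init__(self):
        self._table = {}          # path -> (inode number, layout hint)
        self._next_ino = 1

    def open(self, path):
        if path not in self._table:
            self._table[path] = (self._next_ino, {"stripe_unit": 4 << 20})
            self._next_ino += 1
        return self._table[path]  # the client learns *where* data lives, not the data itself


class ObjectStorageCluster:
    """Stands in for the object storage cluster: stores the actual bytes."""
    def __init__(self):
        self._objects = {}

    def write(self, oid, data):
        self._objects[oid] = data

    def read(self, oid):
        return self._objects.get(oid, b"")


# POSIX-style open/rename go to the metadata server; read/write go to the object store.
mds, osds = MetadataServer(), ObjectStorageCluster()
ino, layout = mds.open("/home/alice/report.txt")   # metadata path
osds.write(f"{ino:x}.00000000", b"hello ceph")     # data path, bypasses the metadata server
print(osds.read(f"{ino:x}.00000000"))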
Figure 2 provides another view of the architecture. A set of servers accesses the Ceph ecosystem through a client interface, which understands the relationship between the metadata servers and the object-level storage. The distributed storage system can be viewed in layers: a storage device format (the Extent and B-tree-based Object File System [EBOFS] or an alternative), and above it a management layer designed to handle data replication, failure detection, recovery, and subsequent data migration, called Reliable Autonomic Distributed Object Storage (RADOS). Finally, monitors are used to identify component failures and issue the corresponding notifications.
Figure 2 Ceph Architecture View
2) Ceph Components
With the conceptual architecture of Ceph in mind, you can drill down another level to look at the major components implemented in Ceph. One important difference between Ceph and traditional file systems is that the intelligence is placed in the ecosystem rather than in the file system itself.
Figure 3 shows a simple Ceph ecosystem. The Ceph Client is the user of the Ceph file system. The Ceph Metadata Daemon provides the metadata server, while the Ceph Object Storage Daemon provides the actual storage (for both data and metadata). Finally, the Ceph Monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, and many metadata servers (depending on the capacity of the file system), along with at least a redundant pair of monitors. So how is this file system distributed?
Figure 3 Simple Ceph Ecosystem
3) Ceph Client
Because Linux exposes a common file system interface (through the virtual file system switch [VFS]), the user's view of Ceph is transparent. The administrator's view is certainly different, given that many servers may make up the storage system (see the references for information on creating a Ceph cluster). From the user's point of view, they access a large-capacity storage system without being aware of the metadata servers, monitors, and independent object storage devices that are aggregated into one large storage pool. The user simply sees a mount point, at which standard file I/O can be performed.
The Ceph file system, or at least the client interface, is implemented in the Linux kernel. It is worth noting that in most file systems, all of the control and intelligence resides in the kernel's file system source itself. In Ceph, however, the file system's intelligence is distributed across the nodes, which simplifies the client interface and gives Ceph the ability to scale massively (and even dynamically).
Ceph uses an interesting alternative to relying on allocation lists (metadata that maps blocks on a disk to a given file). From the Linux perspective, a file is assigned an inode number (INO) by the metadata server, which is a unique identifier for the file. The file is then carved into some number of objects (based on the file size). Using the INO and an object number (ONO), each object is assigned an object ID (OID). A simple hash over the OID assigns each object to a placement group. The placement group (identified by a PGID) is a conceptual container of objects. Finally, the mapping of placement groups to object storage devices is a pseudo-random mapping that uses an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, the mapping of placement groups (and their replicas) to storage devices does not rely on any metadata, only on a pseudo-random mapping function. This behavior is ideal because it minimizes storage overhead and simplifies distribution and data lookup.
The final component of allocation is the cluster map. The cluster map is a compact representation of the devices that make up the storage cluster. With a PGID and the cluster map, you can locate any object.
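The pipeline described above (file to objects, objects to placement group, placement group to devices) can be sketched in a few lines of Python. The hash and the seeded pseudo-random selection below are simple stand-ins for Ceph's real CRUSH algorithm, and the constants (placement group count, device list, replica count) are made up for illustration.

import hashlib
import random

PG_NUM = 128                                   # placement groups in the pool (illustrative)
CLUSTER_MAP = [f"osd.{i}" for i in range(12)]  # compact description of the devices
REPLICAS = 3

def object_id(ino: int, ono: int) -> str:
    """Object ID derived from the inode number and the object number."""
    return f"{ino:x}.{ono:08x}"

def placement_group(oid: str) -> int:
    """PGID from a simple hash of the OID."""
    return int(hashlib.sha1(oid.encode()).hexdigest(), 16) % PG_NUM

def place(pgid: int, cluster_map, n=REPLICAS):
    """Deterministic pseudo-random mapping of a placement group to devices.
    Anyone holding the same cluster map computes the same answer, so no
    per-object metadata lookup is needed."""
    rng = random.Random(pgid)                  # seeded by PGID -> repeatable result
    return rng.sample(cluster_map, n)

oid = object_id(ino=0x10000003456, ono=2)
pgid = placement_group(oid)
print(oid, "-> pg", pgid, "->", place(pgid, CLUSTER_MAP))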
4) Ceph Metadata Server
The metadata server (cmds) manages the file system namespace. Although both metadata and data are stored in the object storage cluster, the two are managed separately to support scalability. In fact, the metadata is further split across a cluster of metadata servers, which can adaptively replicate and distribute the namespace to avoid hot spots. As Figure 4 shows, metadata servers manage portions of the namespace, and these portions can overlap (for redundancy and also for performance). The mapping of metadata servers to the namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespaces between metadata servers) while preserving performance.
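The idea of dynamic subtree partitioning can be illustrated with a toy mapping from directory subtrees to metadata server ranks. This is only a sketch of the concept, with a made-up assignment table; it is not Ceph's actual algorithm.

subtree_to_mds = {"/": 0, "/home": 1, "/var/log": 2}   # hypothetical subtree assignment

def mds_for(path: str) -> int:
    """Longest-prefix match: the deepest assigned subtree owns the path."""
    best = "/"
    for prefix in subtree_to_mds:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return subtree_to_mds[best]

print(mds_for("/home/alice/report.txt"))   # -> 1
print(mds_for("/etc/hosts"))               # -> 0

# Migrating a hot subtree only updates this mapping; the file data never moves,
# because the metadata itself still lives in the object storage cluster.
subtree_to_mds["/home/alice"] = 3
print(mds_for("/home/alice/report.txt"))   # -> 3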
Figure 4 Partitioning of the Ceph Namespace Across Metadata Servers
However, because each metadata server simply manages the namespace for its population of clients, its primary role is that of an intelligent metadata cache (the actual metadata is ultimately stored within the object storage cluster). Metadata for write operations is cached in a short-term journal, which is eventually pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if a metadata server fails, its journal can be replayed to ensure that the metadata is safely stored on disk.
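A minimal sketch of this journal-then-apply behavior, with hypothetical names and a hypothetical journal file, might look like the following: each metadata update is appended to a log before being applied in memory, so a restarted server can rebuild its state by replaying the log.

import json

class JournalingMDS:
    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.metadata = {}                       # in-memory cache of recent metadata

    def set_attr(self, path, key, value):
        entry = {"path": path, "key": key, "value": value}
        with open(self.journal_path, "a") as j:  # 1. persist the intent first
            j.write(json.dumps(entry) + "\n")
        self.metadata.setdefault(path, {})[key] = value   # 2. then apply it

    def replay(self):
        """After a crash, rebuild the cache by replaying the journal."""
        self.metadata.clear()
        with open(self.journal_path) as j:
            for line in j:
                e = json.loads(line)
                self.metadata.setdefault(e["path"], {})[e["key"]] = e["value"]

mds = JournalingMDS("mds.journal")
mds.set_attr("/home/alice/report.txt", "size", 4096)
mds.replay()                                     # simulates recovery after a failure
print(mds.metadata)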
The metadata server manages the inode space, converting file names into metadata. It transforms a file name into an inode, a file size, and the striping layout that the Ceph client uses for file I/O.
5) Ceph Monitor
Ceph includes monitors that implement management of the cluster map, although some elements of fault management are implemented in the object storage itself. When object storage devices fail or new devices are added, the monitors detect the change and maintain a valid cluster map. This function is performed in a distributed fashion, in which map updates are communicated alongside existing traffic. For this, Ceph uses Paxos, a family of distributed consensus algorithms.
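The role of the monitors can be pictured with a highly simplified quorum sketch. Real Ceph monitors run Paxos; the toy majority vote below only illustrates the idea that a cluster-map change takes effect once a majority of monitors accept a new map epoch. All names and structures here are hypothetical.

class Monitor:
    def __init__(self, name):
        self.name = name
        self.epoch = 0
        self.cluster_map = {"osds": ["osd.0", "osd.1", "osd.2"]}

    def propose(self, peers, new_map):
        """Propose new_map as epoch+1; commit only if a majority of monitors accept."""
        proposed_epoch = self.epoch + 1
        votes = 1 + sum(p.accept(proposed_epoch) for p in peers)
        if votes > (len(peers) + 1) // 2:         # strict majority, counting self
            for m in [self] + peers:
                m.epoch, m.cluster_map = proposed_epoch, new_map
            return True
        return False

    def accept(self, proposed_epoch):
        return proposed_epoch > self.epoch        # accept only newer epochs

mons = [Monitor("mon.a"), Monitor("mon.b"), Monitor("mon.c")]
ok = mons[0].propose(mons[1:], {"osds": ["osd.0", "osd.1", "osd.2", "osd.3"]})
print(ok, mons[2].epoch, mons[2].cluster_map)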
6) Ceph Object Storage
Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. A traditional drive is a simple target that only responds to commands from an initiator. An object storage device, however, is an intelligent device that acts as both a target and an initiator, able to communicate and collaborate with other object storage devices.
From a storage perspective, a Ceph object storage device performs the mapping of objects to blocks (a task traditionally done at the file system layer of the client). This behavior allows the local entity to decide how best to store an object. Early versions of Ceph implemented a custom low-level file system on the local storage called EBOFS. This system implemented a non-standard interface to the underlying storage that was tuned for object semantics and other features (such as asynchronous notification of commits to disk). Today, the B-tree File System (BTRFS) can be used at the storage nodes, as it already implements some of the necessary features (such as embedded integrity).
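The local object-to-block mapping can be pictured as the object storage daemon writing each object through a local file system, which in turn maps the bytes to disk blocks. The sketch below is illustrative only; the directory layout and names are made up and are not Ceph's on-disk format.

import hashlib
import os

class LocalObjectStore:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, oid: str) -> str:
        # Hash the object ID into a shallow directory fan-out so that one
        # directory never has to hold millions of entries.
        bucket = hashlib.sha1(oid.encode()).hexdigest()[:2]
        os.makedirs(os.path.join(self.root, bucket), exist_ok=True)
        return os.path.join(self.root, bucket, oid)

    def write(self, oid: str, data: bytes) -> None:
        with open(self._path(oid), "wb") as f:
            f.write(data)                 # the local file system maps these bytes to blocks

    def read(self, oid: str) -> bytes:
        with open(self._path(oid), "rb") as f:
            return f.read()

store = LocalObjectStore("osd.0_data")
store.write("10000003456.00000002", b"object payload")
print(store.read("10000003456.00000002"))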
Because Ceph clients implement CRUSH and have no knowledge of the file-to-block mapping on the disks, the underlying storage devices can safely manage the mapping of objects to blocks. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing failure recovery also allows the storage system to scale, because failure detection and recovery are spread across the ecosystem. Ceph calls this RADOS.