Original address: http://support.huawei.com/ecommunity/bbs/10253434.html
1. GlusterFS Overview
GlusterFS is the core of Gluster, a scale-out storage solution. It is an open-source distributed file system with strong scale-out capability, able to support petabytes of storage capacity and serve thousands of clients. GlusterFS aggregates physically distributed storage resources over TCP/IP or InfiniBand RDMA networks and manages the data under a single global namespace. GlusterFS is based on a stackable user-space design and delivers excellent performance for a variety of data workloads.
Figure 1: GlusterFS unified mount point
GlusterFS supports standard clients running standard applications over any standard IP network, as shown in Figure 2; users can access application data in a globally unified namespace using standard protocols such as NFS/CIFS. GlusterFS lets users move away from costly, closed legacy storage systems and, using inexpensive commodity storage devices, deploy a centrally managed, scale-out, virtualized storage pool that can grow to TB/PB scale. The main features of GlusterFS are as follows:
• Scalability and High Performance
GlusterFS leverages two features to provide highly scalable storage solutions from several terabytes to petabytes. The scale-out architecture increases storage capacity and performance simply by adding resources; disk, compute, and I/O resources can be added independently, and high-speed interconnects such as 10GbE and InfiniBand are supported. The Gluster elastic hash (Elastic Hash) removes GlusterFS's need for a metadata server, eliminating that single point of failure and performance bottleneck and enabling truly parallel data access.
• High Availability
GlusterFS automatically replicates files, through mirroring or multiple copies, so that data remains accessible even in the event of hardware failure. The self-healing feature restores data to the correct state, and repairs run incrementally in the background with almost no performance overhead. Instead of designing its own private on-disk format, GlusterFS stores files using the operating system's mainstream standard disk file systems (such as EXT3 or ZFS), so data can be replicated and accessed with a variety of standard tools.
• Global Unified Namespace
The global unified namespace aggregates disk and memory resources into a single virtual storage pool, hiding the underlying physical hardware from upper-level users and applications. Storage resources in the virtual pool can be scaled elastically as needed, expanding or shrinking on demand. When storing virtual machine images, there is no limit on the number of image files, and thousands of virtual machines can share data through a single mount point. Virtual machine I/O is automatically load-balanced across all servers in the namespace, eliminating the access hotspots and performance bottlenecks common in SAN environments.
• Elastic Hashing Algorithm
Instead of a centralized or distributed metadata server index, GlusterFS uses an elastic hashing algorithm to locate data in the storage pool. In other scale-out storage systems, metadata servers often become I/O performance bottlenecks and single points of failure. In GlusterFS, every storage system in the scale-out configuration can intelligently locate any data shard without consulting an index or querying other servers. This fully parallelizes data access and achieves true linear performance scaling.
• Flexible Volume Management
Data is stored in logical volumes, which are logically partitioned from the virtualized physical storage pool. Storage servers can be added and removed online without interrupting applications. Logical volumes can grow and shrink across all configured servers, can be migrated between servers for capacity balancing, and systems can be added or removed, all online. File system configuration changes can also be made online and applied in real time, adapting to changing workload conditions or enabling online performance tuning.
• Standards-Based Protocols
The Gluster storage service supports NFS, CIFS, HTTP, FTP, and the Gluster native protocol, and is fully POSIX-compliant. Existing applications can access data in Gluster without any modification or a dedicated API. This is especially useful when deploying Gluster in a public cloud environment: Gluster abstracts away the cloud provider's specific API and presents a standard POSIX interface.
2. Design Goals
GlusterFS's design ideas differ significantly from those of existing parallel/cluster/distributed file systems. Without a fundamental breakthrough in its design, GlusterFS could hardly compete with Lustre, PVFS2, Ceph, and the like, let alone commercial file systems with years of technical refinement and market presence such as GPFS, StorNext, Isilon, and IBRIX. Its core design goals include the following three:
• Elastic Storage System (Elasticity)
The elasticity of a storage system means that an organization can flexibly increase or decrease data storage, and add or remove resources from the storage pool, according to business needs and without disrupting system operation. Elasticity is one of GlusterFS's design goals: data volumes can be created and deleted dynamically, expanded or shrunk, and storage servers added or removed, all without affecting system uptime or business services. In earlier versions of GlusterFS some of these management operations required interruptions, but the latest 3.1.X releases are elastic enough to satisfy applications with demanding storage elasticity requirements, cloud storage service systems in particular. GlusterFS achieves this goal mainly through storage virtualization technology and logical volume management.
• Linear Scale-Out
Linear scaling is very difficult for storage systems to achieve; the relationship between system scale and performance improvement typically follows a logarithmic curve, because the added overhead consumes part of the performance gain. Many parallel/cluster/distributed file systems today are highly scalable: Lustre can reach more than 1,000 storage nodes and more than 25,000 clients, which is very impressive, but Lustre does not scale linearly.
Vertical scaling (scale-up) aims to increase the storage capacity or performance of a single node, but it runs into various theoretical and physical limits and cannot satisfy large storage requirements. Horizontal scaling (scale-out) increases the capacity or performance of the whole system by adding storage nodes; it is currently a hot topic in storage technology and can effectively address capacity, performance, and other storage needs. Most current parallel/cluster/distributed file systems have scale-out capability.
GlusterFS has a linear scale-out architecture that delivers linear gains in storage capacity and performance as storage nodes are added. Combined with vertical scaling, GlusterFS can achieve multi-dimensional scalability: adding disks to each node increases storage capacity, while adding storage nodes improves performance, yielding a virtual storage pool with more disk, memory, and I/O resources, larger capacity, and higher performance. GlusterFS uses three basic techniques to achieve its linear scale-out capability:
1) Eliminating metadata services
2) Efficient data distribution for scalability and reliability
3) Maximizing performance through the parallelism of a fully distributed architecture
• High Reliability
Like GFS (Google File System), GlusterFS can be built on ordinary servers and storage devices, so reliability is especially critical. GlusterFS incorporated reliability into its core design from the outset and uses a variety of techniques to achieve this goal. First, it assumes that failures are normal events, including hardware and disk failures, network failures, and data corruption caused by administrator error; GlusterFS supports automatic replication and automatic repair to ensure data reliability without administrator intervention. Second, GlusterFS leverages the journaling capabilities of the underlying disk file systems such as EXT3 and ZFS to provide a degree of data reliability, rather than reinventing the wheel. Third, GlusterFS has no metadata server, so it needs no metadata synchronization or consistency maintenance, which greatly reduces system complexity and improves not only performance but also system reliability.
3. Technical Features
GlusterFS differs significantly from traditional storage systems and other distributed file systems in its technical implementation, mainly in the following respects.
• Full Software Implementation (Software Only)
GlusterFS takes the view that storage is a software problem and cannot be solved by restricting users to particular vendors or hardware configurations. GlusterFS is openly designed to support industry-standard storage, networking, and compute equipment rather than being bundled with customized, dedicated hardware. For commercial customers, GlusterFS can be delivered as a virtual appliance, packaged in a virtual machine container, or deployed as an image in a public cloud. In the open-source community, GlusterFS is widely deployed on a variety of operating systems on inexpensive or idle hardware, forming a centralized, unified pool of virtual storage resources. In short, GlusterFS is an open, software-only implementation, completely independent of hardware and operating systems.
• Complete Storage Operating System Stack
GlusterFS provides not only a distributed file system but also many other important distributed functions, such as distributed memory management, I/O scheduling, software RAID, and self-healing. GlusterFS draws on microkernel architecture and the design ideas of the GNU/Hurd operating system to implement a complete storage operating system stack in user space.
• User-Space Implementation
Unlike traditional file systems, GlusterFS is implemented in user space, which makes it particularly easy to install and upgrade. It also greatly lowers the barrier for ordinary users to modify GlusterFS from source: only general C programming skills are needed, with no special kernel programming experience.
• Modular Stackable Architecture
GlusterFS features a modular, stackable architecture through which flexible configuration supports highly customized applications, such as large-file storage, massive small-file storage, cloud storage, multi-protocol access, and more. Each function is implemented as a module, and modules are combined like building blocks to realize complex functionality. For example, the Replicate module implements RAID1 and the Stripe module implements RAID0; combining the two yields RAID10 or RAID01, providing both high performance and high reliability.
• Native Format Data Storage (Data Stored in Native Formats)
GlusterFS stores data in native format on standard disk file systems such as EXT3, EXT4, XFS, and ZFS, and implements a variety of automatic data-repair mechanisms. The system is therefore highly resilient, and data remains accessible with standard tools even when the system is offline. If users need to migrate data out of GlusterFS, the data can be used directly without any modification.
• No Metadata Server Design (No Metadata with the Elastic Hash Algorithm)
One of the biggest challenges for scale-out storage systems is recording the mapping between the logical data and its physical location, i.e., the metadata, which may also include information such as attributes and access permissions. Traditional distributed storage systems use centralized or distributed metadata services to maintain metadata: a centralized metadata service creates a single point of failure and a performance bottleneck, while distributed metadata services suffer from performance overhead and metadata-consistency problems. For workloads with huge numbers of small files in particular, metadata is a very serious challenge.
GlusterFS is unique in having no metadata service: it uses an algorithm to locate files, and metadata is stored alongside the data rather than separately. Every storage server in the cluster can intelligently locate a file's data shards using only the file name and path and the algorithm, without querying an index or other servers. This allows data access to be fully parallelized, achieving true linear performance scaling. The no-metadata-server design greatly improves the performance, reliability, and stability of GlusterFS.
4. Overall Architecture and Design
Figure 2: GlusterFS architecture and components
The overall GlusterFS architecture and its components are shown in Figure 2; it consists mainly of storage servers (brick servers), clients, and NFS/Samba storage gateways. Notably, the GlusterFS architecture has no metadata server component; this is its most significant design feature and is decisive for the performance, reliability, and stability of the whole system. GlusterFS supports TCP/IP and InfiniBand RDMA high-speed interconnects. Clients can access data through the native GlusterFS protocol, while terminals that do not run the GlusterFS client can access data via the storage gateway using the standard NFS/CIFS protocols.
The storage servers provide the basic data storage function, and file data is ultimately distributed across different storage servers according to a unified scheduling policy. They run the glusterfsd daemon and handle data service requests from other components. As mentioned earlier, data is stored directly in native format on the server's local file system, such as EXT3, EXT4, XFS, or ZFS, with the data storage path specified when the service is started. Multiple storage servers can be clustered by the volume manager on the client or storage gateway into Stripe (RAID0), Replicate (RAID1), and DHT (distributed hash) storage clusters, and these can be nested and combined into more complex clusters such as RAID10.
Because there is no metadata server, the client takes on more functions, including data volume management, I/O scheduling, file location, and data caching. The client runs the glusterfs process, which is actually a symbolic link to glusterfsd, and uses the FUSE (File system in User Space) module to mount GlusterFS on the local file system, providing POSIX-compliant access to data. In the latest 3.1.X releases the client no longer maintains volume configuration information on its own; it obtains and updates it automatically from the glusterd elastic volume management service running on the gateway, which greatly simplifies volume management. The GlusterFS client carries a higher load than clients of traditional distributed file systems, in both CPU utilization and memory consumption.
The GlusterFS storage gateway provides elastic volume management and NFS/CIFS access-proxy functionality. It runs the glusterd and glusterfs processes, both of which are symbolic links to glusterfsd. The volume manager is responsible for creating, deleting, expanding, shrinking, and rebalancing logical volumes, and for providing clients with logical volume information and proactive update notifications. GlusterFS 3.1.X enables flexible, automated management of logical volumes without interrupting data services or upper-level applications. Windows clients, or clients without GlusterFS installed, must go through the NFS/CIFS proxy on the gateway, which is then configured as an NFS or Samba server. Compared with the native client, gateway access is limited by NFS/Samba performance.
Figure 3: GlusterFS modular stackable design
GlusterFS has a modular, stackable architecture, as shown in Figure 3. The module, called a translator, is a powerful mechanism provided by GlusterFS: through this well-defined interface, file system functionality can be extended efficiently and easily. The server and client module interfaces are compatible, and the same translator can be loaded on both sides at the same time. Each translator is a shared-object (.so) dynamic library, loaded at runtime according to the configuration. Each module implements a specific basic function, and all functions in GlusterFS are implemented through translators, such as Cluster, Storage, Performance, Protocol, and Features; basic, simple modules can be stacked and combined to achieve complex functionality. This design idea borrows from the GNU/Hurd microkernel's virtual file system design, in which accesses to an external system are translated into appropriate calls on the target system. Most modules run on the client side, such as the cluster (aggregation) translators, I/O schedulers, and performance optimizations, while the server side is relatively simple. Both the client and the storage server have their own storage stacks, forming a translator function tree that applies several modules. The modular, stackable architecture greatly reduces design complexity and simplifies implementation, upgrades, and maintenance.
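To make the stacking idea concrete, here is a minimal Python sketch, not GlusterFS code, of how translators wrap one another into a function tree, each layer either handling a call itself or passing it to its child; the translator names and the export path are illustrative assumptions only.

class Translator:
    """A node in the translator stack: wraps a child and forwards calls."""
    def __init__(self, child=None):
        self.child = child

    def read(self, path, offset, size):
        return self.child.read(path, offset, size)

class IoCache(Translator):
    """Toy performance translator: caches reads keyed by (path, offset, size)."""
    def __init__(self, child):
        super().__init__(child)
        self.cache = {}

    def read(self, path, offset, size):
        key = (path, offset, size)
        if key not in self.cache:
            self.cache[key] = self.child.read(path, offset, size)
        return self.cache[key]

class PosixStorage(Translator):
    """Toy storage translator: serves data from the local file system."""
    def __init__(self, export_dir):
        super().__init__()
        self.export_dir = export_dir

    def read(self, path, offset, size):
        with open(self.export_dir + path, "rb") as f:
            f.seek(offset)
            return f.read(size)

# Stack the modules bottom-up, the way a volume configuration would:
# io-cache on top of posix storage. Real GlusterFS loads each layer as a .so
# according to the volume file; the principle of composition is the same.
volume = IoCache(PosixStorage("/data/brick1"))
# volume.read("/file.txt", 0, 4096) would check the cache, then fall through to the storage layer.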
5. Elastic Hashing Algorithm
For a distributed system, metadata handling is key to its scalability, performance, and stability. GlusterFS takes a novel approach: it abandons metadata services entirely and uses an elastic hashing algorithm in place of the centralized or distributed metadata services of traditional distributed file systems. This solves the metadata problem at its root, yielding near-linear scalability while also improving system performance and reliability. GlusterFS locates data by algorithm: any server or client in the cluster can locate, read, and write a file based only on its path and file name. In other words, GlusterFS does not need to separate metadata from data, because file location can be performed independently and in parallel. The data access process in GlusterFS is as follows:
1. Compute the hash value, taking the file path and file name as input;
2. Select a subvolume (storage server) in the cluster according to the hash value, locating the file;
3. Perform the data access on the selected subvolume.
GlusterFS currently uses the Davies-Meyer algorithm to compute a 32-bit integer hash of the file name. The Davies-Meyer algorithm has a very good hash distribution and high computational efficiency. Suppose the logical volume contains N storage servers: the 32-bit integer space is divided evenly into N contiguous subspaces, each mapped to one storage server. The computed 32-bit hash value thus projects onto exactly one storage server, which is the subvolume to select. Is it really that simple? Consider what happens when storage nodes are added or removed, or when files are renamed: how does GlusterFS solve these problems, and how elastic is it really?
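Before turning to those questions, here is a minimal sketch of the basic mapping, for illustration only: it substitutes Python's standard hashlib for the Davies-Meyer hash that GlusterFS actually uses, and divides the 32-bit space evenly among N subvolumes as described above; the brick names are made up.

import hashlib

def hash32(filename: str) -> int:
    """Stand-in for GlusterFS's Davies-Meyer hash: any uniform 32-bit hash works here."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")            # 32-bit value

def select_subvolume(filename: str, subvolumes: list) -> str:
    """Map the 32-bit hash onto N equal, contiguous ranges, one per subvolume."""
    n = len(subvolumes)
    width = 2**32 // n                                   # size of each hash range
    index = min(hash32(filename) // width, n - 1)        # last range absorbs the remainder
    return subvolumes[index]

bricks = ["server1:/brick1", "server2:/brick1", "server3:/brick1"]
print(select_subvolume("report.pdf", bricks))            # always the same brick for this name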
If a new storage node is added to a logical volume and nothing else is done, the hash mapping space changes, existing files and directories may map to other storage servers, and lookups fail. One solution is to redistribute the files and move them to the correct storage servers, but this greatly increases system load, especially for mass storage systems that already hold large amounts of data. Another is to use a consistent hashing algorithm and modify only the hash mapping space of the new node and its neighbors, so that only part of the neighboring nodes' data needs to move to the new node; the impact is relatively small, but it introduces another problem: overall load imbalance. GlusterFS adopts neither method and instead designs a more elastic algorithm. GlusterFS's hash distribution is based on directories: a file's parent directory records the subvolume mapping information in its extended attributes, and the files directly under it are distributed across the storage servers in that parent directory's mapping. Because the directory's distribution information is fixed in advance, a new node does not affect the placement of existing files; it participates in storage distribution scheduling only for directories created afterwards. With this design the new node does not require any files to be moved, but load balancing is not smoothed out, and the old nodes carry a heavier load. GlusterFS addresses this in its design as well: new files are preferentially created on the node with the lightest capacity load, and a file link is created on the target (hashed) storage node pointing to the node that actually stores the file. In addition, the GlusterFS elastic volume management tool can perform load smoothing manually in the background, moving and redistributing files, after which all storage servers take part in scheduling.
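A small sketch of that per-directory idea, with invented layouts and reusing the hash32 helper from the previous sketch (GlusterFS keeps the real mapping in each directory's extended attributes): a directory created before the expansion keeps its old two-brick layout, so its files never need to move, while a directory created afterwards uses all three bricks.

# Hypothetical per-directory layout table: directory -> list of (range_start, range_end, brick).
layouts = {
    "/old_dir": [(0x00000000, 0x7FFFFFFF, "brick1"),       # created when only 2 bricks existed
                 (0x80000000, 0xFFFFFFFF, "brick2")],
    "/new_dir": [(0x00000000, 0x55555554, "brick1"),        # created after brick3 was added
                 (0x55555555, 0xAAAAAAA9, "brick2"),
                 (0xAAAAAAAA, 0xFFFFFFFF, "brick3")],
}

def locate(parent_dir: str, name: str) -> str:
    """Find the brick for a file by consulting its parent directory's layout."""
    h = hash32(name)                                        # hash32 from the previous sketch
    for start, end, brick in layouts[parent_dir]:
        if start <= h <= end:
            return brick
    raise RuntimeError("hash not covered by layout")        # cannot happen with a complete layout

# Files under /old_dir never land on brick3, so nothing moves when brick3 joins;
# only directories created after the expansion (like /new_dir) use the wider layout.
print(locate("/old_dir", "a.log"), locate("/new_dir", "a.log"))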
GlusterFS currently has limited support for removing storage nodes and cannot yet do it fully unattended. If a node is removed directly, the files on that storage server can no longer be browsed or accessed, and creating files and directories will fail. There are currently two manual workarounds: copy the data on the node back into GlusterFS, or replace the removed node with a new node that keeps the original data.
If a file is renamed, the hash algorithm obviously produces a different value, and the file will very likely map to a different storage server, causing access to fail. Moving the data is hard to do in real time for large files. To avoid hurting performance or interrupting service, GlusterFS handles renames with file links: a link file pointing to the server that actually stores the file is created on the target (newly hashed) storage server, and the system resolves it and redirects the access. In addition, the file is migrated in the background, and the link file is deleted automatically once the migration succeeds. File moves are handled similarly. The advantage is that the foreground operation completes in real time, while the physical data migration is deferred to the background and executed at an appropriate time.
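The link-file mechanism just described (also used when a new file is deliberately created on a lightly loaded brick) can be pictured with a toy sketch; the catalog structure and names are invented, and the "linkto" field stands in for the marker that GlusterFS keeps on its zero-length link file.

# Hypothetical per-brick catalogs. A "linkto" entry points to the brick that really holds the data.
brick_contents = {
    "brick1": {"report_v2.pdf": {"linkto": "brick2"}},              # the new name hashes to brick1
    "brick2": {"report_v2.pdf": {"data": b"...file bytes..."}},     # the data stayed where it was
}

def lookup(name: str, hashed_brick: str) -> bytes:
    """Resolve a file, following at most one link-file redirection."""
    entry = brick_contents[hashed_brick][name]
    if "linkto" in entry:                                   # found a pointer, not the data
        entry = brick_contents[entry["linkto"]][name]       # redirect to the real brick
    return entry["data"]

print(lookup("report_v2.pdf", "brick1"))                    # served from brick2 via the link file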
Figure 4: GlusterFS elastic volume management
The elastic hashing algorithm assigns files to logical volumes; how, then, does GlusterFS assign physical volumes to logical volumes? GlusterFS 3.1.X implements true elastic volume management, as shown in Figure 4. Storage volumes are abstractions of the underlying hardware that can be grown, shrunk, and migrated across physical systems as needed. Storage servers can be added and removed online, with data automatically rebalanced across the cluster, and data stays online and available with no application interruption. File system configuration updates can also be performed online, and configuration changes propagate quickly and dynamically through the cluster, automatically adapting to load fluctuations and performance tuning.
The elastic hashing algorithm itself provides no data fault tolerance; GlusterFS relies on mirroring or replication to ensure data availability, and mirroring or 3-way replication is recommended. In replication mode, a storage server replicates writes synchronously to the other storage servers, so a single server failure is completely transparent to clients. GlusterFS places no limit on the number of replicas, and reads are spread across all mirrored storage nodes to improve read performance. The elastic hashing algorithm assigns a file to a unique logical volume, while replication keeps the data on at least two different storage nodes; combining the two makes GlusterFS more elastic.
6. Translators
As mentioned earlier, translators are a powerful file system extension mechanism provided by GlusterFS, a design borrowed from the GNU/Hurd microkernel operating system. All functionality in GlusterFS is implemented through the translator mechanism; translators are loaded as dynamic libraries at runtime, and the server and client sides are mutually compatible. GlusterFS 3.1.X mainly includes the following types of translators:
(1) Cluster: storage cluster distribution; currently AFR, DHT, and Stripe
(2) Debug: tracing of GlusterFS internal functions and system calls
(3) Encryption: simple data encryption
(4) Features: access control, locking, Mac compatibility, quiesce, quota, read-only, trash (recycle bin), and so on
(5) Mgmt: elastic volume management
(6) Mount: FUSE interface implementation
(7) NFS: internal NFS server
(8) Performance: io-cache, io-threads, quick-read, read-ahead, stat-prefetch, symlink-cache, write-behind, and other performance optimizations
(9) Protocol: server and client protocol implementations
(10) Storage: implementation of the POSIX interface on top of the underlying file system
Here we focus on the cluster translators, which are the core of GlusterFS cluster storage and comprise three types: AFR (Automatic File Replication), DHT (Distributed Hash Table), and Stripe.
AFR is equivalent to RAID1: multiple copies of the same file are kept on multiple storage nodes, mainly for high availability and automatic data repair. AFR presents the same namespace on all of its subvolumes; a file is looked up starting from the first node and continuing until the lookup succeeds or the last node has been searched. When reading data, AFR schedules requests across all storage nodes for load balancing to improve system performance. When writing data, the file is first locked on all lock servers; the first node is the lock server by default, and more can be specified. AFR then writes the data to all servers as a change-log transaction, and deletes the log and releases the lock on success. AFR automatically detects and repairs inconsistencies between copies of the same file, using the change log to determine which copy is good. Automatic repair is triggered the first time a file or directory is accessed: for a directory, the correct data is copied to all subvolumes; for a file, it is created if missing, repaired if its metadata does not match, and updated if the change log says so.
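A rough Python sketch of that write path, with toy replica objects standing in for subvolumes: lock on the lock server, mark the operation pending in each copy's change log, write all copies synchronously, then clear the log and unlock. Real AFR keeps its change log in extended attributes and handles many more failure cases.

class Replica:
    """Toy stand-in for one AFR subvolume."""
    def __init__(self):
        self.files, self.changelog, self.locked = {}, {}, set()

def afr_write(replicas, path, data):
    lock_server = replicas[0]                  # by default the first subvolume arbitrates locks
    lock_server.locked.add(path)
    try:
        for r in replicas:                     # 1) mark the operation as pending everywhere
            r.changelog[path] = "pending"
        for r in replicas:                     # 2) synchronously write every copy
            r.files[path] = data
        for r in replicas:                     # 3) success: clear the pending marker
            r.changelog.pop(path, None)
    finally:
        lock_server.locked.discard(path)       # 4) release the lock

# If a replica fails mid-write, the surviving peers still hold a "pending" entry,
# which is what self-heal later uses to pick the good copy and repair the stale one.
mirrors = [Replica(), Replica()]
afr_write(mirrors, "/vol/file.txt", b"hello")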
DHT is the elastic hashing algorithm described above: it distributes data by hashing, and the namespace is spread across all nodes. A file is located solely through the elastic hashing algorithm, without relying on the namespace. Traversing a directory, however, is more complex and inefficient because all storage nodes must be searched. A single file is scheduled onto exactly one storage node, so once the file is located, reads and writes are relatively straightforward. DHT provides no fault tolerance on its own and requires AFR for high availability, as in the application case shown in Figure 5.
Stripe is equivalent to RAID0, i.e., sharded storage: files are divided into fixed-length data shards that are stored across all storage nodes in round-robin fashion. All of Stripe's storage nodes together make up the complete namespace, so looking up a file requires querying every node, which is very inefficient. Reads and writes involve all of the shard storage nodes and can be executed concurrently across them, giving high performance. Stripe is typically combined with AFR to form RAID10/RAID01, obtaining both high performance and high availability, at the cost of storage utilization below 50%.
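A short sketch of the round-robin shard placement, assuming a 128 KB stripe size (the actual size is configurable): the shard number of a byte offset determines which subvolume holds it.

STRIPE_SIZE = 128 * 1024                              # assumed shard size (128 KB)

def shard_location(offset: int, subvolumes: list):
    """Return which shard, and which subvolume, contains this byte offset."""
    shard = offset // STRIPE_SIZE                     # shard number within the file
    return shard, subvolumes[shard % len(subvolumes)] # shards rotate round-robin across bricks

bricks = ["brick1", "brick2", "brick3"]
for off in (0, 130_000, 300_000, 500_000):
    print(off, "-> shard", *shard_location(off, bricks))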
Figure 5: GlusterFS application case: AFR+DHT
7. Design Discussion
GlusterFS is a highly scalable, high-performance, highly available, elastically scalable (scale-out) distributed file system with a very distinctive architecture, including its no-metadata-server design and its stackable architecture. Storage applications, however, are complex and varied; GlusterFS cannot satisfy every storage requirement, and its design and implementation have shortcomings worth weighing. Below is a brief analysis.
• No Metadata Server vs. Metadata Server
The benefit of the no-metadata-server design is that there is no single point of failure or performance bottleneck, which improves system scalability, performance, reliability, and stability. For workloads with huge numbers of small files, this design effectively removes the metadata problem at the distributed layer. Its downsides are that data consistency becomes more complex, directory traversal is inefficient, and global monitoring and management functions are lacking. It also pushes more work onto the client, such as file location, namespace caching, and logical-volume view maintenance, which increases client load and consumes a considerable amount of CPU and memory.
• User Space vs. Kernel Space
A user-space implementation is much simpler, demands less of developers, and is relatively safe to run, but it is less efficient: data must cross between user space and kernel space several times, and because GlusterFS uses FUSE to provide the standard file system interface, there is additional performance loss. A kernel-space implementation can achieve higher data throughput; its drawbacks are that implementation and debugging are very difficult, program errors often crash the system, and safety is lower. With respect to vertical scaling, kernel space is superior to user space, although GlusterFS does have scale-up capability.
• Stackable vs. Non-Stackable
This is somewhat like the operating-system debate between microkernel and monolithic kernel designs. GlusterFS's stackable design is derived from the GNU/Hurd microkernel operating system: it offers strong extensibility and greatly reduces design complexity, since basic functional modules can simply be stacked and combined into powerful functionality. Looking at a GlusterFS volume configuration file, the translator function tree is often around ten layers deep, with each layer calling the next, so the efficiency cost is evident. A non-stackable design is analogous to Linux's monolithic kernel, where system calls are implemented via interrupts and are very efficient; its problems are a bloated system core, complex implementation and extension, and difficult debugging.
• Native Storage Format vs. Private Storage Format
GlusterFS stores files and data shards in native format, so they can be accessed directly with standard tools; data interoperability is good, and migration and data management are very convenient. However, data security becomes a problem: because data is stored in plain form, anyone who can reach the data can copy and view it directly. This is clearly unacceptable for many applications, such as cloud storage systems where users are particularly concerned about data security, and it is an important factor holding back public cloud storage. A private storage format can keep data secure, so that even leaked data cannot be interpreted. If GlusterFS were to implement its own private format, design, implementation, and data management would become more complex, and performance would also suffer somewhat.
• Large Files vs. Small Files
Is GlusterFS better suited to large files or to small files? The elastic hashing algorithm and the Stripe data distribution strategy remove metadata dependencies, optimize data distribution, and increase data access parallelism, which can greatly improve large-file storage performance. For small files, the no-metadata-service design removes the metadata problem at the distributed layer. However, GlusterFS does no further I/O optimization: masses of small files still end up on the storage server's underlying local file system, where local metadata access becomes the bottleneck and data distribution and parallelism cannot fully take effect. GlusterFS is therefore well suited to storing large files; small-file performance is poor, and there is considerable room for optimization.
• Availability vs. Storage Utilization
GlusterFS uses replication to provide high data availability; the number of replicas is unlimited, and automatic repair is built on replication. Availability and storage utilization pull against each other: higher availability means lower storage utilization, and vice versa. With replication, storage utilization is 1/(number of replicas): 50% for mirroring and only 33% for 3-way replication. There are in fact ways to improve availability and storage utilization at the same time, such as RAID5 with utilization (n-1)/n, RAID6 with (n-2)/n, and erasure coding, which can offer even higher utilization; but you cannot have it both ways, as these approaches carry a greater performance cost.
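For concreteness, a few lines of arithmetic comparing those schemes on a hypothetical 12-brick pool; the erasure-code parameters are purely illustrative.

bricks = 12

schemes = {
    "2-way replication (mirror)": 1 / 2,             # 1 / replica count
    "3-way replication":          1 / 3,
    "RAID5-style (n-1)/n":        (bricks - 1) / bricks,
    "RAID6-style (n-2)/n":        (bricks - 2) / bricks,
    "erasure code 8+4 (example)": 8 / 12,             # data fragments / total fragments
}

for name, utilization in schemes.items():
    print(f"{name:28s} usable capacity: {utilization:.0%} of raw")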
In addition, the current GlusterFS code is not polished enough, the system is not very stable, and the number of bugs is relatively high. Judging from the deployments listed on its official website, there are many test users but few production deployments; installations in the TB to tens-of-TB range account for a large proportion, and hundreds-of-TB to PB-scale cases are very rare. This also suggests that GlusterFS is not yet stable enough and needs more time to be proven. Even so, GlusterFS is a cluster file system with a bright future, and its linear scale-out capability gives it an inherent advantage, especially for cloud storage systems.