One, GlusterFS overview
1.1 Introduction to GlusterFS
GlusterFS is an open-source distributed file system. It consists mainly of a storage server (Brick Server), a client, and an optional NFS/Samba storage gateway that can be used as needed. It scales out strongly when storing data and can support PB-scale capacity by adding nodes.
GlusterFS uses TCP/IP or InfiniBand RDMA networks to pool scattered storage resources, provide storage services in a unified manner, and manage data under a single global namespace.
Advantages compared with traditional distributed file systems
Most traditional distributed file systems store metadata on a dedicated metadata server, which holds the directory information and directory structure of the storage nodes. This design is very efficient for browsing directories, but it also has shortcomings, such as a single point of failure: once the metadata server fails, the entire storage system collapses, even if the storage nodes themselves are highly redundant.
The GlusterFS distributed file system is based on a design without a metadata server, which gives it strong horizontal scalability, high reliability, and high storage efficiency. GlusterFS supports TCP/IP and InfiniBand RDMA high-speed network interconnects. Clients can access data through the native GlusterFS protocol, while terminals that do not run the GlusterFS client can access data through the storage gateway using the standard NFS/CIFS protocols.
1.2 Features of GlusterFS
Scalability and high performance
The Scale-Out architecture improves storage capacity and performance by adding storage nodes (disk, computing, and I/O resources can all be increased independently).
The Gluster Elastic Hash removes GlusterFS's dependence on a metadata server. GlusterFS uses an elastic hash algorithm to locate data in the storage pool: it can intelligently locate any data shard (shards are stored on different nodes) without consulting an index or querying a metadata server. This mechanism enables horizontal scaling of storage, eliminates the single point of failure and the performance bottleneck, and achieves truly parallel data access.
High availability
By configuring certain volume types, GlusterFS can automatically replicate files (similar to RAID 1), so data access is unaffected even if a node fails. When data becomes inconsistent, the automatic repair function restores it to the correct state; repair runs incrementally in the background and does not consume excessive system resources.
Global unified namespace
The global unified namespace gathers all storage resources into a single virtual storage pool, shielding users and applications from the details of the physical storage. Storage resources (similar to LVM) can be elastically expanded or shrunk as the environment requires. In a multi-node scenario, the global unified namespace can also balance load across the nodes, which greatly improves access efficiency.
Flexible volume management
GlusterFS stores data in logical volumes, which are logically partitioned from logical storage pools. Logical storage pools can be added and removed online without interrupting business.
Based on standard protocol
Gluster storage service supports NFS, CIFS, HTTP, FTP, SMB and Gluster native protocols, and is fully compatible with POSIX standards. Existing applications can access data in Gluster without any modification, and can also use dedicated APIs (more efficient), which is very useful when deploying Gluster in a public cloud environment.
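As an illustration of protocol access, the commands below are a minimal sketch of mounting the same volume over the native protocol and over NFS. The volume name gv0 and the server name node1 are illustrative, and the NFS mount assumes an NFS service (the legacy built-in gNFS server or NFS-Ganesha, depending on the GlusterFS release) is enabled for the volume.

# native GlusterFS protocol (requires the glusterfs-fuse client package on the mounting host)
mount -t glusterfs node1:/gv0 /mnt/gluster-native

# standard NFSv3 through the storage gateway, no GlusterFS client needed
showmount -e node1                              # list exports, if the NFS service is reachable
mount -t nfs -o vers=3 node1:/gv0 /mnt/gluster-nfs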
1.3 GlusterFS terminology
Brick (storage block): the basic storage unit in GlusterFS; a dedicated partition that a host in the trusted host pool provides for physical storage, exposed externally as a storage directory on a server in the trusted storage pool.
Volume (logical volume): A logical volume is a collection of bricks. A volume is a logical device for data storage, similar to a logical volume in LVM. Most Gluster management operations are performed on volumes.
FUSE (Filesystem in Userspace): a kernel module that allows users to create their own file systems in user space without modifying kernel code.
VFS (Virtual File System): the interface the kernel provides to user space for accessing storage.
Glusterd (background management process): the Gluster management daemon, which runs on every node in the storage cluster.
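To make the terms concrete, here is a minimal, hypothetical sequence showing how bricks, a volume, and glusterd relate. The host names node1/node2 and the brick directory /data/brick1 are examples only, and glusterd must already be running on both nodes:

gluster peer probe node2                                            # add node2 to the trusted storage pool
gluster volume create gv0 node1:/data/brick1 node2:/data/brick1     # a volume built from two bricks (server:directory)
gluster volume start gv0                                            # start the volume so clients can mount it
gluster volume info gv0                                             # shows the volume type and its bricks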
1.4 Modular Stacked Architecture
GlusterFS adopts a modular and stacked architecture, and can configure a customized application environment according to requirements. By combining various modules, complex functions can be realized.
For example, the Replicate module can implement RAID 1 and the Stripe module can implement RAID 0; combining the two yields RAID 10 or RAID 01, giving both higher performance and higher reliability.
GlusterFS uses a modular, stacked architecture design; each module is called a Translator. Translators are a powerful mechanism provided by GlusterFS: through this well-defined interface, file system functionality can be extended easily and efficiently.
The server and client designs are highly modular and their module interfaces are compatible, so the same translator can be loaded on both the client and the server.
All functions in GlusterFS are implemented through translators. The client side is more complex than the server side, so most of the functionality is concentrated in the client.
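In practice, the translator stack is tuned through volume options rather than edited directly. A small, hedged example follows; the option names are standard gluster options but their availability and defaults vary by release, and gv0 is an example volume name:

gluster volume set gv0 performance.cache-size 256MB      # io-cache translator: size of the read cache
gluster volume set gv0 performance.io-thread-count 32    # io-threads translator: worker threads on the brick side
gluster volume get gv0 all                               # list the options currently applied to the volume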
Two, GlusterFS working principle
2.1 GlusterFS workflow
The workflow of GlusterFS is as follows:
Clients or applications access data through the mount point of GlusterFS.
The Linux kernel receives the request and processes it through the VFS API.
VFS hands the request to the FUSE kernel module, which registers an actual file system (FUSE) with the system; the FUSE module then passes the data to the GlusterFS client through the /dev/fuse device file. The FUSE file system can be understood as a proxy.
After receiving the data, the GlusterFS client processes it according to its configuration file.
The processed data is then transmitted over the network to the remote GlusterFS server and written to the server's storage device.
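A minimal sketch of this path from the client side, assuming an example volume gv0 exported by node1 and a client machine with the glusterfs-fuse package installed:

mount -t glusterfs node1:/gv0 /mnt/gluster   # the mount starts a glusterfs client process that talks to /dev/fuse
df -hT /mnt/gluster                          # the filesystem type is reported as fuse.glusterfs
# for a persistent mount, an fstab entry along these lines is commonly used:
# node1:/gv0  /mnt/gluster  glusterfs  defaults,_netdev  0 0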
2.2 Elastic HASH algorithm
The elastic HASH algorithm is a concrete implementation of the Davies-Meyer algorithm. The HASH algorithm produces a hash value in a 32-bit integer range. Assuming the logical volume contains N storage units (bricks), the 32-bit integer range is divided into N consecutive subspaces, each corresponding to one brick. When a user or application accesses a namespace, the HASH value of that namespace is calculated, and the brick holding the data is located according to the 32-bit integer subspace that the HASH value falls into.
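As an illustrative example (the numbers are made up for clarity): with N = 4 bricks, the 32-bit range from 0 to 2^32 - 1 is split into four equal subspaces of 2^30 values each, so the first brick covers [0, 2^30), the second covers [2^30, 2^31), and so on. A file whose name hashes to 0x60000000 (about 1.61 billion) falls into the second subspace and is therefore stored on the second brick, with no index check or metadata lookup required.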
2.3 Volume types of GlusterFS
GlusterFS supports seven types of volumes: distributed volumes, striped volumes, replicated volumes, distributed striped volumes, distributed replicated volumes, striped replicated volumes, and distributed striped replicated volumes. These seven volume types can satisfy different applications' requirements for high performance and high availability.
Distribute volume: files are distributed across all Brick Servers by the HASH algorithm. This volume type is the basis of GlusterFS. Because files are simply hashed to different bricks, it only expands disk space; if a disk is damaged, its data is lost. It is the file-level equivalent of RAID 0 and has no fault tolerance.
Stripe volume: similar to RAID 0, files are split into data blocks and distributed to multiple Brick Servers in a round-robin manner. Storage is block-based and supports large files; the larger the file, the higher the read efficiency.
Replica volume: files are synchronized to multiple bricks so that several copies exist. It is the file-level equivalent of RAID 1 and has fault tolerance. Because copies of the data reside on multiple bricks, read performance is greatly improved, but write performance drops since every write must go to all copies.
Distribute Stripe volume: the number of Brick Servers is a multiple of the stripe count (the number of bricks a file's data blocks are spread across), combining the characteristics of distributed and striped volumes.
Distribute Replica volume: the number of Brick Servers is a multiple of the replica count (the number of data copies), combining the characteristics of distributed and replicated volumes.
Stripe Replica volume: similar to RAID 10, combining the characteristics of striped and replicated volumes.
Distribute Stripe Replica volume: a composite of the three basic volume types, usually used in MapReduce-like applications.
2.3.1 Distributed Volume
A distributed volume is the default volume type in GlusterFS; when creating a volume, a distributed volume is created unless another type is specified. In this mode, each file is stored whole on a single server node, using the local file system directly for storage (a creation example is shown after the feature list below).
The file is not divided into blocks
The HASH value is saved in extended file attributes
The supported underlying file systems are EXT3, EXT4, ZFS, XFS, etc.
Features:
The files are distributed on different servers without redundancy.
Expand the size of a volume more easily and cheaply.
A single point of failure can cause data loss.
Rely on the underlying data protection.
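A minimal sketch of creating a distributed volume, assuming two example nodes node1 and node2 that are already in the same trusted pool, each exporting a hypothetical brick directory /data/brick1:

gluster volume create dis-vol node1:/data/brick1 node2:/data/brick1   # no type keyword, so the default (distributed) volume is created
gluster volume start dis-vol
gluster volume info dis-vol                                           # Type should be reported as Distribute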
2.3.2 Striped Volume
Stripe mode is equivalent to RAID 0. In this mode, a file is divided into N blocks (for N stripe nodes) according to offset and stored on the Brick Server nodes in a round-robin manner. Each node stores its blocks as ordinary files in the local file system and records the total number of blocks (Stripe-count) and the sequence number of each block (Stripe-index) in extended attributes (a creation example is shown after the feature list below).
Features:
The data is divided into smaller blocks and distributed across the stripes in the brick server group.
Distribution reduces the load, and the smaller pieces speed up access.
There is no data redundancy.
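A hedged sketch of creating a striped volume with two stripes (node names and brick paths are examples; note that striping was deprecated and later removed in recent GlusterFS releases, so this applies to older versions):

gluster volume create stripe-vol stripe 2 node1:/data/brick1 node2:/data/brick1
gluster volume start stripe-vol
gluster volume info stripe-vol    # Type should be reported as Stripe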
2.3.3 Replicated Volume
Replication mode, also known as AFR (Automatic File Replication), is equivalent to RAID 1: one or more copies of the same file are kept, and every node stores the same content and directory structure. If the nodes have different amounts of storage space, the capacity of the smallest node is taken as the total capacity of the volume (the barrel effect). A creation example is shown after the feature list below.
Features:
All servers in the volume keep a complete copy.
The number of copies in the volume is decided by the user when it is created.
At least two brick servers (or more) are required.
With redundancy.
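A minimal sketch of creating a two-copy replicated volume (example node names and brick paths; newer releases warn that replica 2 is prone to split-brain and suggest replica 3 or an arbiter instead):

gluster volume create rep-vol replica 2 node1:/data/brick1 node2:/data/brick1
gluster volume start rep-vol
gluster volume info rep-vol    # Type should be reported as Replicate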
2.3.4 Distributed striped volume
Distributed striped volumes combine the functions of distributed volumes and striped volumes and are mainly used for access to large files. Creating a distributed striped volume requires at least 4 servers.
When creating a volume, if the number of storage servers equals the stripe or replica count, the volume created is a striped or replicated volume; if the number of storage servers is two or more times the stripe or replica count, a distributed striped volume or distributed replicated volume is created.
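A hedged sketch with four example bricks and a stripe count of 2, which yields a distributed striped (2 x 2) volume on older GlusterFS releases that still support striping:

gluster volume create dis-stripe stripe 2 node1:/data/brick1 node2:/data/brick1 node3:/data/brick1 node4:/data/brick1
gluster volume start dis-stripe
gluster volume info dis-stripe    # Type should be reported as Distributed-Stripe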
2.3.5 Distributed replication volume
Distributed replicated volumes combine the functions of distributed volumes and replicated volumes and are mainly used when redundancy is required. They are generally suitable for scenarios with high storage-security requirements, such as banking.
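A minimal sketch with four example bricks and a replica count of 2, which yields a distributed replicated (2 x 2) volume; bricks are grouped into replica sets in the order they are listed on the command line:

gluster volume create dis-rep replica 2 node1:/data/brick1 node2:/data/brick1 node3:/data/brick1 node4:/data/brick1
gluster volume start dis-rep
gluster volume info dis-rep    # Type should be reported as Distributed-Replicate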