Globally Unified Namespace
GlusterFS adopts a globally unified namespace design that aggregates disk and memory resources into a single virtual storage pool. Within this namespace, application data is accessed through standard protocols such as NFS and CIFS. Unlike many other distributed file systems, GlusterFS has no dedicated metadata server. Instead, it uses a metadata-server-free design and locates files algorithmically; metadata and data are not separated but are stored together. Data access is therefore fully parallel, which yields truly linear performance scaling. The absence of a metadata server also greatly improves the performance, reliability, and stability of GlusterFS.
GlusterFS has a modular, stackable architecture in which each module is called a translator. Translators are a powerful file system extension mechanism provided by GlusterFS, a design concept borrowed from the GNU/Hurd microkernel operating system. All GlusterFS functionality is implemented through translators, and this well-defined interface makes it easy to extend the file system efficiently. In GlusterFS, each translator presents its own global namespace and maintains and manages it independently.
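To make the stacking idea concrete, here is a minimal sketch of layered modules that forward an operation downward through a stack. The names (layer_t, op_open, passthrough_open) are hypothetical; this is not the real xlator API, only an illustration of how stacked translators intercept and pass on requests.

```c
/* Minimal sketch of the translator idea: each layer exposes the same
 * operation table and forwards calls to the layer below it. The names
 * (layer_t, op_open, ...) are hypothetical, not the real xlator_t API. */
#include <stdio.h>

typedef struct layer {
    const char   *name;
    int         (*op_open)(struct layer *self, const char *path);
    struct layer *child;               /* next translator down the stack */
} layer_t;

/* bottom of the stack: pretend to talk to the storage brick */
static int storage_open(layer_t *self, const char *path)
{
    printf("[%s] open %s on disk\n", self->name, path);
    return 0;
}

/* a pass-through translator that could add caching, locking, ... */
static int passthrough_open(layer_t *self, const char *path)
{
    printf("[%s] intercept open %s\n", self->name, path);
    return self->child->op_open(self->child, path);   /* forward downward */
}

int main(void)
{
    layer_t storage = { "posix-like", storage_open,     NULL };
    layer_t cache   = { "cache-like", passthrough_open, &storage };
    layer_t top     = { "mount-like", passthrough_open, &cache };

    /* a request entering at the top flows through every translator */
    return top.op_open(&top, "/data/file.txt");
}
```

In the real system, caching, locking, and the cluster translators described below are composed into exactly this kind of stack.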
Three Cluster Modes
GlusterFS is a cluster file system with three basic cluster modes: the distributed cluster, the stripe cluster, and the replica cluster. These basic modes can also be combined into more complex composite clusters, such as the distributed stripe cluster, the distributed replica cluster, the RAID10-like cluster (stripe replica cluster), and the distributed RAID10-like cluster (distributed stripe replica cluster).
Each of the three basic clusters is implemented by a translator with its own namespace. In a distributed cluster, files are distributed across the cluster nodes by a hash algorithm; the namespaces of the individual nodes do not overlap, all nodes together form the complete namespace, and the hash algorithm is used to locate files on access. A replica cluster is similar to RAID1: all nodes hold the same namespace, each node by itself represents the complete namespace, and any node can be selected for access. A stripe cluster is similar to RAID0: all nodes hold the same namespace but with different object attributes, files are split into data blocks and distributed to all nodes in round-robin fashion, and all nodes must be contacted on access to obtain complete name information. For VFS, lookup, stat, and readdir are the three namespace-related operations; the three cluster types handle them as follows (a small sketch contrasting the three lookup strategies follows the list).
(1) Distributed Cluster
. Lookup: select a node using the hash algorithm; if lookup on that node fails and the target is a directory, all nodes under the volume are queried; if the file is not found and search_unhashed is set, all nodes are traversed;
. Stat: requests are sent to all nodes; for a directory, the attributes must be aggregated;
. Readdir: query all nodes and aggregate the directory entries and attributes;
(2) Stripe Cluster
. Lookup: all nodes must be queried, for attribute aggregation, gfid checking, and self-heal;
. Stat: query all nodes and aggregate the information;
. Readdir: query the first node for directory entries; attribute information must be aggregated from all nodes;
(3) Replica Cluster
. Lookup: requests are sent to all nodes. The first successful response is returned;
. Stat: query the selected up node. If the node fails, query the next node in sequence;
. Readdir: Same as Stat. Select an up node for query. If it fails, query the next up node in sequence;
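As a rough comparison of how the three cluster types fan out a lookup, here is a minimal sketch; the node states, the hash function, and the failure handling are made up for illustration and omit the fallback behaviour described above.

```c
/* Sketch contrasting the three lookup fan-out strategies described above. */
#include <stdio.h>

#define NODES 3
static int node_up[NODES] = { 1, 0, 1 };   /* pretend node 1 is down */

/* distributed: hash the name to exactly one node */
static int lookup_distributed(const char *name)
{
    unsigned h = 0;
    for (const char *p = name; *p; p++) h = h * 31 + (unsigned char)*p;
    int n = (int)(h % NODES);
    return node_up[n] ? n : -1;            /* real DHT would then fan out */
}

/* stripe: every node must answer, otherwise the file is unusable */
static int lookup_stripe(void)
{
    for (int n = 0; n < NODES; n++)
        if (!node_up[n]) return -1;
    return 0;
}

/* replica: ask all nodes, the first successful answer wins */
static int lookup_replica(void)
{
    for (int n = 0; n < NODES; n++)
        if (node_up[n]) return n;
    return -1;
}

int main(void)
{
    printf("distributed -> node %d\n", lookup_distributed("file.txt"));
    printf("stripe      -> %s\n", lookup_stripe() == 0 ? "ok" : "fail");
    printf("replica     -> node %d\n", lookup_replica());
    return 0;
}
```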
Distributed Cluster
This cluster type is also called an elastic hash volume. It locates data algorithmically, which is the foundation and biggest feature of the entire GlusterFS architecture: any server or client in the cluster can locate, read, and write data based only on the path and file name, so file location is independent and fully parallel. In GlusterFS, metadata does not need to be separated from data; file metadata is recorded in the inode and the extended attributes of the underlying file system. GlusterFS hashes on a per-directory basis: the parent directory of a file records the subvolume mapping information in its extended attributes, and its files are distributed among the storage servers assigned to that parent directory. Because a directory stores its distribution information in advance, adding a new node does not affect the placement of existing files; the new node participates in storage distribution scheduling only for newly created directories.
. Lookup: locate the file using the hash algorithm (see the sketch after this list)
1) If the path is the root directory, the first up volume (dht_first_up_subvol) is selected as the target volume;
2) Otherwise, compute the hash value of the name (dht_hash_compute), obtain the hash distribution information from the extended attributes of the parent directory, and locate the target volume;
3) If a target volume is found, look up the path on it; if the path is not found there and gf_dht_lookup_unhashed_on/gf_dht_lookup_unhashed_auto is set, all volumes are searched;
4) If no target volume is found and the path is a directory, search for it on all volumes;
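A minimal sketch of the hash-range location step described in 2) and 3): the parent directory's layout assigns each subvolume a slice of the 32-bit hash space, and the hash of the file name selects the slice. The hash function, the layout values, and the names (range_t, hashed_subvol) are illustrative assumptions, not the actual dht_hash_compute implementation.

```c
/* Sketch of hash-range based location: the directory layout maps contiguous
 * 32-bit hash ranges to subvolumes, and the file name's hash picks one. */
#include <stdio.h>
#include <stdint.h>

typedef struct { const char *subvol; uint32_t start, stop; } range_t;

/* toy 32-bit string hash standing in for the real hash function */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) { h ^= (uint8_t)*name; h *= 16777619u; }
    return h;
}

static const char *hashed_subvol(const range_t *layout, int n, const char *name)
{
    uint32_t h = name_hash(name);
    for (int i = 0; i < n; i++)
        if (h >= layout[i].start && h <= layout[i].stop)
            return layout[i].subvol;
    return NULL;                      /* hole in the layout: caller fans out */
}

int main(void)
{
    /* layout as it would be read from the parent directory's xattrs */
    range_t layout[] = {
        { "brick-0", 0x00000000u, 0x7fffffffu },
        { "brick-1", 0x80000000u, 0xffffffffu },
    };
    const char *files[] = { "a.log", "b.log", "photo.jpg" };
    for (int i = 0; i < 3; i++)
        printf("%-9s -> %s\n", files[i], hashed_subvol(layout, 2, files[i]));
    return 0;
}
```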
. Mkdir: the directory is created on all subvolumes; newly added nodes join the distribution and are assigned hash ranges (see the sketch after this list)
1) First create the directory on the hashed target volume (hashed_subvol);
2) Send the directory-creation request to all other subvolumes;
3) Use the self-heal mechanism (dht_selfheal_new_directory) to allocate the hash ranges;
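A minimal sketch of the range-assignment step in 3), assuming the 32-bit hash space is simply split evenly among the subvolumes that now hold the new directory; the real dht_selfheal_new_directory logic may weight the split differently.

```c
/* Sketch of hash-range assignment for a new directory: divide the 32-bit
 * hash space into one contiguous slice per subvolume. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int n = 4;                       /* subvolumes holding the directory */
    uint64_t span  = 0x100000000ull;       /* size of the 32-bit hash space    */
    uint64_t chunk = span / n;

    for (int i = 0; i < n; i++) {
        uint64_t start = i * chunk;
        uint64_t stop  = (i == n - 1) ? span - 1 : start + chunk - 1;
        printf("brick-%d: 0x%08llx - 0x%08llx\n", i,
               (unsigned long long)start, (unsigned long long)stop);
    }
    return 0;
}
```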
. Create: files are placed on the subvolumes recorded in the parent directory's distribution; newly added nodes do not participate (see the sketch after this list).
1) Compute the hash value of the file name and find the target volume; if none is found, return an error;
2) If the capacity used on the target volume is below the predefined watermark, create the file there and return;
3) Otherwise, find a subvolume whose usage is below the watermark, create the file on it, and create a link on the target volume pointing to the actual file;
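A minimal sketch of the watermark logic in 2) and 3), assuming a simple percentage-used threshold; the WATERMARK value, the usage figures, and the subvolume names are made up for illustration.

```c
/* Sketch of the create-time watermark check: if the hashed subvolume is
 * already too full, the file body goes to a less loaded subvolume and the
 * hashed subvolume only keeps a link entry pointing to it. */
#include <stdio.h>

#define WATERMARK 90                      /* percent used considered "full" */

typedef struct { const char *name; int used_pct; } subvol_t;

static subvol_t vols[] = {
    { "brick-0", 95 },                    /* hashed target, nearly full     */
    { "brick-1", 40 },
    { "brick-2", 70 },
};

int main(void)
{
    int hashed = 0;                       /* index chosen by the hash step  */

    if (vols[hashed].used_pct < WATERMARK) {
        printf("create file on %s\n", vols[hashed].name);
        return 0;
    }
    /* hashed subvolume above the watermark: pick another one with room    */
    for (int i = 0; i < 3; i++) {
        if (i != hashed && vols[i].used_pct < WATERMARK) {
            printf("create file on %s, link on %s pointing to it\n",
                   vols[i].name, vols[hashed].name);
            return 0;
        }
    }
    printf("no subvolume below the watermark\n");
    return 1;
}
```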
Stripe Cluster
The namespace is formed by all subvolumes together. During lookup, requests are sent to all nodes, and attribute retrieval requires aggregation. If any node fails, neither the namespace nor the data is accessible.
. Lookup: requests are sent to all nodes and the client aggregates the responses; the result is returned upward only if all requests succeed;
. Create/Mkdir: requests are sent to all nodes; all nodes hold the same namespace, but their attributes differ slightly;
. Readdir: request all directory entries from the first node; the attributes must be queried from all nodes and aggregated;
. Readv: compute the nodes holding the data from the stripe size and the round-robin algorithm, send the requests in parallel, and let the client reassemble the data (see the placement sketch after this list);
. Writev: compute the node and offset of each written data block; success is returned upward only if all blocks are written successfully;
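A minimal sketch of the placement math behind Readv/Writev, assuming a fixed stripe size and plain round-robin distribution of blocks: block k of a file lands on node k mod N at offset (k / N) * stripe_size within that node's copy. The stripe size and node count are illustrative.

```c
/* Sketch of stripe placement: map file offsets to (node, local offset). */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t stripe_size = 128 * 1024;   /* 128 KiB stripes */
    const int      nodes       = 3;

    /* map the first few file offsets to (node, local offset) pairs */
    for (uint64_t off = 0; off < 6 * stripe_size; off += stripe_size) {
        uint64_t block       = off / stripe_size;
        int      node        = (int)(block % nodes);       /* round robin */
        uint64_t node_offset = (block / nodes) * stripe_size;
        printf("file offset %8llu -> node %d, offset %8llu\n",
               (unsigned long long)off, node,
               (unsigned long long)node_offset);
    }
    return 0;
}
```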
Replica Cluster
All nodes hold the same, fully equivalent namespace. By default, the first node acts as the master node. Requests are usually sent to all nodes, and success is returned upward only after all of them respond successfully.
. Lookup: the request is sent to all up nodes, and the first successful response is returned upward; if self-heal is required, the change log recorded in the xattrs is used;
. Create/Mkdir: processed as a transaction (afr_transaction); the request is sent to all up subvolumes, and success is returned upward only after all of them respond successfully;
. Readdir: select one up node to send the request to, which also provides load balancing; if it fails, the other nodes are tried in turn;
. Readv: on open, the file is opened on all up nodes; on read, one up node is selected, and if it fails, the other nodes are tried in sequence (see the sketch after this list);
. Writev: processed as a transaction (afr_transaction); the request is sent to all up subvolumes, and success is returned upward only when all of them respond successfully;
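A minimal sketch of the read-side failover described for Readdir/Readv: pick an up node, and if the request fails, fall back to the next up node. The node states and the simulated failure are made up; the real selection also takes the load-balancing policy into account.

```c
/* Sketch of replica read failover: try up nodes in order until one serves
 * the read, or fail when every up node has been tried. */
#include <stdio.h>

#define NODES 3
static int node_up[NODES]    = { 1, 1, 1 };
static int read_fails[NODES] = { 1, 0, 0 };   /* pretend node 0's read fails */

/* returns the node that served the read, or -1 if every up node failed */
static int replica_read(void)
{
    for (int n = 0; n < NODES; n++) {
        if (!node_up[n])
            continue;                         /* skip down nodes */
        if (!read_fails[n]) {
            printf("read served by node %d\n", n);
            return n;
        }
        printf("read on node %d failed, trying next up node\n", n);
    }
    return -1;
}

int main(void)
{
    return replica_read() >= 0 ? 0 : 1;
}
```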