Designing a new xlator to extend GlusterFS

1. GlusterFS overview

GlusterFS is an open-source distributed file system with strong scale-out capability. It supports petabytes (PB) of storage capacity and thousands of clients. GlusterFS aggregates physically distributed storage resources over TCP/IP or InfiniBand RDMA networks and manages the data under a single global namespace. Thanks to its stackable user-space design, GlusterFS delivers good performance across a variety of workloads.

GlusterFS supports standard clients running standard applications on any standard IP network. Applications can access data in the globally unified namespace through standard protocols such as the native GlusterFS protocol, NFS, and CIFS. GlusterFS lets you move away from isolated, expensive, closed storage systems and instead use ordinary low-cost storage devices to deploy storage pools that are centrally managed, horizontally scalable, and virtualized, with capacity that can grow to the TB or PB level. For more information about GlusterFS, see "glusterfs Cluster File System Research" [11]. GlusterFS has the following main features:

1) Scalability and high performance
2) High availability
3) Globally unified namespace
4) Elastic hash algorithm
5) Elastic volume management
6) Based on standard protocols

2. How xlator works

GlusterFS adopts a modular, stackable architecture and supports highly customized application environments through flexible configuration, for example large-file storage, storage of massive numbers of small files, cloud storage, and multi-protocol access. Each function is implemented as a module, and modules are combined like building blocks to implement complex functions. For example, the replicate module implements RAID 1, the stripe module implements RAID 0, and combining the two yields RAID 10 or RAID 01, giving both high performance and high reliability.

The stackable design of GlusterFS originated in the GNU/Hurd microkernel operating system. It makes the system highly extensible and greatly reduces the complexity of design and implementation, because stacking basic functional modules is enough to build powerful features. The basic module is called a translator (xlator). It is the file system extension mechanism that GlusterFS provides: through its well-defined interface, you can extend the functions of the file system easily and efficiently.

All GlusterFS functions are implemented through the translator mechanism. The server-side and client-side module interfaces are compatible, so the same translator can be loaded on either side. Each translator is a shared object (.so) dynamic library that is loaded at run time according to the configuration. Each module implements one specific basic function and belongs to a category such as cluster, storage, performance, protocol, or features. Simple modules are stacked to implement complex functions, and a translator can convert requests from an external system into the appropriate calls on the target system. Most modules run on the client, such as the synthesizer, the I/O scheduler, and the performance optimizations; the server is much simpler. Both the client and the storage server have their own stacks, each forming a tree of translators built from several modules. The modular, stackable architecture greatly reduces design complexity and simplifies implementation, upgrade, and maintenance.
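The stacking idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the types and functions (toy_xlator_t, toy_rot13_writev, toy_posix_writev, toy_stack_demo) are hypothetical and are not part of the GlusterFS API. It shows a transforming layer that delegates to a storage layer beneath it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: these types are hypothetical, not the GlusterFS API. */
typedef struct toy_xlator toy_xlator_t;

struct toy_xlator {
        const char   *name;
        toy_xlator_t *child;   /* next translator down the stack, or NULL */
        /* Each layer handles a write and may delegate to its child. */
        size_t (*writev) (toy_xlator_t *this, char *buf, size_t len);
};

static char   toy_disk[64];    /* stands in for the brick's local file system */
static size_t toy_disk_len;

/* Bottom layer (plays the role of storage/posix): stores the bytes. */
static size_t
toy_posix_writev (toy_xlator_t *this, char *buf, size_t len)
{
        (void) this;
        if (len > sizeof (toy_disk))
                len = sizeof (toy_disk);
        memcpy (toy_disk, buf, len);
        toy_disk_len = len;
        return len;
}

/* Upper layer: transforms the data (ROT-13), then passes it down. */
static size_t
toy_rot13_writev (toy_xlator_t *this, char *buf, size_t len)
{
        size_t i;

        assert (this->child != NULL);
        for (i = 0; i < len; i++) {
                char c = buf[i];
                if (c >= 'a' && c <= 'z')
                        buf[i] = (char) ('a' + (c - 'a' + 13) % 26);
                else if (c >= 'A' && c <= 'Z')
                        buf[i] = (char) ('A' + (c - 'A' + 13) % 26);
        }
        return this->child->writev (this->child, buf, len);
}

/* Build a two-layer stack and push one write through it.
 * Returns 0 if the bottom layer received the transformed data. */
static int
toy_stack_demo (void)
{
        toy_xlator_t posix  = { "toy-posix", NULL, toy_posix_writev };
        toy_xlator_t rot13  = { "toy-rot13", &posix, toy_rot13_writev };
        char         data[] = "Hello";
        size_t       n      = rot13.writev (&rot13, data, 5);

        if (n != 5 || toy_disk_len != 5)
                return -1;
        return memcmp (toy_disk, "Uryyb", 5) == 0 ? 0 : -1;
}
```

Calling toy_stack_demo() from a main function exercises the whole stack. Adding another layer only requires another toy_xlator_t whose child points at the current top, which is the property the stackable design relies on.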


Figure: Gluster volume translator stack

In GlusterFS terminology, the complete function stack composed of a series of translators is called a volume (see the translator stack diagram above). The local file system allocated to a volume is called a brick, and a brick that has been processed by at least one translator is called a subvolume. The FUSE module sits on the client and the POSIX module sits on the server; each is usually the first or the last module in the volume, depending on the direction of the data flow. Other functional modules are added in between to form a complete volume. The modules are organized as a graph in a multi-layer design: at run time, messages are passed in order by calling the interfaces of adjacent modules, and each module decides which module to call based on its own function and the translator graph. The data flow figure below shows the complete data path of a volume implemented with translators.


Figure: GlusterFS data flow

3. xlator structure and related APIs

An xlator is a highly modular component with a well-defined internal structure, consisting of struct definitions and interface function prototypes. To implement an xlator you must follow these definitions strictly. Specifically, you must provide the structs and function pointers defined in xlator.h: xlator_fops, xlator_cbks, init, fini, and volume_options. They are defined as follows:

struct xlator_fops {
        fop_lookup_t         lookup;
        fop_stat_t           stat;
        fop_fstat_t          fstat;
        fop_truncate_t       truncate;
        fop_ftruncate_t      ftruncate;
        fop_access_t         access;
        fop_readlink_t       readlink;
        fop_mknod_t          mknod;
        fop_mkdir_t          mkdir;
        fop_unlink_t         unlink;
        fop_rmdir_t          rmdir;
        fop_symlink_t        symlink;
        fop_rename_t         rename;
        fop_link_t           link;
        fop_create_t         create;
        fop_open_t           open;
        fop_readv_t          readv;
        fop_writev_t         writev;
        fop_flush_t          flush;
        fop_fsync_t          fsync;
        fop_opendir_t        opendir;
        fop_readdir_t        readdir;
        fop_readdirp_t       readdirp;
        fop_fsyncdir_t       fsyncdir;
        fop_statfs_t         statfs;
        fop_setxattr_t       setxattr;
        fop_getxattr_t       getxattr;
        fop_fsetxattr_t      fsetxattr;
        fop_fgetxattr_t      fgetxattr;
        fop_removexattr_t    removexattr;
        fop_lk_t             lk;
        fop_inodelk_t        inodelk;
        fop_finodelk_t       finodelk;
        fop_entrylk_t        entrylk;
        fop_fentrylk_t       fentrylk;
        fop_rchecksum_t      rchecksum;
        fop_xattrop_t        xattrop;
        fop_fxattrop_t       fxattrop;
        fop_setattr_t        setattr;
        fop_fsetattr_t       fsetattr;
        fop_getspec_t        getspec;

        /* these entries are used for a typechecking hack in STACK_WIND _only_ */
        fop_lookup_cbk_t         lookup_cbk;
        fop_stat_cbk_t           stat_cbk;
        fop_fstat_cbk_t          fstat_cbk;
        fop_truncate_cbk_t       truncate_cbk;
        fop_ftruncate_cbk_t      ftruncate_cbk;
        fop_access_cbk_t         access_cbk;
        fop_readlink_cbk_t       readlink_cbk;
        fop_mknod_cbk_t          mknod_cbk;
        fop_mkdir_cbk_t          mkdir_cbk;
        fop_unlink_cbk_t         unlink_cbk;
        fop_rmdir_cbk_t          rmdir_cbk;
        fop_symlink_cbk_t        symlink_cbk;
        fop_rename_cbk_t         rename_cbk;
        fop_link_cbk_t           link_cbk;
        fop_create_cbk_t         create_cbk;
        fop_open_cbk_t           open_cbk;
        fop_readv_cbk_t          readv_cbk;
        fop_writev_cbk_t         writev_cbk;
        fop_flush_cbk_t          flush_cbk;
        fop_fsync_cbk_t          fsync_cbk;
        fop_opendir_cbk_t        opendir_cbk;
        fop_readdir_cbk_t        readdir_cbk;
        fop_readdirp_cbk_t       readdirp_cbk;
        fop_fsyncdir_cbk_t       fsyncdir_cbk;
        fop_statfs_cbk_t         statfs_cbk;
        fop_setxattr_cbk_t       setxattr_cbk;
        fop_getxattr_cbk_t       getxattr_cbk;
        fop_fsetxattr_cbk_t      fsetxattr_cbk;
        fop_fgetxattr_cbk_t      fgetxattr_cbk;
        fop_removexattr_cbk_t    removexattr_cbk;
        fop_lk_cbk_t             lk_cbk;
        fop_inodelk_cbk_t        inodelk_cbk;
        fop_finodelk_cbk_t       finodelk_cbk;
        fop_entrylk_cbk_t        entrylk_cbk;
        fop_fentrylk_cbk_t       fentrylk_cbk;
        fop_rchecksum_cbk_t      rchecksum_cbk;
        fop_xattrop_cbk_t        xattrop_cbk;
        fop_fxattrop_cbk_t       fxattrop_cbk;
        fop_setattr_cbk_t        setattr_cbk;
        fop_fsetattr_cbk_t       fsetattr_cbk;
        fop_getspec_cbk_t        getspec_cbk;
};

struct xlator_cbks {
        cbk_forget_t    forget;
        cbk_release_t   release;
        cbk_release_t   releasedir;
};

void             (*fini) (xlator_t *this);
int32_t           (*init) (xlator_t *this);

typedef struct volume_options {
        char                *key[ZR_VOLUME_MAX_NUM_KEY];
        /* different key, same meaning */
        volume_option_type_t type;
        int64_t              min;  /* 0 means no range */
        int64_t              max;  /* 0 means no range */
        char                *value[ZR_OPTION_MAX_ARRAY_SIZE];
        /* If specified, will check for one of
           the value from this array */
        char                *default_value;
        char                *description; /* about the key */
} volume_option_t;

The function pointers in the xlator_fops and xlator_cbks structs must match the definitions in xlator.h exactly. xlator_fops is roughly a combination of file_operations, inode_operations, and super_operations in Linux. In addition, the names of the structs and functions above are fixed as fops, cbks, init, fini, and options, and cannot be changed: because an xlator is delivered to the GlusterFS main program as a .so dynamic library, uniform names are needed to load it and to locate its function pointers and variables. init and fini handle the loading and unloading of the xlator, which makes them the natural place to manage each xlator's private data. If the interfaces and parameters of the xlator template do not meet your needs, these two functions can be used for the extra processing. Note that an xlator does not have to implement every function pointer and variable above; it can implement only the parts it cares about. The remaining entries are filled with defaults at run time. A default simply passes the request to the next translator and registers a callback that returns the next translator's result to the previous one.
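The "fill in the defaults" behaviour can be pictured with a small toy model. This is a sketch, not the real loader: the table toy_fops_t and the functions toy_fill_defaults and toy_default_* are hypothetical names invented for illustration, and the defaults here merely return their argument instead of winding to a child.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the real xlator.h definitions. */
typedef int (*toy_fop_t) (int arg);

typedef struct {
        toy_fop_t stat;
        toy_fop_t open;
        toy_fop_t readv;
} toy_fops_t;

/* Pass-through defaults: in this model they just return the argument. */
static int toy_default_stat  (int arg) { return arg; }
static int toy_default_open  (int arg) { return arg; }
static int toy_default_readv (int arg) { return arg; }

/* The only operation this translator customizes. */
static int toy_my_readv (int arg) { return arg * 2; }

/* Replace every entry the translator left NULL with its default. */
static void
toy_fill_defaults (toy_fops_t *fops)
{
        assert (fops != NULL);
        if (!fops->stat)
                fops->stat = toy_default_stat;
        if (!fops->open)
                fops->open = toy_default_open;
        if (!fops->readv)
                fops->readv = toy_default_readv;
}

/* Returns the sum of one defaulted call and one customized call. */
static int
toy_defaults_demo (void)
{
        toy_fops_t fops = { .readv = toy_my_readv };  /* stat, open stay NULL */

        toy_fill_defaults (&fops);
        if (fops.stat != toy_default_stat || fops.readv != toy_my_readv)
                return -1;
        return fops.stat (7) + fops.readv (7);        /* 7 + 14 */
}
```

The point of the design is that a translator author writes only the entries that differ from a plain pass-through, and every other operation still reaches the next layer.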

Translators use an asynchronous, callback-based mechanism, which means the code that handles a given request must be split into two parts: a call function and a callback function. An xlator's call function invokes the corresponding function of the next translator and then returns without blocking. The callback may be invoked immediately, inside that call, or later from a different thread. In either case the callback does not inherit the caller's context the way a synchronous function would. GlusterFS provides several ways to save context and pass it between a call function and its callback, but the xlator itself must manage this; it cannot rely entirely on the protocol stack.

The translator callback mechanism is built on STACK_WIND and STACK_UNWIND. When one of an xlator's fops functions is called, a request has arrived, and it is represented by a stack frame. The fops function performs its work and then uses STACK_WIND to pass the request to the next translator (or to several of them). STACK_UNWIND must be called when a request is completed without calling the next translator, or when the work is finished in the callback and the result has to be returned to the previous translator. In practice it is better to use STACK_UNWIND_STRICT, which lets you state the type of request you are completing. The macros are defined in stack.h with the following prototypes:

#define STACK_WIND(frame, rfn, obj, fn, params ...)
#define STACK_WIND_COOKIE(frame, rfn, cky, obj, fn, params ...)
#define STACK_UNWIND(frame, params ...)
#define STACK_UNWIND_STRICT(op, frame, params ...)

The parameters are as follows:

frame: the stack frame that represents the request.
rfn: the callback function, called when the next translator completes.
obj: the translator object to which control is being passed.
fn: the function to call, taken from the fops table of the next translator.
params: any other parameters of the called function (such as inodes, file descriptors, offsets, and data buffers).
cky: a cookie, that is, an opaque pointer that is handed back to rfn.
op: the operation type, used to check that the additional parameters match what the function expects.
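To make the roles of frame, rfn, obj, and fn more concrete, here is a simplified toy imitation of winding and unwinding. It is a sketch under stated assumptions: toy_wind and toy_unwind are hypothetical functions, not the real macros from stack.h, and the frames live on the C stack, so the model only covers the case where the callback runs before the call function returns.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: a simplified imitation of winding and unwinding.
 * The real STACK_WIND / STACK_UNWIND macros in stack.h do much more. */
typedef struct toy_xl    toy_xl_t;
typedef struct toy_frame toy_frame_t;
typedef void (*toy_cbk_t) (toy_frame_t *frame, int op_ret);
typedef void (*toy_fop_t) (toy_frame_t *frame, toy_xl_t *this);

struct toy_xl {
        toy_xl_t *child;
        toy_fop_t stat;
};

struct toy_frame {
        toy_frame_t *parent;  /* frame of the translator that wound to us */
        toy_cbk_t    ret;     /* callback to run in the parent on completion */
        int          result;  /* used by the root frame to record the answer */
};

/* Wind: create a child frame, remember the callback, call the child's fop. */
static void
toy_wind (toy_frame_t *frame, toy_cbk_t rfn, toy_xl_t *obj, toy_fop_t fn)
{
        toy_frame_t next = { frame, rfn, 0 };
        fn (&next, obj);
}

/* Unwind: hand the result back to the callback registered by the parent. */
static void
toy_unwind (toy_frame_t *frame, int op_ret)
{
        assert (frame->ret != NULL);
        frame->ret (frame->parent, op_ret);
}

/* Bottom translator: completes the request itself, so it unwinds. */
static void
toy_posix_stat (toy_frame_t *frame, toy_xl_t *this)
{
        (void) this;
        toy_unwind (frame, 42);
}

/* Callback of the middle translator: post-process, then unwind further. */
static void
toy_mid_stat_cbk (toy_frame_t *frame, int op_ret)
{
        toy_unwind (frame, op_ret + 1);
}

/* Call function of the middle translator: wind to the child and return. */
static void
toy_mid_stat (toy_frame_t *frame, toy_xl_t *this)
{
        toy_wind (frame, toy_mid_stat_cbk, this->child, this->child->stat);
}

/* Callback at the top of the stack: record the final result. */
static void
toy_top_cbk (toy_frame_t *frame, int op_ret)
{
        frame->result = op_ret;
}

static int
toy_wind_demo (void)
{
        toy_xl_t    posix = { NULL, toy_posix_stat };
        toy_xl_t    mid   = { &posix, toy_mid_stat };
        toy_frame_t root  = { NULL, NULL, -1 };

        toy_wind (&root, toy_top_cbk, &mid, mid.stat);
        return root.result;   /* 42 from the bottom, plus 1 added in the middle */
}
```

Note that the middle translator's call function and its callback receive the same frame, which is why the frame is the natural place to keep per-request context.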

Each translator stack frame has a local pointer for storing translator-specific context. This is the main mechanism for carrying context between a call function and its callback. When the stack is destroyed, each frame's local pointer, if not NULL, is passed to GF_FREE, but no other cleanup is performed. If the local struct contains pointers or references to other objects, you must release them yourself. Ideally, memory and other resources are released before the stack is destroyed rather than left to the automatic GF_FREE. The most appropriate approach is to define a translator-specific destroy function and call it explicitly before STACK_UNWIND.
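The recommendation to write a translator-specific destroy function can be sketched as follows. The types toy_frame_t and toy_local_t and the counter toy_live_allocs are hypothetical stand-ins used only for this illustration, and plain malloc/free replace the GlusterFS allocation macros.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical types standing in for a call frame and a
 * translator-specific local structure. */
typedef struct {
        char *path;     /* separately allocated, so freeing only the
                           enclosing struct would leak it */
        int   flags;
} toy_local_t;

typedef struct {
        void *local;
} toy_frame_t;

static int toy_live_allocs;   /* outstanding allocations in this model */

static toy_local_t *
toy_local_new (const char *path, int flags)
{
        toy_local_t *local = NULL;
        size_t       len   = 0;

        assert (path != NULL);
        local = calloc (1, sizeof (*local));
        if (!local)
                return NULL;
        len = strlen (path) + 1;
        local->path = malloc (len);
        if (!local->path) {
                free (local);
                return NULL;
        }
        memcpy (local->path, path, len);
        local->flags = flags;
        toy_live_allocs += 2;   /* the struct and the path string */
        return local;
}

/* Translator-specific destructor: release nested resources first. */
static void
toy_local_destroy (toy_frame_t *frame)
{
        toy_local_t *local = frame->local;

        if (!local)
                return;
        frame->local = NULL;    /* generic stack teardown now finds nothing */
        free (local->path);
        free (local);
        toy_live_allocs -= 2;
}

/* Returns the number of allocations still outstanding (0 means no leak). */
static int
toy_local_demo (void)
{
        toy_frame_t frame = { NULL };

        frame.local = toy_local_new ("/data/test-1/file", 0);
        if (!frame.local)
                return -1;
        /* ... the request would be wound and answered here ... */
        toy_local_destroy (&frame);   /* call this before unwinding */
        return toy_live_allocs;
}
```

Clearing frame->local inside the destructor matters: it prevents the generic teardown from freeing the same struct a second time.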

Most xlator call and callback functions take a file descriptor (fd_t) or an inode (inode_t) as a parameter. A translator often needs to store context of its own on these objects, independent of the lifetime of any single request, for example the layout map that DHT keeps for a directory, or the last known location of an inode. GlusterFS provides a set of functions for storing such context. In each case the second parameter is a pointer to the translator object with which the data is associated, and the stored value is an unsigned 64-bit integer. The functions return 0 on success. The _get and _del functions return the value through a reference parameter rather than as the return value.

inode_ctx_put (inode, xlator, value)
inode_ctx_get (inode, xlator, &value)
inode_ctx_del (inode, xlator, &value)
fd_ctx_set (fd, xlator, value)
fd_ctx_get (fd, xlator, &value)
fd_ctx_del (fd, xlator, &value)
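The put/get/del pattern above, with a 64-bit value keyed by the owning translator, can be modelled in a few lines. This is a toy sketch: toy_inode_t, toy_ctx_put, toy_ctx_get, and toy_ctx_del are hypothetical and do not reproduce the real inode_t layout or its locking. It also shows the common trick of storing a pointer in the 64-bit slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: hypothetical types; the real inode_t is not reproduced. */
#define TOY_CTX_SLOTS 4

typedef struct { const char *name; } toy_xl_t;

typedef struct {
        struct {
                toy_xl_t *key;     /* which translator owns this slot */
                uint64_t  value;
        } ctx[TOY_CTX_SLOTS];
} toy_inode_t;

/* Store (or overwrite) the value owned by xl. Returns 0 on success. */
static int
toy_ctx_put (toy_inode_t *inode, toy_xl_t *xl, uint64_t value)
{
        int i, free_slot = -1;

        assert (inode && xl);
        for (i = 0; i < TOY_CTX_SLOTS; i++) {
                if (inode->ctx[i].key == xl) {
                        inode->ctx[i].value = value;
                        return 0;
                }
                if (!inode->ctx[i].key && free_slot < 0)
                        free_slot = i;
        }
        if (free_slot < 0)
                return -1;         /* table full */
        inode->ctx[free_slot].key   = xl;
        inode->ctx[free_slot].value = value;
        return 0;
}

/* Fetch through a reference parameter; the return value is only a status. */
static int
toy_ctx_get (toy_inode_t *inode, toy_xl_t *xl, uint64_t *value)
{
        int i;

        for (i = 0; i < TOY_CTX_SLOTS; i++) {
                if (inode->ctx[i].key == xl) {
                        *value = inode->ctx[i].value;
                        return 0;
                }
        }
        return -1;
}

/* Fetch and remove in one step. */
static int
toy_ctx_del (toy_inode_t *inode, toy_xl_t *xl, uint64_t *value)
{
        int i;

        for (i = 0; i < TOY_CTX_SLOTS; i++) {
                if (inode->ctx[i].key == xl) {
                        *value = inode->ctx[i].value;
                        inode->ctx[i].key = NULL;
                        return 0;
                }
        }
        return -1;
}

/* Returns 0 if all steps behave as expected. */
static int
toy_ctx_demo (void)
{
        toy_inode_t inode;
        toy_xl_t    a = { "a" }, b = { "b" };
        int         layout = 1234;   /* pretend per-inode private data */
        uint64_t    v = 0;

        memset (&inode, 0, sizeof (inode));
        if (toy_ctx_put (&inode, &a, (uint64_t) (uintptr_t) &layout) != 0)
                return -1;
        if (toy_ctx_put (&inode, &b, 99) != 0)
                return -1;
        if (toy_ctx_get (&inode, &a, &v) != 0 ||
            *(int *) (uintptr_t) v != 1234)   /* the pointer round-trips */
                return -1;
        if (toy_ctx_del (&inode, &b, &v) != 0 || v != 99)
                return -1;
        return toy_ctx_get (&inode, &b, &v) == -1 ? 0 : -1;
}
```

Because each translator uses its own object as the key, two translators in the same stack can attach context to the same inode without interfering with each other.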

The inode_t or fd_t pointer passed to a call function or callback is only a borrowed reference. If you want the object to remain valid afterwards, call inode_ref or fd_ref to take a persistent reference, and call inode_unref or fd_unref when the reference is no longer needed.
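The borrowed-reference rule can be illustrated with a minimal reference-counting model. This is a sketch only: toy_inode_new, toy_inode_ref, and toy_inode_unref are hypothetical functions that imitate the ref/unref pattern without the locking a real implementation needs.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: a hypothetical reference-counted object. */
typedef struct {
        int refcount;
        int ino;
} toy_inode_t;

static int          toy_destroyed;   /* objects freed so far, for the demo */
static toy_inode_t *toy_saved;       /* a translator keeping an inode around */

static toy_inode_t *
toy_inode_new (int ino)
{
        toy_inode_t *inode = calloc (1, sizeof (*inode));

        if (inode) {
                inode->refcount = 1;   /* the creator holds the first reference */
                inode->ino      = ino;
        }
        return inode;
}

static toy_inode_t *
toy_inode_ref (toy_inode_t *inode)
{
        assert (inode && inode->refcount > 0);
        inode->refcount++;
        return inode;
}

static void
toy_inode_unref (toy_inode_t *inode)
{
        assert (inode && inode->refcount > 0);
        if (--inode->refcount == 0) {
                toy_destroyed++;
                free (inode);
        }
}

/* A fop only borrows its inode argument; ref it to keep it past the call. */
static void
toy_fop (toy_inode_t *borrowed)
{
        toy_saved = toy_inode_ref (borrowed);
}

/* Returns the inode number read through the saved reference, or -1. */
static int
toy_ref_demo (void)
{
        toy_inode_t *inode = toy_inode_new (7);
        int          ino   = 0;

        if (!inode)
                return -1;
        toy_destroyed = 0;
        toy_fop (inode);
        toy_inode_unref (inode);       /* the caller drops its reference */
        if (toy_destroyed != 0)
                return -1;             /* must still be alive: we hold a ref */
        ino = toy_saved->ino;          /* safe to use */
        toy_inode_unref (toy_saved);   /* last reference: object is freed */
        toy_saved = NULL;
        return toy_destroyed == 1 ? ino : -1;
}
```

Without the toy_inode_ref call in toy_fop, the object would be freed when the caller drops its reference and the saved pointer would dangle.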

Another common type is dict_t, a general-purpose dictionary (hash map) that uses strings as keys and can store values of any type. For example, a stored value can be a signed or unsigned integer, a string, or a binary blob of any size. Strings and blobs can be marked to be freed by GlusterFS functions when no longer needed, to be freed by glibc, or not to be freed at all. Both dict_t and data_t objects are reference counted and are released only when the count drops to 0. As with inodes and file descriptors, if you want a dict_t received as a parameter to persist, you must manage its lifetime with the _ref and _unref functions. Dictionaries are used not only in call and callback functions but also to pass options to modules, including translator initialization options. In fact, the main job of a translator's init function is to parse the options in the dictionary. To add an option to a translator, add an entry to the translator's options array. Each option can be a boolean, an integer, a string, a path, a translator name, or one of several other custom types. For a string option you can specify the list of valid values. The parsed options and other information can be stored in the private field of the xlator_t struct.
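The way init parses options against a declared option table can be sketched with a toy model. Everything here is an assumption made for illustration: a flat key/value array stands in for dict_t, toy_option_t only loosely mirrors volume_option_t, and the option names cache-size and read-only are invented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: all names are hypothetical. */
typedef struct { const char *key; const char *value; } toy_pair_t;

typedef enum { TOY_OPT_INT, TOY_OPT_BOOL } toy_opt_type_t;

typedef struct {
        const char    *key;
        toy_opt_type_t type;
        long           min;            /* min == max == 0: no range check */
        long           max;
        const char    *default_value;
} toy_option_t;

typedef struct {
        long cache_size;
        int  read_only;
} toy_private_t;

static const toy_option_t toy_options[] = {
        { "cache-size", TOY_OPT_INT,  1, 1024, "64"  },
        { "read-only",  TOY_OPT_BOOL, 0, 0,    "off" },
};

static const char *
toy_dict_get (const toy_pair_t *dict, size_t n, const char *key)
{
        size_t i;

        for (i = 0; i < n; i++)
                if (strcmp (dict[i].key, key) == 0)
                        return dict[i].value;
        return NULL;
}

/* Parse and validate options the way an init function might.
 * Returns 0 on success, -1 if a value is invalid. */
static int
toy_init (const toy_pair_t *dict, size_t n, toy_private_t *priv)
{
        size_t i;

        assert (priv != NULL && (dict != NULL || n == 0));
        for (i = 0; i < sizeof (toy_options) / sizeof (toy_options[0]); i++) {
                const toy_option_t *opt = &toy_options[i];
                const char         *val = toy_dict_get (dict, n, opt->key);

                if (!val)
                        val = opt->default_value;
                if (opt->type == TOY_OPT_INT) {
                        /* Only one integer option exists in this model. */
                        char *end = NULL;
                        long  num = strtol (val, &end, 10);

                        if (end == val || *end != '\0')
                                return -1;
                        if ((opt->min || opt->max) &&
                            (num < opt->min || num > opt->max))
                                return -1;
                        priv->cache_size = num;
                } else {
                        /* Only one boolean option exists in this model. */
                        if (strcmp (val, "on") == 0)
                                priv->read_only = 1;
                        else if (strcmp (val, "off") == 0)
                                priv->read_only = 0;
                        else
                                return -1;
                }
        }
        return 0;
}

/* Returns 0 if a valid value is accepted and an out-of-range one rejected. */
static int
toy_init_demo (void)
{
        const toy_pair_t good[] = { { "cache-size", "128" } };
        const toy_pair_t bad[]  = { { "cache-size", "4096" } };
        toy_private_t    priv   = { 0, 0 };

        if (toy_init (good, 1, &priv) != 0)
                return -1;
        if (priv.cache_size != 128 || priv.read_only != 0)
                return -1;   /* read-only falls back to its default, "off" */
        return toy_init (bad, 1, &priv) == -1 ? 0 : -1;
}
```

Declaring the range and default in the table keeps validation out of the individual operations, which can then trust the values stored in the private struct.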

Most logging in translators is done through the gf_log function. Its parameters are a string (usually this->name), a log level, a format string, and any additional arguments for the format. The log levels include GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_LOG, and GF_LOG_DEBUG. An xlator can wrap gf_log in its own macros or use the existing levels directly, so that the translator's log output can be controlled at run time. When designing an xlator, you can add a translator-specific log-level option, or implement a special xattr call through which a new value can be passed.
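Wrapping the logger in a translator-specific macro with a run-time threshold might look like the following toy model. It is a sketch under stated assumptions: toy_log, the TOY_LOG_* levels, and the NULL_DEBUG macro are hypothetical and do not come from logging.h.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: hypothetical levels and logger, not logging.h. */
typedef enum { TOY_LOG_ERROR, TOY_LOG_WARNING, TOY_LOG_DEBUG } toy_loglevel_t;

static toy_loglevel_t toy_threshold = TOY_LOG_WARNING;  /* tunable at run time */
static char           toy_last[128];                    /* last line logged */
static int            toy_count;                        /* lines emitted */

static void
toy_log (const char *domain, toy_loglevel_t level, const char *fmt, ...)
{
        va_list ap;
        int     off;

        assert (domain && fmt);
        if (level > toy_threshold)
                return;              /* too verbose for the current setting */
        off = snprintf (toy_last, sizeof (toy_last), "[%s] ", domain);
        if (off < 0 || (size_t) off >= sizeof (toy_last))
                return;
        va_start (ap, fmt);
        vsnprintf (toy_last + off, sizeof (toy_last) - (size_t) off, fmt, ap);
        va_end (ap);
        toy_count++;
        fprintf (stderr, "%s\n", toy_last);
}

/* A translator-specific wrapper macro that fills in the domain and level. */
#define NULL_DEBUG(fmt, ...) \
        toy_log ("test-null", TOY_LOG_DEBUG, fmt, __VA_ARGS__)

/* Returns 0 if filtering and formatting behave as expected. */
static int
toy_log_demo (void)
{
        toy_count     = 0;
        toy_threshold = TOY_LOG_WARNING;
        NULL_DEBUG ("lookup on %s", "/a");     /* dropped: below the threshold */
        if (toy_count != 0)
                return -1;
        toy_threshold = TOY_LOG_DEBUG;         /* e.g. changed through an option */
        NULL_DEBUG ("lookup on %s", "/a");
        if (toy_count != 1)
                return -1;
        return strcmp (toy_last, "[test-null] lookup on /a") == 0 ? 0 : -1;
}
```

Keeping the threshold in a variable rather than a compile-time constant is what allows an option or a special xattr call to change the verbosity of a running translator.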

4. Construct a new xlator

Here we build a null xlator to walk through the basic method of constructing a new xlator. The null xlator implements no real function; it is only a pass-through proxy that demonstrates the structure of an xlator and how to build one. The implementation consists of four files: null.h, null.c, null_fops.h, and null_fops.c. null_fops.h and null_fops.c have the same content as defaults.h and defaults.c. The content of null.h is as follows:

#ifndef __NULL_H__
#define __NULL_H__

#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif

#include "mem-types.h"

typedef struct {
        xlator_t *target;
} null_private_t;

enum gf_null_mem_types_ {
        gf_null_mt_priv_t = gf_common_mt_end + 1,
        gf_null_mt_end
};

#endif /* __NULL_H__ */

null.h defines the private data struct null_private_t and the xlator's own memory-accounting types. The content of null.c is as follows:

#include <ctype.h>
#include <sys/uio.h>

#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif

#include "glusterfs.h"
#include "call-stub.h"
#include "defaults.h"
#include "logging.h"
#include "xlator.h"

#include "null.h"
#include "null_fops.h"

int32_t
init (xlator_t *this)
{
        xlator_t       *tgt_xl = NULL;
        null_private_t *priv   = NULL;

        /* The null xlator is a pass-through, so it needs exactly one child. */
        if (!this->children || this->children->next) {
                gf_log (this->name, GF_LOG_ERROR,
                        "FATAL: null should have exactly one child");
                return -1;
        }

        priv = GF_CALLOC (1, sizeof (null_private_t), gf_null_mt_priv_t);
        if (!priv)
                return -1;

        /* Init priv here */
        tgt_xl = this->children->xlator;   /* the single child translator */
        priv->target = tgt_xl;

        /* Publish the private data so that fini can find and free it. */
        this->private = priv;

        gf_log (this->name, GF_LOG_DEBUG, "null xlator loaded");
        return 0;
}

void
fini (xlator_t *this)
{
        null_private_t *priv = this->private;

        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);

        return;
}

struct xlator_fops fops = {
        .lookup         = null_lookup,
        .stat           = null_stat,
        .fstat          = null_fstat,
        .truncate       = null_truncate,
        .ftruncate      = null_ftruncate,
        .access         = null_access,
        .readlink       = null_readlink,
        .mknod          = null_mknod,
        .mkdir          = null_mkdir,
        .unlink         = null_unlink,
        .rmdir          = null_rmdir,
        .symlink        = null_symlink,
        .rename         = null_rename,
        .link           = null_link,
        .create         = null_create,
        .open           = null_open,
        .readv          = null_readv,
        .writev         = null_writev,
        .flush          = null_flush,
        .fsync          = null_fsync,
        .opendir        = null_opendir,
        .readdir        = null_readdir,
        .readdirp       = null_readdirp,
        .fsyncdir       = null_fsyncdir,
        .statfs         = null_statfs,
        .setxattr       = null_setxattr,
        .getxattr       = null_getxattr,
        .fsetxattr      = null_fsetxattr,
        .fgetxattr      = null_fgetxattr,
        .removexattr    = null_removexattr,
        .lk             = null_lk,
        .inodelk        = null_inodelk,
        .finodelk       = null_finodelk,
        .entrylk        = null_entrylk,
        .fentrylk       = null_fentrylk,
        .rchecksum      = null_rchecksum,
        .xattrop        = null_xattrop,
        .fxattrop       = null_fxattrop,
        .setattr        = null_setattr,
        .fsetattr       = null_fsetattr,
        .getspec        = null_getspec,
};

struct xlator_cbks cbks = {
        .forget     = null_forget,
        .release    = null_release,
        .releasedir = null_releasedir,
};

struct volume_options options[] = {
        { .key  = {NULL} },
};

This file implements the init and fini functions mentioned above, the fops table of call function pointers, the cbks table of callback function pointers, and the volume options. It is the basic skeleton of a new xlator. Because no real function is implemented, the structs, variables, and functions are all simple. To implement an xlator with a specific function, you can extend this module. The xlators in the GlusterFS source tree are good examples, but most are complicated and not well suited to beginners. You can start by studying simple xlators such as rot-13, read-only, bypass, and negative-lookup [7][8][9][10], and then build the xlator you need.

5. Compile the new xlator

After designing and coding the new xlator, we need to compile it into a .so dynamic library that GlusterFS can load. Compilation depends on code from other parts of GlusterFS and needs a fairly elaborate build environment. The following Makefile sets up that environment:

# Change these to match your source code.
TARGET  = null.so
OBJECTS = null.o null_fops.o

# Change these to match your environment.
GLFS_SRC  = /home/liuag/glusterfs-3.2.5
GLFS_VERS = 3.2.5
GLFS_LIB  = /opt/glusterfs/3.2.5/lib64/
HOST_OS  = GF_LINUX_HOST_OS

# You shouldn't need to change anything below here.
CFLAGS  = -fPIC -Wall -O2 \
          -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
          -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
          -I$(GLFS_SRC)/contrib/uuid -I.
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

$(TARGET): $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)

install: $(TARGET)
	cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/null

clean:
	rm -f $(TARGET) $(OBJECTS)

Adjust the Makefile to match your own build environment, then run make to generate the null.so dynamic library and make install to install the new null xlator. At this point the new xlator has been built, and we can start using it.

6. Test the new xlator

The most exciting moment has finally arrived. Now we can add the newly built null xlator to a volume and test it by modifying the volume configuration file. The xlator can run on the client or on the server; to load it, modify the server-side volume configuration file or the FUSE (client-side) volume configuration file, respectively. The following example loads it on the server. The local volume configuration is modified as follows:

volume test-posix
    type storage/posix
    option directory /data/test-1
end-volume

volume test-null
    type null/null
    subvolumes test-posix
end-volume

volume test-access-control
    type features/access-control
    subvolumes test-null
end-volume

… …

OK. Now restart the glusterd service and mount the volume to test the null xlator. Of course, there is no visible behavior to test, because the xlator does not implement any function.

7. References

[1] Translator 101 Lesson 1: Setting the Stage, http://hekafs.org/index.php/2011/11/translator-101-class-1-setting-the-stage/
[2] Translator 101 Lesson 2: init, fini, and private context, http://hekafs.org/index.php/2011/11/translator-101-lesson-2-init-fini-and-private-context/
[3] Translator 101 Lesson 3: This Time for Real, http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/
[4] Translator 101 Lesson 4: Debugging a Translator, http://hekafs.org/index.php/2011/11/translator-101-lesson-4-debugging-a-translator/
[5] GlusterFS translator API, http://hekafs.org/dist/xlator_api_2.html
[6] GlusterFS translator concepts, http://www.gluster.org/community/documentation/index.php/GlusterFS_Concepts#Translator
[7] GlusterFS rot-13 translator, https://github.com/jdarcy/glusterfs/tree/master/xlators/encryption/rot-13
[8] GlusterFS read-only translator, https://github.com/jdarcy/glusterfs/tree/master/xlators/features/read-only
[9] GlusterFS bypass translator, https://github.com/jdarcy/bypass
[10] GlusterFS negative-lookup translator, https://github.com/jdarcy/negative-lookup
[11] glusterfs Cluster File System Research, http://blog.csdn.net/liuben/article/details/6284551
