Research on Namespace Implementation in Distributed File Systems

Source: Internet
Author: User

1. Namespace Overview
The namespace is an important part of a file system. It gives users a visible, understandable view of the file system and thereby closes or narrows the semantic gap between humans and computers with respect to data storage. The tree structure is currently the closest match to how things are organized in the real world and is widely accepted. Most file systems therefore organize files and directories as a tree, including disk file systems (ext2/3/4, XFS, JFS, ReiserFS, ZFS, Btrfs, NTFS, FAT32), network file systems (NFS, AFS, CIFS/SMB), cluster file systems (Lustre, pNFS, PVFS, GPFS, PanFS), and distributed file systems (GoogleFS, HDFS, MFS, KFS, TaobaoFS, FastDFS).

With the development of object-based storage and cloud storage, a flat organization method has emerged. Typical examples include Lustre, PanFS, Amazon S3, and Google Storage. In this approach, all files and directories are treated as objects. Each object has a globally unique UUID, and users access the storage system by UUID rather than by path. However, a UUID is meaningful only to computers. The user interface layer therefore still needs to provide a tree-like file system view and convert between paths and UUIDs. At the object storage layer, objects, or fragments of object data, are stored as files on a disk file system, so the physical storage layer is still a tree structure. Content-addressable storage (CAS), which is widely used for regulatory compliance in data storage, adopts a similar mechanism.

Specifically, the namespace of a disk file system is implemented directly on disk, usually organized as a B-tree, B+-tree, or B*-tree, with metadata and data stored on the same medium. In a distributed file system, metadata and data are stored and accessed separately, which follows from design requirements such as high performance, availability, and scalability. Generally, data access is handled by I/O servers, while metadata is managed by a metadata server. Managing the namespace is one of the core tasks of the metadata server, which may also be responsible for security mechanisms (such as authentication and authorization), locking, and I/O load balancing. Because metadata is separated from data, the namespace of a distributed file system can be designed with a great deal of freedom, and more implementation options are available. This article introduces four namespace implementations for distributed file systems, all of which present a tree-shaped file system view. They fall roughly into two groups, file system-based and fully in-memory; database-based implementations are not covered. Implementing a file system namespace on a database has well-known performance problems, especially for recursive traversal of the directory tree.

2. Four Namespace Implementations
(1) File System-Based Design
This design "stands on the shoulders of giants". A disk file system already presents a tree-structured view, so this ready-made mechanism can be used to implement the namespace on the metadata server. For every directory or file in the distributed file system (DFS), a corresponding directory or file (a meta directory or meta file) is created on the local file system of the metadata server. The mapping between the two is shown in Figure 1. A meta directory represents a DFS directory, and its attributes store the DFS directory attributes. A meta file represents a DFS file: its attributes store the DFS file attributes, and its content stores the metadata, including more detailed file attributes, access control information, data partitioning information, and data storage locations.

Figure 1. File system-based design (name mapping between DFS and the local file system)
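As a minimal sketch of the mapping in Figure 1, the C function below turns an absolute DFS path into the local path of its meta directory or meta file by prefixing a metadata root. The root `/var/dfs/meta`, the function name `dfs_to_meta_path`, and the rule of rejecting `..` components are assumptions made for illustration; the design described here does not specify them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical root of the meta tree on the metadata server's local disk. */
#define META_ROOT "/var/dfs/meta"

/*
 * Map an absolute DFS path such as "/a/b.txt" to the local path of its
 * meta directory or meta file. Returns 0 on success, or -1 if the path
 * is not absolute, contains a ".." component, or does not fit in out.
 */
int dfs_to_meta_path(const char *dfs_path, char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);
    if (dfs_path == NULL || dfs_path[0] != '/')
        return -1;

    /* Reject ".." components so a DFS path cannot escape META_ROOT. */
    for (const char *p = dfs_path; *p != '\0'; ) {
        while (*p == '/')
            p++;
        size_t len = strcspn(p, "/");
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return -1;
        p += len;
    }

    /* The meta path is simply META_ROOT followed by the DFS path. */
    int n = snprintf(out, out_len, "%s%s", META_ROOT, dfs_path);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With the resulting path, a metadata server could call ordinary local file system operations such as `stat()` on the meta directory or meta file to obtain the DFS attributes, and read the meta file content for the remaining metadata.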

Building on a file system, the DFS namespace is constructed at very low cost, simply and quickly. Meta files store only the metadata of data files and are generally small, less than 1 KB each. Once the number of files and directories reaches tens of millions, the LOSF (lots of small files) performance problem appears. How can this be solved in practice? There are currently two main approaches. The first is to use a file system suited to storing massive numbers of small files. ReiserFS is optimized for small-file storage: it improves file lookup efficiency and also saves disk space, and actual test results confirm this. The second is to use high-performance storage media, in particular media with high IOPS. Fortunately, SSD technology is now quite mature and its cost keeps falling, which makes it well suited to high-performance storage applications. SSDs offer high IOPS: an average SSD reaches 10,000 to 50,000 read/write IOPS and high-end SSDs can exceed 100,000, while FC, SAS, and SATA disks generally deliver fewer than 300 IOPS. Combining SSDs with ReiserFS can therefore improve performance greatly, which is adequate for most applications.

(2) Full-Memory Hierarchical Design
This approach is similar to the HDFS implementation. Unlike the file system-based design, the namespace resides entirely in the memory of the metadata server and is represented as a hierarchy, as shown in Figure 2. The hierarchy is equivalent to a tree in which each node represents a DFS directory or file. In theory the number of children per node is unlimited (in practice it depends on memory usage), and the children are held in a dynamic array. The node data structure is given below. The metadata field holds metadata similar to the meta file content in design (1). The children field is a dynamic array of child nodes, kept sorted by name in ascending order; insertion, lookup, and deletion use binary search.

Figure 2. Full-memory hierarchical design

struct inode {
    char *dname;             /* directory or file name, excluding path */
    char *metadata;          /* metadata */
    struct inode **children; /* child node array */
    uint32_t count;          /* number of subdirectories and files */
};

For an ls operation, the path is first parsed and split into individual directory names, and the search then starts from the root node. At each level the child array is searched with binary search in O(log(n)) time; once the target directory node is found, its child array is traversed. If the directory depth is h and the directory width is n, the time complexity of locating a directory or file is O(h * log(n)). For a file system this lookup cost is somewhat high, especially for distributed file systems with deep directory hierarchies and large numbers of entries per directory. HDFS derives its design ideas from GFS, but in namespace design there is still a gap between it and GFS. In addition, the full-memory design places high demands on memory. Assuming the metadata of each directory or file takes 100 bytes, the metadata for 10 million directories and files totals 100 * 10,000,000 = 1,000,000,000 bytes, or roughly 1 GB. Supporting more directories and files requires more memory.
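The path lookup described above can be sketched in C as follows. This is an illustrative sketch, not HDFS code: it reuses the inode structure from this section, assumes every children array is already sorted by dname in `strcmp` order, and the function names `find_child` and `ns_lookup` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct inode {
    char *dname;             /* directory or file name, excluding path */
    char *metadata;          /* metadata */
    struct inode **children; /* child node array, sorted by dname */
    uint32_t count;          /* number of subdirectories and files */
};

/* Binary search among dir's children for the name p[0..len-1]. */
static struct inode *find_child(const struct inode *dir,
                                const char *p, size_t len)
{
    uint32_t lo = 0, hi = dir->count;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        const char *d = dir->children[mid]->dname;
        int cmp = strncmp(p, d, len);
        if (cmp == 0 && d[len] != '\0')
            cmp = -1; /* the name is a proper prefix of d, so it sorts first */
        if (cmp == 0)
            return dir->children[mid];
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return NULL;
}

/* Resolve an absolute path from the root: one binary search per level. */
struct inode *ns_lookup(struct inode *root, const char *path)
{
    assert(root != NULL && path != NULL);
    struct inode *node = root;
    const char *p = path;
    while (node != NULL) {
        while (*p == '/')
            p++;                      /* skip path separators */
        if (*p == '\0')
            break;                    /* all components consumed */
        size_t len = strcspn(p, "/"); /* length of the next component */
        node = find_child(node, p, len);
        p += len;
    }
    return node; /* NULL if some component was not found */
}
```

Each path component costs one binary search over a child array, which is where the O(h * log(n)) bound comes from. An ls would call `ns_lookup` on the directory path and then iterate over the returned node's children array.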

(3) Full-Memory Hash Design
This approach is similar to the GFS implementation. The GFS paper states that its namespace uses a full-memory design with a flat organization, prefix compression, and binary search, and has no data structure that supports ls; the paper also notes that ls is therefore less efficient. GFS is not open source, so unlike HDFS its source code cannot be inspected, and its namespace implementation cannot be reproduced exactly. The basic full-memory hash design described here may be close to it. The design combines hashing with binary search: a directory is located by hashing its complete absolute path, and the children within that directory are located by binary search, as shown in Figure 3. Unlike the hierarchical design, which needs one binary search per path level, this design needs only one hash lookup and one binary search, so it performs better. Only directories are hashed, so the namespace is flat to a certain extent, but not completely flat as in GFS. Child entries do not store the parent path, which amounts to prefix compression, although the hierarchical design compresses prefixes more thoroughly. As a bold guess, GFS may use a fully hashed design or a fully sorted-list design in which ls is implemented by prefix matching: files that share the same prefix belong to the same directory, and this is how the namespace is realized.

 

Figure 3. Full-memory hash design

To find a file or execute ls in this design, the path is first split into the parent path and the entry name. The parent path is hashed to locate its child list; the specified file is then found by binary search, or the child list is traversed to perform ls. If the directory width is n, the lookup time complexity is O(log(n)). Memory usage is slightly higher than in the hierarchical design, because each directory node appears twice, once in the hash table and once in its parent's child list. This design has a data structure that supports ls, so ls is much more efficient than in GFS. If GFS uses a fully hashed design, ls must traverse the entire file space to match prefixes; if it uses a fully sorted-list design, ls must binary-search for the parent path and then match prefixes locally.
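A minimal C sketch of this lookup is shown below, under several assumptions that are not part of the original design: a fixed-size global table with chained buckets, the FNV-1a hash function, caller-owned directory nodes, and the hypothetical names `dirnode`, `dir_register`, `ns_exists`, and `ns_ls`. Each directory keeps its child names in a sorted array so that `bsearch` can be used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024 /* size of the global directory table (assumed) */

struct dirnode {
    const char *path;      /* absolute directory path, the hash key */
    const char **children; /* child names without path, sorted ascending */
    uint32_t count;        /* number of children */
    struct dirnode *next;  /* collision chain */
};

static struct dirnode *buckets[NBUCKETS];

/* 32-bit FNV-1a over len bytes (any reasonable string hash would do). */
static uint32_t hash_bytes(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Locate the directory whose absolute path is path[0..len-1]. */
static struct dirnode *dir_find(const char *path, size_t len)
{
    struct dirnode *d = buckets[hash_bytes(path, len) % NBUCKETS];
    while (d != NULL &&
           !(strlen(d->path) == len && memcmp(d->path, path, len) == 0))
        d = d->next;
    return d;
}

/* Add a caller-owned directory node to the global table. */
void dir_register(struct dirnode *d)
{
    uint32_t b = hash_bytes(d->path, strlen(d->path)) % NBUCKETS;
    d->next = buckets[b];
    buckets[b] = d;
}

static int cmp_name(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Return 1 if path exists: one hash on the parent, one binary search. */
int ns_exists(const char *path)
{
    assert(path != NULL && path[0] == '/');
    const char *slash = strrchr(path, '/');
    size_t plen = (slash == path) ? 1 : (size_t)(slash - path);
    const struct dirnode *dir = dir_find(path, plen);
    if (dir == NULL || dir->count == 0)
        return 0;
    return bsearch(slash + 1, dir->children, dir->count,
                   sizeof(char *), cmp_name) != NULL;
}

/* ls: hash the directory path, then the caller walks children[0..count-1]. */
const struct dirnode *ns_ls(const char *dir_path)
{
    return dir_find(dir_path, strlen(dir_path));
}
```

Note that a directory such as /docs shows up both as a registered `dirnode` and as the name "docs" in the child array of "/", which matches the slightly higher memory usage mentioned above.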

(4) Full-Memory Dual Hash Design
This approach improves on the full-memory hash design. It first hashes the directory path and then hashes the entry name within that directory, which further reduces the lookup time complexity from O(log(n)) to O(1) on average (two hash lookups), as shown in Figure 4. The directory hash table is global, while the hash table inside each directory node is local: every directory node contains its own hash table, used only to store the entries of that directory. The data structure of a directory node is as follows.

Figure 4. Full-memory dual hash design

struct inode {
    char *dname;         /* directory or file name, including path */
    char *metadata;      /* metadata */
    hashtable *children; /* child node hash table */
    uint32_t count;      /* number of subdirectories and files */
};

For an ls operation, the path name is hashed to locate the directory node, and the hash table in that node is then traversed. For a file lookup, the path name is first split into the parent path and the entry name; the parent path is hashed to locate the parent directory node, and the entry name is then hashed to locate the entry in that node's hash table. The hash table in a directory node is not created up front; it is created when the first entry is added to the directory. The number of hash table slots is set to the average number of entries per directory, as a compromise between performance and memory savings. If memory is plentiful, the number of slots can be set larger to reduce collisions. Compared with the full-memory hash design, this design improves lookup performance at the cost of higher memory consumption.
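The two-level lookup and the lazily created per-directory table can be sketched in C as follows. The sketch is built on assumptions rather than on a known implementation: chained hash tables, the FNV-1a hash function, caller-owned nodes, the bucket counts `DIR_BUCKETS` and `CHILD_BUCKETS`, and the names `entry`, `dirnode`, `dir_register`, `dir_add_child`, and `ns_find` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIR_BUCKETS   1024 /* size of the global directory table (assumed) */
#define CHILD_BUCKETS 16   /* assumed average number of entries per directory */

struct entry {               /* one file or subdirectory inside a directory */
    const char *name;        /* entry name, excluding path */
    const char *metadata;
    struct entry *next;      /* collision chain in the local table */
};

struct dirnode {
    const char *path;        /* absolute directory path, the global hash key */
    struct entry **children; /* local hash table, NULL until the first entry */
    uint32_t count;          /* number of entries in this directory */
    struct dirnode *next;    /* collision chain in the global table */
};

static struct dirnode *dirs[DIR_BUCKETS];

/* 32-bit FNV-1a over len bytes (any reasonable string hash would do). */
static uint32_t hash_bytes(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Add a caller-owned directory node to the global table. */
void dir_register(struct dirnode *d)
{
    uint32_t b = hash_bytes(d->path, strlen(d->path)) % DIR_BUCKETS;
    d->next = dirs[b];
    dirs[b] = d;
}

/* Link e into dir; the local table is allocated on first use only. */
int dir_add_child(struct dirnode *dir, struct entry *e)
{
    assert(dir != NULL && e != NULL);
    if (dir->children == NULL) {
        dir->children = calloc(CHILD_BUCKETS, sizeof *dir->children);
        if (dir->children == NULL)
            return -1;
    }
    uint32_t b = hash_bytes(e->name, strlen(e->name)) % CHILD_BUCKETS;
    e->next = dir->children[b];
    dir->children[b] = e;
    dir->count++;
    return 0;
}

/* Find the entry for an absolute path using two hash lookups. */
const struct entry *ns_find(const char *path)
{
    assert(path != NULL && path[0] == '/');
    const char *slash = strrchr(path, '/');
    size_t plen = (slash == path) ? 1 : (size_t)(slash - path);

    /* First hash: parent path -> directory node in the global table. */
    const struct dirnode *dir = dirs[hash_bytes(path, plen) % DIR_BUCKETS];
    while (dir != NULL &&
           !(strlen(dir->path) == plen && memcmp(dir->path, path, plen) == 0))
        dir = dir->next;
    if (dir == NULL || dir->children == NULL)
        return NULL;

    /* Second hash: entry name -> slot in the directory's local table. */
    const char *name = slash + 1;
    const struct entry *e =
        dir->children[hash_bytes(name, strlen(name)) % CHILD_BUCKETS];
    while (e != NULL && strcmp(e->name, name) != 0)
        e = e->next;
    return e;
}
```

Raising `CHILD_BUCKETS` shortens the collision chains but costs memory in every non-empty directory, which is the trade-off described above.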

3. Comparative Analysis and Application Selection
By implementation location, the four namespace implementations fall into two groups: file system-based and memory-based. The advantages of the file system-based implementation are that it is simple to implement, has low memory requirements, and runs on ordinary machines. Its disadvantage is that performance may be low. With SSD plus ReiserFS, performance should not be a major problem, but the cost rises considerably. The advantage of the memory-based implementations is high performance. Their disadvantages are very high memory requirements and more complex implementation; in addition, the in-memory namespace needs a persistence mechanism to protect it against unexpected crashes or errors. Among the three memory-based implementations, performance ranks dual hash design > hash design > hierarchical design, and memory consumption ranks in the same order, so the fastest design is also the least memory-efficient. In practice, choose according to the cost budget and performance requirements: the principle is to maximize cost-effectiveness while meeting the design requirements.
