Reading and analyzing the KFS distributed file system source code (I): metaserver metadata organization structure

Source: Internet
Author: User

The metaserver of the KFS file system manages its metadata with a B+ tree. The following walks through the source code to analyze how metadata is organized and implemented in the KFS metaserver.

1. Related source code files

The KFS metaserver metadata management code is in the directory kfs-[version]/src/cc/meta. The relevant source files are:

(1) meta/base.h: the base classes for every node in the KFS metadata, namely class Key and class MetaNode, which represent the keys stored in the B+ tree and the basic information common to all B+ tree nodes.

(2) meta/meta.h and meta/meta.cc: encapsulate the metadata classes Meta, MetaDentry, MetaFattr, and MetaChunkInfo. They represent the directory entries in the file system, the attributes of a file or directory, and the chunk information for a file offset.

(3) meta/kfstree.h and meta/kfstree.cc: encapsulate the operations on B+ tree nodes (Node) and on the tree itself (Tree), such as inserting and deleting nodes. This is the underlying B+ tree implementation and is not tied to the file system.

(4) meta/kfsops.cc: encapsulates the KFS file system operations built on the B+ tree storage, such as creating files, deleting files, creating directories, and deleting directories (operations of Tree).

(5) meta/request.h and meta/request.cc: encapsulate the processing of metadata requests issued by a chunkserver or a KFS client. The corresponding operation is performed through the Tree object metatree, which calls the basic KFS file system operations.

2. Why a B+ tree?

The KFS file system uses a B+ tree. Why a B+ tree rather than a B-tree? Here is a simple analysis:

2.1 B-tree

B-tree definition:

A B-tree is a balanced multi-way search tree. A B-tree of order m is either an empty tree or an m-way tree that satisfies the following properties:

(1) Each node in the tree has at most m subtrees;

(2) If the root node is not a leaf node, it has at least two subtrees;

(3) Every non-terminal node other than the root has at least ⌈m/2⌉ subtrees;

(4) Every non-terminal node contains the following data: (n, P0, K1, P1, K2, P2, ..., Kn, Pn),

where Ki is a keyword and Ki < Ki+1; Pi is a pointer to the root of a subtree; every keyword in the subtree pointed to by Pi is greater than Ki and less than Ki+1; and every keyword in the subtree pointed to by Pn is greater than Kn;

(5) All leaf nodes are on the same layer.

B-tree search:

Start from the root node and binary-search the ordered keyword sequence inside the node. If the keyword is found, the search ends immediately. Otherwise, descend into the child whose range covers the keyword being searched for. Repeat this process until the child pointer is null or the current node is a leaf.
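The search procedure just described can be sketched in C++ as follows. This is only an illustrative sketch: BTreeNode and btreeSearch are hypothetical names invented for this example and are not part of the KFS source, and keywords are plain integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified B-tree node for illustration (not a KFS type).
struct BTreeNode {
    std::vector<int> keys;                // K1 < K2 < ... < Kn
    std::vector<const BTreeNode*> child;  // P0 ... Pn; empty for a leaf
};

// Returns the node that holds `key`, or nullptr if the key is absent.
const BTreeNode* btreeSearch(const BTreeNode* node, int key) {
    while (node != nullptr) {
        // A non-leaf node with n keys must have n + 1 child pointers.
        assert(node->child.empty() ||
               node->child.size() == node->keys.size() + 1);
        // Binary search inside the node: first keyword >= key.
        auto it = std::lower_bound(node->keys.begin(), node->keys.end(), key);
        if (it != node->keys.end() && *it == key) {
            return node;  // hit; this may be a non-leaf node
        }
        if (node->child.empty()) {
            return nullptr;  // reached a leaf without a hit
        }
        // Descend into the subtree whose range covers `key`.
        std::size_t i = static_cast<std::size_t>(it - node->keys.begin());
        node = node->child[i];
    }
    return nullptr;
}
```

Note that the function can return an internal node, which is exactly the property that distinguishes B-tree search from B+ tree search.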

B-Tree features:

(1) The set of keywords is distributed across the entire tree;

(2) Any keyword appears in exactly one node;

(3) A search may end at a non-leaf node;

(4) Search performance is equivalent to a binary search over the complete set of keywords;

(5) The number of levels is controlled automatically.

2.2 B+ tree

B+ tree definition:

The B+ tree is also a balanced multi-way search tree. It is a variant of the B-tree that arose from the needs of file systems. A B+ tree of order m satisfies the following conditions:

(1) Each non-terminal node has at most m subtrees;

(2) Every non-terminal node other than the root has at least ⌊(m+1)/2⌋ subtrees (the same bound as ⌈m/2⌉);

(3) The root node has at least two subtrees;

(4) A node with n subtrees contains n keywords;

(5) The leaf nodes together contain all the keywords and the pointers to the records holding those keywords, and the leaf nodes are linked to one another in ascending keyword order;

(6) All non-terminal nodes can be regarded as the index part. They contain only the largest keyword of each of their child nodes and the pointers to those child nodes.

A B+ tree usually has two head pointers, one pointing to the root node and the other pointing to the leaf node with the smallest keyword.

B+ tree search:

A B+ tree can be searched in two ways:

(1) A sequential search, which starts from the head pointer that points to the leaf node with the smallest keyword and follows the leaf links;

(2) A random search, which starts from the head pointer that points to the root node. As with a B-tree, this is equivalent to a binary search over the complete set of keywords. The difference is that a B+ tree search hits only when it reaches a leaf node, whereas a B-tree search can hit at a non-leaf node.
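Both search modes can be sketched in C++ as follows. This is a minimal sketch under the definition above (each internal node stores the largest keyword of each child, and leaves are linked in ascending order). BPlusNode, bplusContains, and bplusScan are hypothetical names for this example and do not correspond to the KFS Node or Tree classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified B+ tree node for illustration (not a KFS type).
struct BPlusNode {
    bool leaf = false;
    std::vector<int> keys;          // internal: largest keyword of each child
                                    // leaf: the stored keywords, ascending
    std::vector<BPlusNode*> child;  // internal only, same length as keys
    BPlusNode* next = nullptr;      // leaf only: next leaf in keyword order
};

// Random search from the root: always descends to a leaf before deciding.
bool bplusContains(const BPlusNode* node, int key) {
    while (node != nullptr && !node->leaf) {
        assert(node->child.size() == node->keys.size());
        // First child whose largest keyword is >= key.
        auto it = std::lower_bound(node->keys.begin(), node->keys.end(), key);
        if (it == node->keys.end()) {
            return false;  // key is larger than every keyword in this subtree
        }
        node = node->child[static_cast<std::size_t>(it - node->keys.begin())];
    }
    return node != nullptr &&
           std::binary_search(node->keys.begin(), node->keys.end(), key);
}

// Sequential search: walk the leaf chain starting at the smallest leaf.
std::vector<int> bplusScan(const BPlusNode* firstLeaf) {
    std::vector<int> out;
    for (const BPlusNode* p = firstLeaf; p != nullptr; p = p->next) {
        out.insert(out.end(), p->keys.begin(), p->keys.end());
    }
    return out;
}
```

Even when the keyword being searched for equals an index keyword in an internal node, bplusContains keeps descending, so every successful lookup ends at a leaf.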

Features of the B+ tree:

(1) All keywords appear in the linked list of leaf nodes (a dense index), and the keywords in the linked list are in sorted order;

(2) A search can hit only at a leaf node, never at a non-leaf node;

(3) The non-leaf nodes act as an index over the leaf nodes (a sparse index), and the leaf nodes act as the data layer that stores the keyword data;

(4) It is better suited to file index systems.

2.3 Comparison between the B+ tree and the B-tree

With the definitions and features of the B-tree and the B+ tree in hand, we can compare the two:

(1) Space occupation:

    • The non-leaf nodes of a B-tree carry the keyword record information as well, so they occupy relatively more space;
    • Only the leaf nodes of a B+ tree carry the keyword record information. The non-leaf nodes hold no pointers to the records of the keywords, so they occupy relatively little space.

(2) Search path length:

    • Because all the keywords of a B+ tree are placed in the leaf nodes and the non-leaf nodes serve only as an index, a B+ tree has more levels (a greater height) than a B-tree of the same order. The search path is longer and the computation takes relatively more time;
    • Because the keywords of a B-tree are spread over all its nodes instead of all being placed in the leaves, a B-tree has fewer levels than a B+ tree, the search path is shorter, and the computation takes relatively less time.

For file system design, the critical bottleneck is disk I/O. The less disk space the index occupies, the less time the I/O operations take. Searching a data structure that is already in memory (such as a B+ tree or a B-tree) costs far less time than a disk I/O operation, so the in-memory search time is not the main bottleneck.

Therefore, although for a B-tree and a B+ tree of the same order the height and the average search length of the B+ tree are greater, the most time-consuming operation is disk I/O, and the B-tree occupies relatively more space, which is a clear disadvantage for I/O. Because the non-leaf nodes of a B+ tree hold no record information and contain only index entries, a disk block of the same size stores more index entries, fewer disk accesses are needed, and lookups are faster than with a B-tree.
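A small calculation makes the block-size argument concrete. The sketch below is purely illustrative and every number in it is an assumption rather than a KFS value: a 4096-byte disk block, 8-byte keywords, 8-byte pointers, and 112 bytes of record data stored next to each keyword in a B-tree node. The helper names entriesPerBlock and levelsNeeded are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// How many fixed-size entries fit in one disk block.
std::uint64_t entriesPerBlock(std::uint64_t blockBytes,
                              std::uint64_t entryBytes) {
    assert(entryBytes > 0);
    return blockBytes / entryBytes;
}

// Smallest number of levels h such that fanout^h >= n, that is, a rough
// estimate of how many blocks a lookup reads on its way to one of n keywords.
int levelsNeeded(std::uint64_t n, std::uint64_t fanout) {
    assert(fanout >= 2);
    int levels = 0;
    std::uint64_t reach = 1;  // keywords reachable with `levels` levels
    while (reach < n) {
        reach *= fanout;
        ++levels;
    }
    return levels;
}
```

Under these assumptions a B+ tree index entry takes 8 + 8 = 16 bytes, so entriesPerBlock(4096, 16) gives a fan-out of 256, while a B-tree entry takes 8 + 8 + 112 = 128 bytes, so entriesPerBlock(4096, 128) gives a fan-out of 32. For one million keywords, levelsNeeded(1000000, 256) is 3 and levelsNeeded(1000000, 32) is 4. With the block size held fixed, the B+ tree therefore needs fewer block reads per lookup, even though it would be the taller tree if both had the same order.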

2.4 Choosing the B+ tree

In practice the B+ tree is better suited than the B-tree to operating system file indexes and database indexes, because:

(1) Disk read/write cost is lower: even though the computation time of a B-tree is shorter than that of a B+ tree, its disadvantage in disk I/O makes its overall efficiency lower than that of a B+ tree.

(2) Query efficiency is more stable: every keyword search in a B+ tree must travel from the root node to a leaf node. The path length is therefore the same for all keyword queries, and every query costs about the same.

3. Metadata Organization Structure

The metaserver metadata in the KFS file system is implemented with several types of B+ tree node:


(1) MetaNode: the common base class of all leaf nodes and internal nodes. It records the type of each tree node.

(2) Node: an internal node. It implements the operations on the internal nodes of the tree.

(3) Meta: a leaf node. The concrete leaf node types are:

    • MetaDentry: a directory entry, which associates a name in a parent directory with a file ID.
    • MetaFattr: the attributes of a file or directory, equivalent to an inode in KFS.
    • MetaChunkInfo: the chunk information for a given file offset.

3.1 MetaNode

Member variables:

    MetaType type;   // node type
    int flagbits;    // flag bits

Constructors:

    MetaNode(MetaType t)          // initializes type = t, flagbits = 0
    MetaNode(MetaType t, int f)   // initializes type = t, flagbits = f

3.2 Node

Member variables:

    int count;                   // number of child nodes
    Key childKey[NKEY];          // keys of the child nodes
    MetaNode *childNode[NKEY];   // pointers to the child nodes
    Node *next;                  // next adjacent node

Constructors:

    Node(int f)   // initializes the MetaNode part with type = KFS_INTERNAL, flagbits = f

3.3 Meta

Member variables:

    fid_t fid;   // file ID

Constructors:

    Meta(MetaType t, fid_t id)   // initializes type = t in MetaNode and its own fid = id

3.4 MetaDentry

Member variables:

    fid_t dir;     // file ID of the parent directory
    string name;   // name of the directory entry
    fid_t fid;     // file ID of this directory entry

Constructors:

    MetaDentry(fid_t parent, string fname, fid_t myID)


Example: looking up /root/1.txt through the dentry structure:

(1) Obtain the fid of "/": fid = 2

    dir = 2, name = "/", fid = 2

(2) Obtain the fid of "root": fid = 8

    dir = 2, name = "root", fid = 8
    dir = 2, name = "usr", fid = 6

(3) Obtain the fid of "1.txt": fid = 12

    dir = 8, name = "1.txt", fid = 12
    dir = 8, name = "2.txt", fid = 13
    dir = 8, name = "3.txt", fid = 14

The search process above shows that the fid of /root/1.txt is 12.
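The lookup in the example can be sketched in C++ as follows. This is a minimal sketch, not the KFS implementation: it replaces the B+ tree with a std::map keyed by (parent directory fid, name), and DentryTable and lookupPath are hypothetical names. The width of fid_t is an assumption; the root fid of 2 is taken from the example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using fid_t = long long;    // assumed width, for illustration only
const fid_t ROOTFID = 2;    // fid of "/", as in the example

// Stand-in for the dentry leaves: (parent dir fid, name) -> fid.
using DentryTable = std::map<std::pair<fid_t, std::string>, fid_t>;

// Resolves an absolute path one component at a time.
// Returns -1 if any component has no directory entry.
fid_t lookupPath(const DentryTable& table, const std::string& path) {
    assert(!path.empty() && path[0] == '/');
    fid_t dir = ROOTFID;
    std::istringstream in(path);
    std::string name;
    while (std::getline(in, name, '/')) {
        if (name.empty()) {
            continue;  // skip the leading '/' and any doubled '/'
        }
        auto it = table.find({dir, name});
        if (it == table.end()) {
            return -1;  // no such entry in this directory
        }
        dir = it->second;  // the fid found becomes the next parent directory
    }
    return dir;
}
```

With the entries from the example loaded into the table, lookupPath(table, "/root/1.txt") first finds (2, "root") with fid 8 and then (8, "1.txt") with fid 12. Because the key starts with the parent directory fid, all entries of one directory are adjacent in sorted order, which is the property an ordered index such as a B+ tree provides.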

3.5 MetaFattr

Member variables:

    FileType type;           // type (file or directory)
    int16_t numReplicas;     // number of replicas required for the file
    struct timeval mtime;    // modification time
    struct timeval ctime;    // attribute change time
    struct timeval crtime;   // creation time
    long long chunkcount;    // number of chunks
    off_t filesize;          // file size

Constructors:

    MetaFattr(FileType t, fid_t id, int16_t n)
    MetaFattr(FileType t, fid_t id, struct timeval mt, struct timeval ct, struct timeval crt, long long c, int16_t n)

3.6 MetaChunkInfo

Member variables:

    chunkOff_t offset;     // offset of the chunk within the file
    chunkId_t chunkId;     // ID of the chunk
    seq_t chunkVersion;    // version of the chunk

Constructors:

    MetaChunkInfo(fid_t file, chunkOff_t off, chunkId_t id, seq_t v)
    MetaChunkInfo(fid_t file, chunkOff_t off, chunkId_t id, seq_t v, clvector &m)
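How a (file, offset) pair could be resolved to a chunk can be sketched in C++ as follows. This is an illustrative sketch only: it uses a std::map in place of the B+ tree, the 64 MB chunk size is an assumption made for the example, the integer widths of the typedefs are assumed, and ChunkTable, chunkStartOffset, and findChunk are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <utility>

using fid_t = long long;       // assumed widths, for illustration only
using chunkOff_t = long long;
using chunkId_t = long long;

const chunkOff_t CHUNKSIZE = 64LL << 20;  // assumed chunk size: 64 MB

// Start offset of the chunk that contains byte `offset` of a file.
chunkOff_t chunkStartOffset(chunkOff_t offset) {
    assert(offset >= 0);
    return offset / CHUNKSIZE * CHUNKSIZE;
}

// Stand-in for the chunk-info leaves: (file fid, chunk start offset) -> chunk ID.
using ChunkTable = std::map<std::pair<fid_t, chunkOff_t>, chunkId_t>;

// Returns the chunk ID covering `offset` in `file`, or -1 if none is recorded.
chunkId_t findChunk(const ChunkTable& table, fid_t file, chunkOff_t offset) {
    auto it = table.find({file, chunkStartOffset(offset)});
    return it == table.end() ? -1 : it->second;
}
```

Keying the entries by file ID first and chunk offset second keeps all chunks of one file adjacent and in offset order, so reading a file sequentially corresponds to walking neighboring entries.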
