A few key data structures in Namenode

Source: Internet
Author: User

Reprint Address: http://blog.csdn.net/AE86_FC/article/details/5842020
A detailed anatomy of the NameNode startup process

Fsimage

The NameNode stores HDFS file and directory metadata in a binary file called the fsimage, and all HDFS operations performed between one save of the fsimage and the next are recorded in the edit log. When the edit log reaches a certain size (in bytes, defined by the fs.checkpoint.size parameter) or a certain period has elapsed since the last save (in seconds, defined by the fs.checkpoint.period parameter), the NameNode flushes the entire in-memory HDFS tree and file metadata back into the fsimage file. This is how the NameNode keeps the HDFS metadata durable.
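The checkpoint triggers mentioned above are ordinary Hadoop configuration properties. A minimal sketch of setting them, using the parameter names of the Hadoop 0.20/1.x line (the values shown are the usual defaults, one hour and 64 MB):

```xml
<!-- core-site.xml: checkpoint triggers for the secondary NameNode -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>      <!-- seconds since the last checkpoint -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>  <!-- edit log size in bytes that forces a checkpoint -->
</property>
```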

The fsimage is a binary file that records the metadata of all files and directories in HDFS. In the version of HDFS examined here, the file and directory records have the following format.

When the NameNode restarts, it loads the fsimage and reads the metadata from the file stream according to this format. From the storage format it can be seen that the fsimage saves the following information:

1. First comes an image header, which contains:

a) imgVersion (int): the version of the current image

b) namespaceID (int): ensures that DataNodes from another HDFS instance do not mistakenly connect to the current NameNode

c) numFiles (long): the number of files and directories contained in the entire file system

d) genStamp (long): the generation stamp at the time the image was generated
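The header fields listed above map directly onto Java's DataInputStream primitives. Below is a hedged sketch of the read, assuming exactly the field order listed; the real reader lives in FSImage.loadFSImage() and also branches on the layout version:

```java
import java.io.*;

// Hypothetical sketch of the fsimage header described above.
public class ImageHeader {
    int imgVersion;   // layout version of this image
    int namespaceId;  // guards against DataNodes from another HDFS instance
    long numFiles;    // total number of files and directories
    long genStamp;    // generation stamp when the image was saved

    static ImageHeader read(DataInputStream in) throws IOException {
        ImageHeader h = new ImageHeader();
        h.imgVersion  = in.readInt();
        h.namespaceId = in.readInt();
        h.numFiles    = in.readLong();
        h.genStamp    = in.readLong();
        return h;
    }

    // Round-trip helper for demonstration: serialize then parse in memory.
    static ImageHeader demo(int ver, int nsId, long files, long stamp) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(ver); out.writeInt(nsId);
            out.writeLong(files); out.writeLong(stamp);
            return read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

The demo helper round-trips the header through an in-memory stream, which is convenient for experimenting with the layout.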

2. Next comes the metadata of each file or directory. For a directory, the record contains the following information:

a) path (String): the path of the directory, such as "/user/build/build-index"

b) replication (short): the replication factor (a directory has no replicas, but the value recorded for a directory is still 3)

c) mtime (long): the modification time of the directory, as a timestamp

d) atime (long): the access time of the directory, as a timestamp

e) blockSize (long): the block size, which is 0 for a directory

f) numBlocks (int): the number of blocks the file actually contains; for a directory this value is -1, indicating that the record is a directory

g) nsQuota (long): the namespace quota; -1 if no quota is set

h) dsQuota (long): the disk-space quota; -1 if no quota is set

i) username (String): the name of the user who owns the directory

j) group (String): the group the directory belongs to

k) permission (short): the permission bits of the directory, such as 644, packed into a short

3. If the record read from the fsimage is a file, it additionally contains the following information for each block:

a) blockId (long): the ID of a block belonging to the file

b) numBytes (long): the size of the block

c) genStamp (long): the generation stamp of the block

When a file's numBlocks value is greater than 1, the file consists of more than one block, and the fsimage then contains one (blockId, numBytes, genStamp) entry for each of those blocks.
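Putting the directory and file fields together, a single inode record can be sketched as a sequence of DataInputStream reads. The field order, and the use of readUTF for the strings, are assumptions based on the listing above; the real on-disk layout varies across fsimage versions:

```java
import java.io.*;

// Illustrative sketch of one inode record in the layout described above.
public class InodeRecord {
    String path; short replication; long mtime, atime, blockSize;
    long[][] blocks;          // one {blockId, numBytes, genStamp} row per block
    long nsQuota, dsQuota; String user, group; short permission;

    static InodeRecord read(DataInputStream in) throws IOException {
        InodeRecord r = new InodeRecord();
        r.path        = in.readUTF();    // e.g. "/user/build/build-index"
        r.replication = in.readShort();  // recorded even for directories
        r.mtime       = in.readLong();
        r.atime       = in.readLong();
        r.blockSize   = in.readLong();   // 0 for a directory
        int numBlocks = in.readInt();    // -1 marks a directory
        int n = Math.max(numBlocks, 0);  // directories carry no block entries
        r.blocks = new long[n][3];
        for (int i = 0; i < n; i++) {
            r.blocks[i][0] = in.readLong(); // blockId
            r.blocks[i][1] = in.readLong(); // numBytes
            r.blocks[i][2] = in.readLong(); // genStamp
        }
        r.nsQuota = in.readLong();       // -1 when no namespace quota
        r.dsQuota = in.readLong();       // -1 when no disk-space quota
        r.user  = in.readUTF();
        r.group = in.readUTF();
        r.permission = in.readShort();   // e.g. 0644 packed into a short
        return r;
    }

    // Round-trip helper: serialize a one-block file record, then parse it.
    static InodeRecord demo() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF("/user/build/build-index");
            out.writeShort(3);
            out.writeLong(1L); out.writeLong(2L); out.writeLong(64L << 20);
            out.writeInt(1);                              // one block
            out.writeLong(1001L); out.writeLong(4096L); out.writeLong(7L);
            out.writeLong(-1L); out.writeLong(-1L);       // no quotas
            out.writeUTF("build"); out.writeUTF("supergroup");
            out.writeShort(0644);
            return read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```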

Therefore, when the NameNode starts, it must parse the fsimage according to this format in order to load the HDFS metadata recorded in it into memory.

BlocksMap

As is clear from the fsimage loading described above, the fsimage does not record the table of which DataNodes hold each block; it persists only the namespace information. In fact, the mapping from each block to its list of DataNodes is not persisted anywhere in Hadoop. Instead, when the DataNodes start, each one scans its local disks and reports the blocks stored on it to the NameNode; upon receiving each DataNode's block information, the NameNode keeps it in memory together with the DataNode that reported it. HDFS builds its block -> DataNodes table entirely through this reporting mechanism. The process by which a DataNode reports its blocks to the NameNode is called a blockReport, and the NameNode stores the block -> DataNode-list table in a data structure called the BlocksMap.

The internal structure of the BlocksMap is as follows: it is essentially a map from Block objects to BlockInfo objects. A Block object records only the blockId, the block size, and the generation stamp, which is exactly the information stored in the fsimage. BlockInfo inherits from Block, so in addition to those fields it holds a reference to the INodeFile object representing the HDFS file the block belongs to, and the list of DataNodes that store the block (the DN1, DN2, DN3 mentioned below; the data structure is detailed in the next section).
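A stripped-down sketch of the map just described. The class shapes imitate HDFS's Block, BlockInfo, and BlocksMap, but the names and method signatures here are simplified stand-ins, not the real API:

```java
import java.util.*;

class Block {
    final long blockId;
    long numBytes, genStamp;
    Block(long id, long bytes, long stamp) { blockId = id; numBytes = bytes; genStamp = stamp; }
    @Override public boolean equals(Object o) {       // keyed by blockId only
        return o instanceof Block && ((Block) o).blockId == blockId;
    }
    @Override public int hashCode() { return Long.hashCode(blockId); }
}

class BlockInfo extends Block {
    Object inode;              // back-reference to the owning INodeFile (omitted here)
    final Object[] triplets;   // 3 slots per replica: {datanode, prev, next}
    BlockInfo(Block b, int replication) {
        super(b.blockId, b.numBytes, b.genStamp);
        triplets = new Object[3 * replication];   // left empty until blockReports arrive
    }
}

public class BlocksMap {
    private final Map<Block, BlockInfo> map = new HashMap<>();

    // Called while loading the fsimage: register a block for a file.
    public BlockInfo addINode(Block b, int replication) {
        return map.computeIfAbsent(b, k -> new BlockInfo(k, replication));
    }
    public BlockInfo get(Block b) { return map.get(b); }
    public int size() { return map.size(); }
}
```

Keying the map on blockId alone is what lets a bare Block parsed from a blockReport find the richer BlockInfo created during the fsimage load.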

So after the NameNode starts and loads the fsimage, the keys of the BlocksMap (the Block objects) are fully loaded, and for each key the corresponding BlockInfo value is complete except that the array holding its DataNodes list is empty. In other words, once the fsimage is loaded, the only information missing from the BlocksMap is the mapping from each block to the DataNodes that hold it. That missing information is built up from the blockReports received from the DataNodes, as described above. Once the NameNode has finished processing the blockReports from all DataNodes, the entire BlocksMap structure is complete.

DataNode list data structure in the BlocksMap

In a BlockInfo, the list of DataNodes holding the block is kept in an Object[] array, but the array holds more than the DataNode references. In fact it stores one triple per replica. For a block with three replicas placed on the DataNodes DN1, DN2, and DN3, each DataNode corresponds to one triple: the first element references the DataNode itself, the second element (prev block) references the BlockInfo of the previous block stored on that DataNode, and the third element (next block) references the BlockInfo of the next block stored on that DataNode. A BlockInfo therefore contains exactly as many of these triples as the block has replicas.

The NameNode saves the block -> DataNode list in this structure in order to conserve memory. The NameNode keeps the block -> DataNodes correspondence in memory, and as the number of files in HDFS grows, the number of blocks grows with it, so this mapping alone already consumes a considerable amount of memory. Keeping a DataNode -> block list table in the same way would consume even more. In practice, requests to look up the blocks stored on a given DataNode are rare; most of the time the query is for the DataNode list of a given block. The NameNode therefore stores only the block -> DataNode-list correspondence, and when a DataNode -> block-list query is needed, the result is derived by following the next-block pointers in the data structure, with no need to keep a DataNode -> block list in memory.
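The prev/next threading can be sketched as follows. The class and method names are illustrative, but the mechanism matches the description: one {datanode, prev, next} triple per replica, and walking the next pointers of a given DataNode recovers that DataNode's block list without any separate DataNode -> blocks map:

```java
import java.util.*;

public class DatanodeInfo {
    final String name;
    BlockNode blockListHead;              // head of this DataNode's block list
    public DatanodeInfo(String name) { this.name = name; }

    // Link a block into this DataNode's list (head insertion).
    public void addBlock(BlockNode b) {
        b.setNext(this, blockListHead);
        b.setPrev(this, null);
        if (blockListHead != null) blockListHead.setPrev(this, b);
        blockListHead = b;
    }

    // Derive the DataNode -> blocks list by walking the next pointers.
    public List<Long> listBlocks() {
        List<Long> ids = new ArrayList<>();
        for (BlockNode b = blockListHead; b != null; b = b.next(this)) ids.add(b.blockId);
        return ids;
    }
}

class BlockNode {
    final long blockId;
    final Object[] triplets;              // {datanode, prev, next} per replica
    BlockNode(long id, int replication) { blockId = id; triplets = new Object[3 * replication]; }

    // Locate this DataNode's triple, claiming a free one on first use.
    private int slot(DatanodeInfo dn) {
        int free = -1;
        for (int i = 0; i < triplets.length; i += 3) {
            if (triplets[i] == dn) return i;
            if (free < 0 && triplets[i] == null) free = i;
        }
        if (free < 0) throw new IllegalStateException("no free replica slot");
        triplets[free] = dn;
        return free;
    }
    void setPrev(DatanodeInfo dn, BlockNode p) { triplets[slot(dn) + 1] = p; }
    void setNext(DatanodeInfo dn, BlockNode n) { triplets[slot(dn) + 2] = n; }
    BlockNode next(DatanodeInfo dn) { return (BlockNode) triplets[slot(dn) + 2]; }
}
```

Each block pays only three object slots per replica, yet every DataNode still gets a traversable list of its own blocks, which is exactly the memory trade-off the text describes.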

NameNode startup: the fsimage loading process

The fsimage loading process mainly does the following:

1. Read every directory and every file recorded in the fsimage

2. Initialize the metadata of each directory and file

3. Build the image of the entire namespace in memory from the directory and file paths

4. For each file, read out all the blockIds it contains and insert them into the BlocksMap

The entire loading process works as follows: the NameNode's fsimage load is quite simple. It reads the file and directory metadata from the fsimage, builds the whole namespace in memory, and saves each file's blockIds into the BlocksMap, while the DataNodes list of every block in the BlocksMap remains empty for the moment. When the fsimage load completes, the entire HDFS directory structure has been initialized in memory; what is still missing is the DataNode list corresponding to each file's blocks. That information must be obtained from the DataNodes' blockReports, so after loading the fsimage the NameNode process enters an RPC wait state, waiting for all DataNodes to send their blockReports.

The blockReport stage

At startup, each DataNode scans the local directories that hold HDFS blocks (dfs.data.dir) for all stored file blocks, and then sends the block information to the NameNode as a long array via an RPC call. After receiving a DataNode's blockReport RPC, the NameNode parses the block array out of the call and inserts the received blocks into the BlocksMap. Since at this point the BlocksMap lacks only the DataNode information for each block, and the NameNode learns from the report which DataNode the reported blocks reside on, the blockReport stage is in essence the process by which the NameNode, upon receiving each report, populates the DataNode-list triples for every block in the BlocksMap.
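Assuming the simple wire layout the text implies, three longs (blockId, numBytes, genStamp) per block, decoding a blockReport array can be sketched as below; the real BlockListAsLongs encoding differs across Hadoop versions:

```java
// Hypothetical decoder for a blockReport shipped as a flat long array.
public class BlockReportDecoder {
    public static long[][] decode(long[] wire) {
        int n = wire.length / 3;          // three longs per reported block
        long[][] blocks = new long[n][3];
        for (int i = 0; i < n; i++) {
            blocks[i][0] = wire[3 * i];      // blockId
            blocks[i][1] = wire[3 * i + 1];  // numBytes
            blocks[i][2] = wire[3 * i + 2];  // genStamp
        }
        return blocks;
    }
}
```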

Once all DataNodes have reported their blocks to the NameNode, the NameNode startup process ends, and the block -> DataNodes correspondence in the BlocksMap is fully initialized. If the safe-mode exit threshold has been reached at this point, HDFS leaves safe mode and starts serving requests.

Start-up process data acquisition and bottleneck analysis

With the NameNode's entire startup process understood in detail, the call time of each function during startup can be profiled. The profiling data again falls into two stages: the fsimage loading stage and the blockReport stage.

Data acquisition and bottleneck analysis for the fsimage loading stage

The following performance data was collected from a real fsimage load on the cluster used here.

The fsimage loading profile shows that the main cost is distributed across three operations: FSDirectory.addToParent, FSImage.readString, and PermissionStatus.read, which account for 73%, 15%, and 8% of the loading process respectively, or 96% of the entire load combined. FSImage.readString and PermissionStatus.read are reads from the fsimage file stream (a string and a short, respectively); there is little room to optimize these operations themselves, although performance can be improved by enlarging the file stream's buffer. FSDirectory.addToParent takes up 73% of the entire load, so it offers the larger optimization opportunity.

Profiling inside the addToParent call gives the following breakdown: of the 73% of time spent in addToParent, 66% is consumed by the INode.getPathComponents call; of that 66%, 36% goes to the INode.getPathNames call and 30% to INode.getPathComponents itself. The detailed distribution of these two hotspots is as follows.

The 36% of processing time spent in INode.getPathNames is used entirely to split file and directory paths with the String.split function. The roughly 30% spent in INode.getPathComponents is ultimately consumed by the native Java operation that obtains the byte array backing a String.
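As a generic illustration of the String.split hotspot: in the JDKs of that era String.split compiled a regular expression on every call, while a manual scan over the path avoids that entirely. This sketch shows the idea only; it is not Hadoop's actual fix:

```java
import java.util.*;

// Split "/a/b/c" into its components without String.split's regex machinery.
public class PathSplit {
    public static List<String> components(String path) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < path.length()) {
            int slash = path.indexOf('/', start);
            if (slash < 0) slash = path.length();
            if (slash > start) parts.add(path.substring(start, slash));
            start = slash + 1;  // skip the separator (and leading '/')
        }
        return parts;
    }
}
```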

blockReport stage performance data acquisition and bottleneck analysis

Since blockReports are RPC calls made by the DataNodes to the NameNode, after the NameNode enters the wait-for-blockReport stage, the RPC listener thread and the RPC handler threads are examined separately; the distribution of time between RPC handling and RPC authentication was measured for each. Optimizing the RPC listener thread is a separate topic, discussed in detail elsewhere, and since the blockReport work is actually performed by the RPC handler threads, only the handler threads' performance data is considered here.

The call-time performance data for the NameNode's blockReport processing is as follows.

During the NameNode startup phase, processing the blockReports sent by the DataNodes consumes most of the total RPC time (48/49 of it). Within the blockReport processing logic, the time is distributed as follows.

The numbers show that time in the blockReport stage is concentrated in the FSNamesystem.addStoredBlock call and the DatanodeDescriptor.reportDiff process, at 37/48 and 10/48 respectively. For each reported block, FSNamesystem.addStoredBlock initializes, in the NameNode's in-memory BlocksMap table, the correspondence between the block and the DataNode that reported it; the method is therefore called once per block, and indeed it is called 774,819 times over the whole process. The other expensive operation, DatanodeDescriptor.reportDiff, was described in detail above: it compares the blocks reported by a DataNode against the NameNode's in-memory BlocksMap to determine which blocks should be added to the BlocksMap, which should go onto the toRemove queue, and which onto the toInvalidate queue. This process is also time-consuming, because every reported block requires a lookup in the BlocksMap as well as in several other maps in the NameNode. As the call counts show, reportDiff is invoked only 14 times during startup (14 DataNodes report their blocks), yet it takes 10/48 of the time, so reportDiff is likewise a significant bottleneck in the blockReport process.
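The essence of reportDiff can be sketched as a set comparison. This is simplified to bare block ids; the real DatanodeDescriptor.reportDiff walks the triplet lists described earlier and also fills the toInvalidate queue:

```java
import java.util.*;

public class ReportDiff {
    // knownOnNode: block ids the NameNode already associates with this DataNode;
    // reported: block ids carried by the incoming blockReport.
    public static void diff(Set<Long> knownOnNode, Set<Long> reported,
                            List<Long> toAdd, List<Long> toRemove) {
        for (long id : reported)
            if (!knownOnNode.contains(id)) toAdd.add(id);   // newly reported
        for (long id : knownOnNode)
            if (!reported.contains(id)) toRemove.add(id);   // no longer on the node
    }
}
```

Even in this simplified form, the cost model is visible: every reported block triggers at least one map lookup, which is why reportDiff is expensive despite being called only once per DataNode.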

Alongside reportDiff, the addStoredBlock call took 37/48 of the entire blockReport time; this is the method that inserts each block reported by the DataNodes into the BlocksMap. The profiling data from inside the method call is as follows.

Within addStoredBlock, the two main costs are FSNamesystem.countNodes and DatanodeDescriptor.addBlock, both of which are in-memory table operations in Java. The countNodes call counts, for each replica of a block recorded in the BlocksMap, how many replicas are in the live state, how many are decommissioning, and how many are corrupt. During the NameNode's startup phase, the maps used to hold corrupt-state and decommissioning-state blocks are still empty, and the program logic only needs the number of live replicas. So during startup initialization, countNodes need not count the corrupt and decommissioning replicas of each block at all; counting the live replicas is sufficient. Making countNodes lighter in this way during the NameNode startup phase would save startup time.
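The lightweight counting proposed above can be sketched as follows, on the assumption that during startup a non-null DataNode slot in a block's triplets array represents exactly one live replica (names are illustrative, not the real FSNamesystem API):

```java
public class LiveReplicaCount {
    // triplets: {datanode, prev, next} per replica; a non-null datanode slot
    // means one live replica during the startup phase, since the corrupt and
    // decommissioning maps are still empty at that point.
    public static int countLive(Object[] triplets) {
        int live = 0;
        for (int i = 0; i < triplets.length; i += 3)
            if (triplets[i] != null) live++;
        return live;
    }
}
```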

2.3 Bottleneck Analysis Summary

Judging from the profiling data and the bottlenecks, the cost of the fsimage loading phase, apart from the suboptimal path splitting, lies almost entirely in native Java interface calls, such as reading data from a byte stream and obtaining the byte[] array from a String object.

The large cost of the blockReport phase, by contrast, is closely tied to the current NameNode design and its memory structures. The clearest examples are countNodes and reportDiff: not all of their work is necessary during the NameNode startup phase, so some unnecessary operations waste time in the blockReport part of NameNode initialization. Extracting only the operations the startup phase actually needs, and running just those during startup, would optimize NameNode startup performance.

