[Read Hadoop source code] [8] - DataNode - FSDataset


Block-related operations on the DataNode are handled by the dataset-related classes. The storage structure, from largest to smallest, is: volume (FSVolume), directory (FSDir), and file (block data and metadata).

 

 

Block-related

 

 

The Block class has three attributes:

private long blockId;          // block ID
private long numBytes;         // block size
private long generationStamp;  // block version number (generation stamp)

A Block is an abstraction of a data block. As discussed earlier, each block corresponds to two files on disk, one storing the data and the other storing the checksum (verification) information, for example:

blk_3148782621364391313

blk_3148782621364391313_242812.meta

In the file names above, the block ID is 3148782621364391313 and 242812 is the generation stamp (version number) of the data block. The system also records the size of the data block, which is the numBytes attribute of the class. Block provides a series of methods for accessing these attributes.
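As a small illustration, the block ID and generation stamp can be read directly out of these file names. The snippet below is a minimal sketch assuming the blk_&lt;id&gt; / blk_&lt;id&gt;_&lt;genstamp&gt;.meta naming convention shown above; it is not code from Hadoop itself.

public class BlockNameDemo {
    public static void main(String[] args) {
        String dataName = "blk_3148782621364391313";
        String metaName = "blk_3148782621364391313_242812.meta";
        // Block ID is the part after the "blk_" prefix.
        long blockId = Long.parseLong(dataName.substring("blk_".length()));
        // Generation stamp is the last underscore-separated field of the .meta name.
        String[] parts = metaName.substring(0, metaName.length() - ".meta".length()).split("_");
        long generationStamp = Long.parseLong(parts[2]);
        System.out.println(blockId + " / " + generationStamp);   // 3148782621364391313 / 242812
    }
}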

DatanodeBlockInfo records where a block lives on the local file system: the volume (FSVolume) that holds it, the file name, and the detach status of the block.

The detach status deserves an explanation (the upgrade process was analyzed earlier). During an upgrade the system creates a snapshot, and the data block files in the snapshot and in current/ are hard links pointing to the same content. If we need to modify a file in current/ without first performing a detach operation, the modification would also affect the file in the snapshot, so the hard link has to be broken first. The method is simple: copy the file into a temporary folder, then rename the temporary copy over the corresponding file in current/. After that, the file in current/ and the file in the snapshot are detached from each other. This technique is also called copy-on-write and is an effective way to improve system performance. DatanodeBlockInfo.detachBlock() can detach both the data file and the metadata file of a block.
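The following is a minimal sketch of that copy-then-rename idea, not the actual detachBlock implementation; the helper name detachFile and the use of java.nio.file are illustrative assumptions.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class DetachSketch {
    // Break the hard link between a file in current/ and its snapshot copy.
    static void detachFile(File file, File detachDir) throws IOException {
        File tmpCopy = new File(detachDir, file.getName());
        // 1. Copy the shared (hard-linked) file into the detach/ working area.
        Files.copy(file.toPath(), tmpCopy.toPath(), StandardCopyOption.REPLACE_EXISTING);
        // 2. Move the copy back over the original: current/ now gets its own inode,
        //    while the snapshot keeps the old one.
        Files.move(tmpCopy.toPath(), file.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}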

 

Relationship between FSDataset, FSVolumeSet, FSVolume, and FSDir

A DataNode can be configured with multiple storage directories (partitions) for data blocks, and HDFS limits the number of blocks that a single directory may hold, so each storage directory in turn contains multiple sub-directories.

Correspondingly, in FSDataset an FSVolume corresponds to one storage directory and an FSDir corresponds to one directory inside it. All FSVolume objects are managed by an FSVolumeSet, so FSDataset manages all of its storage through a single FSVolumeSet object.
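The containment between these classes can be summarized roughly as follows; the field names below are illustrative, not the exact Hadoop declarations.

class FSDataset {
    FSVolumeSet volumes;      // manages all storage directories of this DataNode
}

class FSVolumeSet {
    FSVolume[] volumes;       // one FSVolume per configured storage directory (partition)
    int curVolume;            // round-robin cursor used by getNextVolume()
}

class FSVolume {
    FSDir dataDir;            // root of the current/ directory tree
    File tmpDir;              // tmp/ area of this storage directory
    File detachDir;           // detach/ area used for copy-on-write
}

class FSDir {
    File dir;                 // this directory on disk
    FSDir[] children;         // up to maxBlocksPerDir sub-directories
}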

 

FSDir-related

FSDir corresponds to a directory on the DataNode's local file system that stores data block files and their metadata files. By default, each directory holds at most 64 data blocks and at most 64 sub-directories. When an FSDir is initialized, it recursively scans the directories and files beneath it to build a tree structure.

The addBlock method adds a block to the current directory; if the current directory cannot hold any more blocks, the block is added to a sub-directory, and sub-directories are created if none exist. The getBlockInfo and getVolumeMap methods recursively collect information about all blocks under the current directory. The clearPath method updates the information of all directories along a file path when a file is deleted.

FSDir has four attributes:

File dir;              // the directory under the storage path's current/
int numBlocks = 0;     // number of data blocks currently stored in this directory
FSDir children[];      // sub-directories of this directory
int lastChildIdx = 0;  // index of the sub-directory that stored the previous data block

 

public File addBlock(Block b, File src) throws IOException

private File addBlock(Block b, File src, boolean createOk, boolean resetIdx) throws IOException

 

When a data block arrives at the DataNode, the DataNode does not immediately pick a storage directory for it under current/. The block is first written into the tmp/ sub-directory of the storage path; only after the block has been completely and successfully received is it moved into an appropriate directory under current/.

 

The DataNode first stores a file's data blocks directly in the current/ sub-directory of the storage path. Once current/ already holds maxBlocksPerDir data blocks, maxBlocksPerDir sub-directories are created under current/ and one of them is chosen to hold the next block; if the chosen sub-directory also already holds maxBlocksPerDir data blocks, maxBlocksPerDir sub-directories are created beneath it and one of them is selected, and so on recursively, until the remaining space in the storage path is no longer enough to hold a data block. The default value of maxBlocksPerDir is 64, and it can be changed through the DataNode configuration option dfs.datanode.numblocks. A simplified sketch of this placement logic follows.
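The sketch below illustrates that recursive placement. It is a simplified reading of the behaviour described above, not the actual FSDir.addBlock code, and details such as the sub-directory naming are assumptions.

public File addBlock(Block b, File src) throws IOException {
    if (numBlocks < maxBlocksPerDir) {
        // Room left in this directory: move the block file here.
        File dest = new File(dir, src.getName());
        if (!src.renameTo(dest)) {
            throw new IOException("Failed to move " + src + " to " + dest);
        }
        numBlocks++;
        return dest;
    }
    if (children == null) {
        // Directory is full: create maxBlocksPerDir sub-directories (subdir0 .. subdirN-1).
        children = new FSDir[maxBlocksPerDir];
        for (int i = 0; i < maxBlocksPerDir; i++) {
            children[i] = new FSDir(new File(dir, "subdir" + i));
        }
    }
    // Pick the next sub-directory in round-robin order and recurse into it.
    lastChildIdx = (lastChildIdx + 1) % children.length;
    return children[lastChildIdx].addBlock(b, src);
}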

 

FSVolume-related

The main attributes of the FSVolume class are:

private FSDir dataDir;    // final location of valid data blocks (current/)
private File tmpDir;      // intermediate location of incoming data blocks (tmp/)
private File detachDir;   // copy-on-write area for data blocks (detach/)
private DF usage;         // disk partition usage of the storage directory (df)
private DU dfsUsage;      // space consumed by the storage directory itself (du)
private long reserved;    // reserved storage space size

 

Each FSVolume performs recovery during initialization, doing its best to restore data blocks that may have been affected by a crash or shutdown of the node the DataNode runs on:

1) For every data block file under detach/ (detach/ contains only files, no sub-directories), if the file does not exist under current/, move it to current/; finally, clear the detach/ directory.

2) If the DataNode is configured to support append (configuration item dfs.support.append), then for every data block file under blocksBeingWritten/ (which likewise contains only files, no sub-directories), if the file does not exist under current/, move it to current/, and finally clear the blocksBeingWritten/ directory; otherwise, simply clear the blocksBeingWritten/ directory.

FSVolume(File currentDir, Configuration conf) throws IOException {
    this.reserved = conf.getLong("dfs.datanode.du.reserved", 0);
    this.dataDir = new FSDir(currentDir);
    this.currentDir = currentDir;
    boolean supportAppends = conf.getBoolean("dfs.support.append", false);
    File parent = currentDir.getParentFile();

    this.detachDir = new File(parent, "detach");
    if (detachDir.exists()) {
        recoverDetachedBlocks(currentDir, detachDir);      // restore from the detach/ directory
    }

    this.tmpDir = new File(parent, "tmp");
    if (tmpDir.exists()) {
        FileUtil.fullyDelete(tmpDir);                      // delete the tmp/ directory
    }

    blocksBeingWritten = new File(parent, "blocksBeingWritten");
    if (blocksBeingWritten.exists()) {
        if (supportAppends) {
            recoverBlocksBeingWritten(blocksBeingWritten); // recover from blocksBeingWritten/
        } else {
            FileUtil.fullyDelete(blocksBeingWritten);
        }
    }
    ...
    this.usage = new DF(parent, conf);
    this.dfsUsage = new DU(parent, conf);
    this.dfsUsage.start();
}

 

Other important functions are:

File addBlock(Block b, File f) throws IOException {
    // f may be a file in the tmp/ directory
    File blockFile = dataDir.addBlock(b, f);
    File metaFile = getMetaFile(blockFile, b);
    dfsUsage.incDfsUsed(b.getNumBytes() + metaFile.length());
    return blockFile;
}

long getCapacity() throws IOException {
    if (reserved > usage.getCapacity()) {
        return 0;
    }
    return usage.getCapacity() - reserved;  // calling getCapacity() twice here is unnecessary
}

long getAvailable() throws IOException {
    long remaining = getCapacity() - getDfsUsed();
    long available = usage.getAvailable();
    // Arguably a bug: this should be usage.getAvailable() - reserved, since
    // reserved has already been subtracted in getCapacity() above.
    if (remaining > available) {
        remaining = available;
    }
    return (remaining > 0) ? remaining : 0;
}

 

The DF and DU classes periodically refresh the partition's space usage information so that these statistics stay accurate; they are described below.

Use of DU and DF

To accurately obtain the total capacity, used space, and remaining space of a DataNode's storage, HDFS wraps the Unix df and du commands: df reports the usage of the local disk partition, and du reports the size of a directory or file.

HDFS implements the Unix df command with the org.apache.hadoop.fs.DF class and the Unix du command with the org.apache.hadoop.fs.DU class. Both classes work by executing the corresponding shell command from Java and parsing its output.
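As a rough illustration of that wrapping approach (not the actual DU class), the sketch below runs du -sk on a directory and parses the first field of its output:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class DuSketch {
    // Returns the size of a directory in kilobytes, as reported by "du -sk <dir>".
    static long getUsedKB(String dir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("du", "-sk", dir).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();              // e.g. "123456\t/data/dfs/data"
            if (p.waitFor() != 0 || line == null) {
                throw new IOException("du failed for " + dir);
            }
            return Long.parseLong(line.split("\\s+")[0]);
        }
    }
}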

A simple class diagram of DF and DU is shown here (figure omitted).

 

FSVolumeSet

FSVolumeSet manages all of the FSVolume objects, which is to say all of the storage paths. When the upper layer (the DataNode process) needs to store a data block, FSVolumeSet selects a storage path (partition) for it, i.e. creates the corresponding local disk file for the block; it is also responsible for gathering status information about its storage and collecting information about all data blocks. The getNextVolume() method in FSVolumeSet performs load balancing by treating the volumes as a circular queue.

synchronized FSVolume getNextVolume(long blockSize) throws IOException {
    if (curVolume >= volumes.length) {
        curVolume = 0;
    }
    int startVolume = curVolume;
    while (true) {
        FSVolume volume = volumes[curVolume];
        curVolume = (curVolume + 1) % volumes.length;
        if (volume.getAvailable() > blockSize) {
            return volume;
        }
        if (curVolume == startVolume) {
            throw new DiskOutOfSpaceException("Insufficient space for an additional block");
        }
    }
}

 

A diagram summarizing the relationships described above is shown here (figure omitted).

 

 

FSDataset-related

FSDataset and its methods are fairly complex; refer to the code for the details. The following methods are the important ones:

public BlockWriteStreams writeToBlock(Block b, boolean isRecovery) throws IOException;
Obtains the output streams for a block. BlockWriteStreams contains both the data output stream and the metadata (checksum file) output stream. This is a very complicated method.

The isRecovery parameter indicates whether this write is a recovery of a previously failed write. Consider the normal write path first. If the requested block is already a valid (finalized) data block, or another thread is already writing it, writeToBlock throws an exception. Otherwise a temporary data file and a temporary metadata file are created, the related bookkeeping is set up, an ActiveFile object is created and recorded in ongoingCreates, and the resulting BlockWriteStreams is returned. As mentioned earlier, when a new ActiveFile is created, the current thread is automatically recorded in the ActiveFile's thread list.
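A minimal sketch of that normal path is shown below; helper names such as createTmpFile, getMetaFile, and the ActiveFile constructor are assumptions made for illustration, and the real FSDataset code handles many more cases.

synchronized BlockWriteStreams writeToBlock(Block b, boolean isRecovery) throws IOException {
    if (isValidBlock(b)) {
        throw new IOException("Block " + b + " is already valid, cannot be written to");
    }
    if (ongoingCreates.containsKey(b)) {
        throw new IOException("Block " + b + " is already being written");
    }
    FSVolume v = volumes.getNextVolume(b.getNumBytes());  // pick a volume with enough space
    File dataFile = v.createTmpFile(b);                   // tmp/blk_<id>      (assumed helper)
    File metaFile = getMetaFile(dataFile, b);             // tmp/blk_<id>_<genstamp>.meta
    // Record the in-progress write; the current thread is tracked inside the ActiveFile.
    ongoingCreates.put(b, new ActiveFile(dataFile, Thread.currentThread()));
    return new BlockWriteStreams(new FileOutputStream(dataFile), new FileOutputStream(metaFile));
}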

Taking blk_3148782621364391313 as an example: when the DataNode needs to create a write stream for block 3148782621364391313, it creates tmp/blk_3148782621364391313 as the temporary data file, and the corresponding metadata file is tmp/blk_3148782621364391313_XXXXXX.meta, where XXXXXX is the generation stamp (version number).

When isRecovery is true, we need to recover from an unsuccessful write, and the flow is more complex than the normal one. If the write failed because the acknowledgement after commit was not received (see the finalizeBlock method), a detached copy (backup) of the block is created first. Then writeToBlock checks whether other threads are still writing the block's files; if so, those threads are forcibly stopped via Thread.interrupt(), which means any thread still writing the corresponding block file is terminated, and the corresponding entry is removed from ongoingCreates. Next, the temporary data file and temporary metadata file are created or reused depending on whether they already exist. The remaining steps are the same as in the normal path: an ActiveFile object is created from the relevant information and recorded in ongoingCreates, and so on.

This part involves some strategies for writing files to HDFS, so we will continue to discuss this topic later.

 

public void updateBlock(Block oldBlock, Block newBlock) throws IOException;
Update a block. This is also a complicated method.

The outermost layer of updateBlock is an infinite loop whose exit condition is that no write thread remains associated with the data block. On each iteration, updateBlock calls an internal method named tryUpdateBlock. If tryUpdateBlock finds that no thread is writing the block, it switches everything over to the new block, including the metadata file and the in-memory mapping table volumeMap. If tryUpdateBlock finds active threads associated with the block, updateBlock tries to stop them and waits for them with join() before retrying.
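The control flow just described can be sketched roughly as follows; the shape of tryUpdateBlock's return value is an assumption made for illustration and does not match the real code exactly.

public void updateBlock(Block oldBlock, Block newBlock) throws IOException {
    while (true) {
        // Hypothetically returns the threads still writing oldBlock,
        // or null if the update succeeded because no writer was active.
        List<Thread> writers = tryUpdateBlock(oldBlock, newBlock);
        if (writers == null) {
            return;  // data file, metadata file, and volumeMap have been updated
        }
        // Stop the active writers and wait for them before retrying.
        for (Thread t : writers) {
            t.interrupt();
        }
        for (Thread t : writers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IOException("Interrupted while waiting for writer threads to finish");
            }
        }
    }
}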

 

public void finalizeBlock(Block b) throws IOException;
Commits (finalizes) a block previously opened through writeToBlock. This means the write completed without error, and the block can officially be moved from the tmp/ folder into the current/ folder. In FSDataset, finalizeBlock removes the corresponding entry from ongoingCreates and puts the block's DatanodeBlockInfo into volumeMap. Taking blk_3148782621364391313 as an example: when the DataNode finalizes block 3148782621364391313, it moves tmp/blk_3148782621364391313 into a directory under current/, say subdir12, so the file ends up at current/subdir12/blk_3148782621364391313. The corresponding metadata file is also placed in the current/subdir12 directory.
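A condensed sketch of those steps is given below; the volume lookup getVolumeFor is a hypothetical helper used only for illustration, and the real method contains additional checks.

public synchronized void finalizeBlock(Block b) throws IOException {
    ActiveFile activeFile = ongoingCreates.get(b);
    if (activeFile == null) {
        throw new IOException("Block " + b + " is not being written to");
    }
    File tmpFile = activeFile.file;              // e.g. tmp/blk_3148782621364391313
    FSVolume v = getVolumeFor(b);                // hypothetical helper: volume holding the tmp file
    File dest = v.addBlock(b, tmpFile);          // moved into current/subdirNN/
    volumeMap.put(b, new DatanodeBlockInfo(v, dest));
    ongoingCreates.remove(b);
}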

 
