The write process for HDFS

HDFS, an important part of Hadoop, plays a key role as the back-end file store. HDFS targets low-end commodity servers and workloads with many reads and few writes. Because data corruption is more likely in distributed storage, HDFS implements data checksums and a multi-replica placement strategy to guarantee reliability and integrity; the checksum scheme HDFS mostly uses is CRC (cyclic redundancy check). HDFS is a block-based distributed file system, in contrast to object-based distributed file systems such as Ceph.

Here are a few of the concepts involved in the write operation in HDFS:

1. Block: In HDFS, every file is stored as blocks, and each block is placed on a different DataNode. A block's identity is the triple (blockId, numBytes, generationStamp); the blockId is unique and its assignment is decided by the NameNode. The DataNode then creates the block file and a corresponding block meta file, whose content is mostly the checksums of the corresponding data (a minimal sketch of the triple follows this list).

2. Packet: During communication between the DFSClient and a DataNode, data is sent and received packet by packet.

3. Chunk: The Chinese term could also be translated as "block", but to distinguish it from Block it is called a chunk here. While files are transferred between the DFSClient and DataNodes block by block, the data inside a block is sent packet by packet; each packet contains multiple chunks, and a checksum is computed for each chunk, producing the checksum bytes.
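As referenced in item 1, here is a minimal sketch of the block identity triple as a plain Java class; the field names follow the description above and are illustrative rather than the exact Hadoop source.

// Minimal sketch of the block identity triple (blockId, numBytes, generationStamp);
// names follow the description above, not necessarily the Hadoop source.
public class BlockIdentity {
    final long blockId;          // unique; its assignment is decided by the NameNode
    final long numBytes;         // number of data bytes stored in the block
    final long generationStamp;  // version stamp used to tell stale replicas apart

    BlockIdentity(long blockId, long numBytes, long generationStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.generationStamp = generationStamp;
    }
}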

Here is the data format of a packet:

+-----------------------------------------------------------------------+
| 4 byte packet length (excludes packet header)                         |
+-----------------------------------------------------------------------+
| 8 byte offset in the block          | 8 byte sequence number          |
+-----------------------------------------------------------------------+
| 1 byte isLastPacketInBlock                                            |
+-----------------------------------------------------------------------+
| 4 byte length of actual data                                          |
+-----------------------------------------------------------------------+
| x byte checksum data; x is defined below                              |
+-----------------------------------------------------------------------+
| actual data ...                                                       |
+-----------------------------------------------------------------------+

x = (length of data + bytesPerChecksum - 1) / bytesPerChecksum * checksumSize
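As a quick sanity check of the formula, a short sketch using the common HDFS defaults of 512-byte chunks and 4-byte CRC32 checksums; the 64 KB packet size is an assumed example:

public class ChecksumLength {
    public static void main(String[] args) {
        int dataLen = 64 * 1024;        // 64 KB of actual data in the packet (illustrative)
        int bytesPerChecksum = 512;     // chunk size
        int checksumSize = 4;           // a CRC32 checksum is 4 bytes
        // Round up to a whole number of chunks, then one checksum per chunk:
        int x = (dataLen + bytesPerChecksum - 1) / bytesPerChecksum * checksumSize;
        System.out.println(x);          // (65536 + 511) / 512 = 128 chunks -> 128 * 4 = 512 bytes
    }
}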

Communication between the DFSClient and DataNodes follows a client/server (C/S) structure. The important classes involved in the write path on the DFSClient side are FSDataOutputStream, DFSOutputStream, and FSOutputSummer.
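Before diving into the internals, here is a minimal user-level write for orientation, using the public Hadoop API: FSDataOutputStream is the stream the caller sees, while DFSOutputStream and FSOutputSummer do the chunking and checksumming underneath. The NameNode address and file path are placeholders, and the fs.defaultFS key assumes Hadoop 2.x or later (older releases use fs.default.name).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder address
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.write("hello hdfs".getBytes("UTF-8"));      // chunked + checksummed internally
        }
        fs.close();
    }
}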

The following is an analysis of the DFSClient write process:

1. Create the DFSClient object, then call the ClientProtocol remote procedure addBlock to get the concrete block information, which is returned to the DFSClient.

2. Use the returned LocatedBlock information to establish a pipeline across multiple DataNodes:

+-----------+  Connect   +------------+  Connect   +------------+  Connect   +------------+
| DFSClient | ---------> |  DataNode  | ---------> |  DataNode  | ---------> |  DataNode  |
|           | <--------- |            | <--------- |            | <--------- |            |
+-----------+    Ack     +------------+    Ack     +------------+    Ack     +------------+

The DFSClient must establish connections with the DataNodes before sending any packet; data transfer can start only after the pipeline is in place. A minimal sketch of this forward-data/ack-back pattern follows.
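The sketch below is a self-contained illustration of the pipeline idea using plain sockets on localhost; it is not the actual HDFS wire protocol, and all port numbers are arbitrary. Each node connects to its downstream neighbor before any data flows, forwards the packet it receives, waits for the downstream ack, and only then acks upstream.

import java.io.*;
import java.net.*;

public class PipelineSketch {
    // Start a pipeline node: listen on `port`; if `downstreamPort` is non-null,
    // forward the received packet to the next node and wait for its ack before
    // acking upstream (the last node acks immediately).
    static void startNode(int port, Integer downstreamPort) throws IOException {
        ServerSocket server = new ServerSocket(port);
        new Thread(() -> {
            try (Socket up = server.accept();
                 // connect downstream first: the pipeline exists before data flows
                 Socket down = downstreamPort == null ? null
                             : new Socket("localhost", downstreamPort)) {
                DataInputStream in = new DataInputStream(up.getInputStream());
                DataOutputStream ack = new DataOutputStream(up.getOutputStream());
                int len = in.readInt();                  // one "packet": length-prefixed bytes
                byte[] buf = new byte[len];
                in.readFully(buf);
                if (down != null) {                      // forward, then wait for downstream ack
                    DataOutputStream out = new DataOutputStream(down.getOutputStream());
                    out.writeInt(len);
                    out.write(buf);
                    new DataInputStream(down.getInputStream()).readBoolean();
                }
                ack.writeBoolean(true);                  // ack flows back upstream
                server.close();
            } catch (IOException e) { e.printStackTrace(); }
        }).start();
    }

    public static void main(String[] args) throws Exception {
        startNode(9003, null);                           // last DataNode in the pipeline
        startNode(9002, 9003);                           // middle DataNode
        startNode(9001, 9002);                           // first DataNode
        try (Socket client = new Socket("localhost", 9001)) {   // plays the DFSClient
            DataOutputStream out = new DataOutputStream(client.getOutputStream());
            byte[] packet = "hello pipeline".getBytes("UTF-8");
            out.writeInt(packet.length);
            out.write(packet);
            boolean ack = new DataInputStream(client.getInputStream()).readBoolean();
            System.out.println("ack received: " + ack);
        }
    }
}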

3. The DFSClient writes data through FSOutputSummer's write(byte[] buffer, int offset, int len). Each time a full chunk has accumulated, DFSOutputStream's writeChunk(byte[] b, int offset, int len, byte[] checksum) is called; each time a full packet has accumulated, the completed packet is placed on the send queue, and the sending thread of the DataStreamer object transmits it. (The connections that form the data pipeline across the DataNodes are also established by the DataStreamer; when a block is finished, another pipeline is built and the process above repeats.)
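Here is a simplified sketch of the FSOutputSummer-style buffering described in step 3: bytes accumulate into 512-byte chunks, each completed chunk gets a CRC32 checksum, and the chunk is handed off as a writeChunk call. This mirrors the mechanism rather than the actual Hadoop classes; in the real code a trailing partial chunk is flushed with its checksum on close.

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ChunkedChecksumWriter {
    static final int BYTES_PER_CHECKSUM = 512;    // the common HDFS chunk size
    private final byte[] chunk = new byte[BYTES_PER_CHECKSUM];
    private int filled = 0;

    // Mirrors FSOutputSummer.write(byte[], int, int): copy bytes into the chunk
    // buffer and emit a checksummed chunk each time the buffer fills up.
    public void write(byte[] buffer, int offset, int len) {
        while (len > 0) {
            int n = Math.min(len, BYTES_PER_CHECKSUM - filled);
            System.arraycopy(buffer, offset, chunk, filled, n);
            filled += n; offset += n; len -= n;
            if (filled == BYTES_PER_CHECKSUM) flushChunk();
        }
    }

    // Compute the CRC32 of the completed chunk and hand it off; in HDFS this
    // hand-off is DFSOutputStream.writeChunk(b, offset, len, checksum).
    private void flushChunk() {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, filled);
        byte[] checksum = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        writeChunk(chunk, 0, filled, checksum);
        filled = 0;
    }

    protected void writeChunk(byte[] b, int off, int len, byte[] checksum) {
        System.out.printf("chunk: %d bytes, crc32=%08x%n",
                len, ByteBuffer.wrap(checksum).getInt());
    }

    public static void main(String[] args) {
        ChunkedChecksumWriter w = new ChunkedChecksumWriter();
        w.write(new byte[1300], 0, 1300);   // prints two full 512-byte chunks; the
                                            // 276-byte remainder stays buffered
    }
}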

Additional note: the request header the DFSClient sends when establishing the connection with a DataNode is similar to the header of the message that establishes a read request between the DFSClient and a DataNode, so it is not described again here.

Corresponding to the DFSClient, the DataNode needs matching daemon threads to receive the data. The following is an analysis of the receive process on the DataNode:

1. An important thread, DataXceiverServer, is started on the DataNode to listen for connection requests from DFSClients or other DataNodes.

2. When a DFSClient initiates a connection, the DataXceiverServer responds to the request and creates a DataXceiver thread to serve it; the next step is to establish the pipeline connection and verify the header information of the request message that was sent over.

3. If the received request is OP_WRITE_BLOCK, a BlockReceiver object is created to receive the concrete data, which arrives in two parts that are stored in different places: the data portion is placed in the standard block file, while the checksum data is kept in a file with the .meta suffix and is used for subsequent data integrity checks (verification by either the DFSClient or the BlockScanner). A simplified sketch of this receive path follows.
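Finally, a simplified, self-contained sketch of the receive side: an accept loop spawns one handler thread per connection (the DataXceiverServer/DataXceiver pattern), and each handler verifies a chunk's CRC32 before appending the data to a block file and the checksum to a .meta file. The framing, file names, and port here are simplified assumptions, not the real HDFS protocol.

import java.io.*;
import java.net.*;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class BlockReceiverSketch {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(9000);   // DataXceiverServer: listen forever
        while (true) {
            Socket s = server.accept();                 // one DataXceiver thread per request
            new Thread(() -> receiveBlock(s)).start();
        }
    }

    // BlockReceiver role: read length-prefixed chunks (a zero length ends the
    // block in this simplified framing), verify each checksum, and split the
    // stream into a block file and a .meta checksum file.
    static void receiveBlock(Socket s) {
        try (DataInputStream in = new DataInputStream(s.getInputStream());
             FileOutputStream blockFile = new FileOutputStream("blk_0001");
             FileOutputStream metaFile = new FileOutputStream("blk_0001.meta")) {
            int len = in.readInt();
            while (len > 0) {
                byte[] data = new byte[len];
                byte[] checksum = new byte[4];
                in.readFully(data);
                in.readFully(checksum);
                CRC32 crc = new CRC32();                // recompute and compare the checksum
                crc.update(data, 0, len);
                if ((int) crc.getValue() != ByteBuffer.wrap(checksum).getInt())
                    throw new IOException("checksum mismatch");
                blockFile.write(data);                  // data -> block file
                metaFile.write(checksum);               // checksum -> .meta file
                len = in.readInt();
            }
        } catch (IOException e) { e.printStackTrace(); }
    }
}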
