The RPC mechanism described above is not used for sending and receiving data blocks on a DataNode, for a simple reason: RPC is a request/response interface, while the DataNode processes block data as a stream. DataXceiverServer and DataXceiver implement this streaming mechanism, with the help of two auxiliary classes: BlockSender and BlockReceiver. (The original post includes a class diagram of these classes.)
DataXceiverServer
DataXceiverServer is relatively simple. It creates a ServerSocket to accept requests; each time a connection is accepted, it creates a DataXceiver to handle it and stores the Socket in a Map named childSockets. It also creates a BlockBalanceThrottler object to limit the number of concurrent DataXceivers and balance traffic. The main logic is in the run function.
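The accept-loop pattern described above can be sketched as follows. This is a hypothetical, simplified model (the class name MiniXceiverServer and the inline handler are my own; the real DataXceiverServer spawns a DataXceiver thread and enforces the limit differently), meant only to show the shape of the loop: accept, track the socket in a map, hand off to a worker.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the DataXceiverServer accept loop: accept a
// connection, remember the socket in a childSockets-style map, and hand
// the connection to a worker thread.
class MiniXceiverServer implements Runnable {
    private final ServerSocket ss;
    private final Map<Socket, Thread> childSockets = new ConcurrentHashMap<>();
    private final int maxXceiverCount;

    MiniXceiverServer(ServerSocket ss, int maxXceiverCount) {
        this.ss = ss;
        this.maxXceiverCount = maxXceiverCount;
    }

    @Override
    public void run() {
        while (!ss.isClosed()) {
            try {
                Socket s = ss.accept();
                if (childSockets.size() >= maxXceiverCount) {
                    s.close();              // refuse: too many concurrent handlers
                    continue;
                }
                Thread worker = new Thread(() -> handle(s));
                childSockets.put(s, worker); // track the socket, like childSockets
                worker.start();
            } catch (IOException e) {
                break;                       // server socket was closed
            }
        }
    }

    // Stand-in for DataXceiver: a real handler would read the op header
    // and dispatch; this one just closes the connection and deregisters.
    private void handle(Socket s) {
        try { s.close(); } catch (IOException ignored) { }
        childSockets.remove(s);
    }
}
```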
Several configuration parameters are worth mentioning:
dfs.datanode.max.xcievers: the maximum number of DataXceiver instances per node (note the historical misspelling of "xceivers"). The default value is 256. Setting it too high may exhaust memory, since each DataXceiver is a thread with its own buffers.
dfs.block.size: the block size, needed here to estimate whether a partition has enough free space.
dfs.balance.bandwidthPerSec: the network bandwidth available for copying blocks between nodes when running start-balancer.sh. The default value is 1 MB/s.
BlockBalanceThrottler (which extends BlockTransferThrottler) performs the bandwidth control.
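The throttling idea can be sketched as follows. This is a hedged approximation of how BlockTransferThrottler works (the class name SimpleThrottler and the field names are my own): within each fixed accounting period the sender may move at most a fixed byte budget, and once the budget is spent, throttle() sleeps until the period rolls over.

```java
// Simplified sketch of period-based bandwidth throttling, in the spirit of
// BlockTransferThrottler: callers report bytes sent via throttle(), which
// blocks once the per-period budget is exhausted.
class SimpleThrottler {
    private final long bytesPerPeriod;  // byte budget per accounting period
    private final long periodMs;        // accounting window, e.g. 500 ms
    private long periodStart = System.currentTimeMillis();
    private long bytesLeft;

    SimpleThrottler(long bytesPerSecond, long periodMs) {
        this.periodMs = periodMs;
        this.bytesPerPeriod = bytesPerSecond * periodMs / 1000;
        this.bytesLeft = bytesPerPeriod;
    }

    synchronized void throttle(long numBytes) {
        bytesLeft -= numBytes;
        while (bytesLeft <= 0) {
            long wait = periodStart + periodMs - System.currentTimeMillis();
            if (wait > 0) {
                try {
                    Thread.sleep(wait);            // budget spent: wait for next period
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            periodStart = System.currentTimeMillis();
            bytesLeft += bytesPerPeriod;           // refill the budget
        }
    }
}
```

With dfs.balance.bandwidthPerSec set to 1 MB/s, a balancer moving data faster than that would spend most of its time sleeping inside throttle().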
DataXceiver
The main process of DataXceiver is in the run function:
public void run() {
    DataInputStream in = null;
    try {
        in = new DataInputStream(new BufferedInputStream(
            NetUtils.getInputStream(s), SMALL_BUFFER_SIZE));
        short version = in.readShort();
        byte op = in.readByte();
        int curXceiverCount = datanode.getXceiverCount();
        if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
            throw new IOException("xceiverCount " + curXceiverCount
                + " exceeds the limit of concurrent xcievers "
                + dataXceiverServer.maxXceiverCount);
        }
        switch (op) {
        case DataTransferProtocol.OP_READ_BLOCK:
            readBlock(in);
            break;
        case DataTransferProtocol.OP_WRITE_BLOCK:
            writeBlock(in);
            break;
        case DataTransferProtocol.OP_READ_METADATA:
            readMetadata(in);
            break;
        case DataTransferProtocol.OP_REPLACE_BLOCK:
            // for balancing purpose; send to a destination
            replaceBlock(in);
            break;
        case DataTransferProtocol.OP_COPY_BLOCK:
            copyBlock(in);
            break;
        case DataTransferProtocol.OP_BLOCK_CHECKSUM:
            // get the checksum of a block
            getBlockChecksum(in);
            break;
        default:
            throw new IOException("Unknown opcode " + op + " in data stream");
        }
    } finally {
        // error handling elided here; the stream is closed on exit
        IOUtils.closeStream(in);
    }
}
1. From the above we can see that the request header sent by the client is laid out as follows:
+ ---------------------------------------------- +
| 2 bytes version | 1 byte OP |
+ ---------------------------------------------- +
2. OP supports six operations, which are defined in DataTransferProtocol.
public static final byte OP_WRITE_BLOCK = (byte) 80;
public static final byte OP_READ_BLOCK = (byte) 81;
public static final byte OP_READ_METADATA = (byte) 82;
public static final byte OP_REPLACE_BLOCK = (byte) 83;
public static final byte OP_COPY_BLOCK = (byte) 84;
public static final byte OP_BLOCK_CHECKSUM = (byte) 85;
The most complex operations are read and write.
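The 3-byte operation header (2-byte version, then 1-byte opcode) can be framed and parsed with plain DataOutputStream/DataInputStream calls, which is essentially what the run function above does on its input stream. The sketch below uses the opcode values listed above; the version number 14 is illustrative only, and the class name OpHeader is my own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the 3-byte operation header: 2-byte version then 1-byte opcode.
class OpHeader {
    static final byte OP_WRITE_BLOCK    = (byte) 80;
    static final byte OP_READ_BLOCK     = (byte) 81;
    static final byte OP_READ_METADATA  = (byte) 82;
    static final byte OP_REPLACE_BLOCK  = (byte) 83;
    static final byte OP_COPY_BLOCK     = (byte) 84;
    static final byte OP_BLOCK_CHECKSUM = (byte) 85;

    static byte[] encode(short version, byte op) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(version);  // 2 bytes
        out.writeByte(op);        // 1 byte
        return buf.toByteArray();
    }

    static byte decodeOp(byte[] frame) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
        in.readShort();           // version; the real DataXceiver validates it
        return in.readByte();     // opcode drives the switch dispatch
    }
}
```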
OP_READ_METADATA is the read-block-metadata operation. The header of the client request is as follows:
+ ------------------------------------------------ +
| 8 byte Block ID | 8 byte genstamp |
+ ------------------------------------------------ +
The returned data is as follows:
+ ------------------------------------------------ +
| 1 byte status | 4 byte length of metadata |
+ ------------------------------------------------ +
| Meta data | 0 |
+ ------------------------------------------------ +
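The reply layout above can be sketched as a small encoder. This is a hypothetical helper (the class name MetaReply is my own, and I am assuming the trailing "0" in the layout is a 4-byte int end marker): a 1-byte status, a 4-byte metadata length, the metadata bytes, then the marker.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical encoder for the OP_READ_METADATA reply layout:
// 1-byte status + 4-byte length + metadata + trailing marker.
class MetaReply {
    static byte[] encode(byte status, byte[] meta) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(status);     // 1 byte status
        out.writeInt(meta.length); // 4 byte length of metadata
        out.write(meta);           // metadata bytes
        out.writeInt(0);           // trailing "0" (assumed to be a 4-byte int)
        return buf.toByteArray();
    }
}
```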
OP_REPLACE_BLOCK replaces block data and is mainly used for load balancing: the DataXceiver receives a block and writes it to disk, and after the operation completes it notifies the NameNode to delete the source replica. The header of the client request is as follows:
+ ------------------------------------------------ +
| 8 byte Block ID | 8 byte genstamp |
+ ------------------------------------------------ +
| 4 byte length | source node id |
+ ------------------------------------------------ +
| Source data node |
+ ----------------------- +
The specific process is as follows: send a copy-block request (OP_COPY_BLOCK) to the source datanode, receive the response from the source datanode, create a BlockReceiver to receive the block data, and finally notify the NameNode that the block has been received.
OP_COPY_BLOCK is the block-replication operation, mainly used for load balancing: it sends block data to the datanode that initiated the request. DataXceiver creates a BlockSender object to send the data. The request header is as follows:
+ ------------------------------------------------ +
| 8 byte Block ID | 8 byte genstamp |
+ ------------------------------------------------ +
OP_BLOCK_CHECKSUM is the get-block-checksum operation: it computes an MD5 digest over all of the block's CRC checksums. The header of the client request is as follows:
+ ------------------------------------------------ +
| 8 byte Block ID | 8 byte genstamp |
+ ------------------------------------------------ +
The returned data is as follows:
+ ------------------------------------------------ +
| 2 byte status | 4 byte bytes per CRC |
+ ------------------------------------------------ +
| 8 byte CRC per block | 16 byte md5 digest |
+ ------------------------------------------------ +
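The reply layout above, including the MD5 digest over the block's CRC data, can be sketched like this. The class name ChecksumReply is my own; the field order and widths follow the diagram, and java.security.MessageDigest provides the 16-byte MD5.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical encoder for the OP_BLOCK_CHECKSUM reply: 2-byte status,
// 4-byte bytes-per-CRC, 8-byte CRCs-per-block, 16-byte MD5 digest of
// the block's CRC data.
class ChecksumReply {
    static byte[] encode(short status, int bytesPerCrc, long crcPerBlock,
                         byte[] crcData) throws IOException, NoSuchAlgorithmException {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(crcData);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(status);     // 2 bytes
        out.writeInt(bytesPerCrc);  // 4 bytes
        out.writeLong(crcPerBlock); // 8 bytes
        out.write(md5);             // 16 bytes (MD5 digests are always 16 bytes)
        return buf.toByteArray();
    }
}
```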
OP_READ_BLOCK is the block-read operation. DataXceiver creates a BlockSender object to send data to the client. The header of the client request is as follows:
+ ----------------------------------- +
| 8 byte Block ID | 8 byte genstamp |
+ ----------------------------------- +
| 8 byte start offset | 8 byte length |
+ ----------------------------------- +
| 4 byte length | <DFSClient id> |
+ ----------------------------------- +
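A hypothetical encoder mirroring the OP_READ_BLOCK header layout above: block ID, generation stamp, start offset, length, then the client identifier. Note an assumption here: the real client writes the name via Hadoop's Text serialization, while this sketch writes a plain 4-byte length plus UTF-8 bytes to match the diagram literally; the class name ReadBlockHeader is also my own.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical encoder following the OP_READ_BLOCK header diagram.
class ReadBlockHeader {
    static byte[] encode(long blockId, long genStamp, long startOffset,
                         long length, String clientId) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(blockId);      // 8 byte Block ID
        out.writeLong(genStamp);     // 8 byte genstamp
        out.writeLong(startOffset);  // 8 byte start offset
        out.writeLong(length);       // 8 byte length
        byte[] id = clientId.getBytes(StandardCharsets.UTF_8);
        out.writeInt(id.length);     // 4 byte length of the client id
        out.write(id);               // <DFSClient id>
        return buf.toByteArray();
    }
}
```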
OP_WRITE_BLOCK is the block-write operation. DataXceiver creates a BlockReceiver object to receive data from the client. The header of the client request is as follows:
+ ------------------------------------------------ +
| 8 byte Block ID | 8 byte genstamp |
+ ------------------------------------------------ +
| 4 byte num of datanodes in entire pipeline |
+ ------------------------------------------------ +
| 1 byte is recovery | 4 byte length |
+ ------------------------------------------------ +
| <DFSClient id> | 1 byte has src node |
+ ------------------------------------------------ +
| Src datanode info | 4 byte num of targets |
+ ------------------------------------------------ +
| Target datanodes |
+ ----------------------- +
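The write-block header fields above can be sketched as an encoder too. This is a deliberately reduced model: node descriptors become plain strings written with writeUTF (the real protocol serializes DatanodeInfo objects), the client id is written as a 4-byte length plus UTF-8 bytes to match the diagram, and the class name WriteBlockHeader is my own.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical encoder following the OP_WRITE_BLOCK header diagram:
// block ID, genstamp, pipeline size, recovery flag, client id,
// optional source node, and the list of remaining targets.
class WriteBlockHeader {
    static byte[] encode(long blockId, long genStamp, int pipelineSize,
                         boolean isRecovery, String clientId,
                         String srcNode, String[] targets) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(blockId);            // 8 byte Block ID
        out.writeLong(genStamp);           // 8 byte genstamp
        out.writeInt(pipelineSize);        // 4 byte num of datanodes in pipeline
        out.writeBoolean(isRecovery);      // 1 byte "is recovery" flag
        byte[] id = clientId.getBytes(StandardCharsets.UTF_8);
        out.writeInt(id.length);           // 4 byte length of the client id
        out.write(id);                     // <DFSClient id>
        out.writeBoolean(srcNode != null); // 1 byte "has src node" flag
        if (srcNode != null) {
            out.writeUTF(srcNode);         // src datanode info (simplified)
        }
        out.writeInt(targets.length);      // 4 byte num of targets
        for (String t : targets) {
            out.writeUTF(t);               // target datanodes (simplified)
        }
        return buf.toByteArray();
    }
}
```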
Writing block data is the most complicated of these operations; the original post illustrates it with a simple sequence chart. The process is genuinely intricate, and the code is not easy to follow.