Scraps from the Hadoop Source Code: The HDFS Data Communication Mechanism


I finally spent some time reading through the HDFS source code. Since there is already plenty of Hadoop source-code analysis on the Internet, I call these notes "scraps": some scattered experiences and ideas.

 

In short, HDFS is divided into three parts:
Namenode, which maintains the distribution of data across the datanodes and is also responsible for some scheduling tasks;
Datanode, where the real data is stored;
Dfsclient, a client that accesses the namenode and datanodes through the interfaces they provide.
Communication among the three is based on TCP sockets:

 

[Figure: communication paths among dfsclient, namenode, and datanode]
In the figure, a line between two components means they communicate: the end with an arrow receives requests, and the end without an arrow initiates them. Black lines mark control message paths; red lines mark data message paths.

As the figure shows, the namenode is a typical server program, always in the state of accepting requests and returning responses. The namenode never initiates a request to another component (if I remember correctly, the GFS paper describes the same design). If the namenode needs to send scheduling or control commands to a datanode, the commands must be returned to the datanode as part of the response to a heartbeat that the datanode sent to the namenode.
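As an illustration of this piggybacking, here is a minimal sketch of such a heartbeat loop. All names here (Namenode, sendHeartbeat, string-valued commands) are simplified stand-ins I invented for illustration; the real code paths are DatanodeProtocol.sendHeartbeat() and the DataNode.offerService() loop, which traffic in DatanodeCommand objects.

import java.util.List;

// Hypothetical, simplified namenode interface invented for illustration.
interface Namenode {
  // The heartbeat RPC: the datanode calls in, and any pending control
  // commands come back as the response.
  List<String> sendHeartbeat(String datanodeId);
}

class HeartbeatLoop implements Runnable {
  private final Namenode namenode;
  private final String datanodeId;
  private volatile boolean shouldRun = true;

  HeartbeatLoop(Namenode namenode, String datanodeId) {
    this.namenode = namenode;
    this.datanodeId = datanodeId;
  }

  @Override
  public void run() {
    while (shouldRun) {
      // The datanode always initiates; the namenode only responds.
      List<String> commands = namenode.sendHeartbeat(datanodeId);
      for (String cmd : commands) {
        // e.g. "replicate block X to node Y", "delete block Z"
        System.out.println(datanodeId + " executing: " + cmd);
      }
      try {
        Thread.sleep(3000); // heartbeat interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}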
The datanode is much busier. Not only must it send heartbeats to the namenode periodically, whose responses often carry a pile of control messages to handle, it must also accept read and write requests for data, plus some control requests, from dfsclients. Finally, data messages and control messages also flow between datanodes.

 

What is interesting in HDFS is that control messages and data messages are transmitted by different modules. That difference is the focus of what follows.

 

All control messages in HDFS are transmitted by its self-implemented RPC module, whose implementation I introduced in a previous blog post. Here is a brief review:
The RPC module creates one socket for each pair of nodes that communicate, and all control messages between the two nodes travel through that single socket. On the client end of the RPC, two threads take part. One is the thread that makes the RPC call: it writes the request message to the socket and then blocks in wait(). The other is the RPC module's reader thread: it reads the response message from the socket and then calls notify() to wake up the blocked thread.
As I mentioned in that previous post, this mechanism is not suitable for transmitting large volumes of data, because the two nodes communicate over only one socket, and a single socket cannot necessarily drive the network to high throughput.
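To make the mechanism concrete, here is a minimal sketch of that two-thread pattern, written against an invented toy wire format of (int id, UTF string) frames. The class and method names (SimpleRpcClient, call, readLoop) are assumptions of mine for illustration; the real Hadoop RPC client keeps similar per-call bookkeeping in its Call objects but is far more elaborate.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// One instance per remote node: all calls share one socket.
class SimpleRpcClient {
  private static class Call {
    String result;
    boolean done;
  }

  private final DataOutputStream out;
  private final DataInputStream in;
  private final Map<Integer, Call> pending = new ConcurrentHashMap<>();
  private final AtomicInteger nextId = new AtomicInteger();

  SimpleRpcClient(Socket socket) throws IOException {
    out = new DataOutputStream(socket.getOutputStream());
    in = new DataInputStream(socket.getInputStream());
    // A single reader thread drains every response on this socket.
    Thread reader = new Thread(this::readLoop, "rpc-reader");
    reader.setDaemon(true);
    reader.start();
  }

  // Called by any number of application threads.
  String call(String request) throws IOException, InterruptedException {
    Call call = new Call();
    int id = nextId.getAndIncrement();
    pending.put(id, call);
    synchronized (out) {       // serialize writers on the one shared socket
      out.writeInt(id);
      out.writeUTF(request);
      out.flush();
    }
    synchronized (call) {
      while (!call.done) {
        call.wait();           // the calling thread blocks here
      }
    }
    return call.result;
  }

  private void readLoop() {
    try {
      while (true) {
        int id = in.readInt(); // match the response to its call
        String response = in.readUTF();
        Call call = pending.remove(id);
        if (call == null) continue;
        synchronized (call) {
          call.result = response;
          call.done = true;
          call.notify();       // wake the blocked caller
        }
      }
    } catch (IOException e) {
      // connection closed; real code would fail all pending calls
    }
  }
}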

 

Indeed, HDFS does not use the RPC mechanism to transmit data messages. When a dfsclient reads or writes file data stored on a datanode, a different mechanism is used, briefly described as follows:

When each datanode starts, it creates a thread, DataXceiverServer, to handle block data reads and writes. What DataXceiverServer does is simple: whenever a connection comes in, it creates a new DataXceiver to process it:

public void run() {
  while (datanode.shouldRun) {
    try {
      Socket s = ss.accept();
      s.setTcpNoDelay(true);
      new Daemon(datanode.threadGroup,
          new DataXceiver(s, datanode, this)).start();
    } catch (SocketTimeoutException ignored) {
      // wake up to see if should continue to run
    } catch (IOException ie) {
      // ............
    } catch (Throwable te) {
      // ............
    }
  }
  try {
    ss.close();
  } catch (IOException ie) {
    // .......
  }
}

DataXceiver is also a thread; each one serves a single connection and mainly handles four types of operation:
opReadBlock: read a block;
opWriteBlock: write a block to disk;
opCopyBlock: read a block and send it to a specified destination;
opReplaceBlock: replace a block.
class DataXceiver extends DataTransferProtocol.Receiver
    implements Runnable, FSConstants {
  // ................
  /**
   * Read/write data from/to the DataXceiverServer.
   */
  public void run() {
    updateCurrentThreadName("Waiting for operation");
    DataInputStream in = null;
    try {
      in = new DataInputStream(
          new BufferedInputStream(NetUtils.getInputStream(s),
              SMALL_BUFFER_SIZE));
      final DataTransferProtocol.Op op = readOp(in);
      // make sure the xceiver count is not exceeded
      // ....
      processOp(op, in);
    } catch (Throwable t) {
      LOG.error(datanode.dnRegistration + ":DataXceiver", t);
    } finally {
      // .....
    }
  }

  /** Process op by the corresponding method. */
  protected final void processOp(Op op, DataInputStream in
      ) throws IOException {
    switch (op) {
    case READ_BLOCK:
      opReadBlock(in);
      break;
    case WRITE_BLOCK:
      opWriteBlock(in);
      break;
    case REPLACE_BLOCK:
      opReplaceBlock(in);
      break;
    case COPY_BLOCK:
      opCopyBlock(in);
      break;
    case BLOCK_CHECKSUM:
      opBlockChecksum(in);
      break;
    default:
      throw new IOException("Unknown op " + op + " in data stream");
    }
  }
}

Therefore, when HDFS transmits data, it creates one thread per connection to do the work. If data transmission between two nodes is frequent, multiple connections can be opened between them, and throughput goes up accordingly.

 

For anyone familiar with network server architecture: HDFS uses the one-thread-per-request model. It does not adopt the epoll-based event-driven architecture that is currently popular; it does not even use a thread pool, but rather this rather crude, old-school model. As we all know, the obvious defect of one-thread-per-request is that if concurrency gets too high, a large number of threads is spawned, and the context-switch overhead among them becomes excessive.
Personally, I think HDFS adopted this model partly because it is simpler to program, and partly because the developers judged that a system like HDFS is unlikely to see highly concurrent access. Only two kinds of components need to talk to a datanode over data connections: dfsclients and other datanodes.
First, data messages flow between datanodes in only one case: when a dfsclient writes a block, the datanode that receives the write must forward the data on to the other datanodes holding replicas of that block. This forms a chain (sketched in code below):
Dfsclient --> datanode A --> datanode B --> datanode C
The connections between datanodes therefore correspond one-to-one to the connections between dfsclients and datanodes.
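Here is a toy sketch of that chained write. PipelineNode and writeBlock are invented names; the sketch captures only the forwarding shape of the chain, not the real HDFS packet protocol with its checksums and backwards-flowing acks.

import java.util.List;

class PipelineNode {
  private final String name;

  PipelineNode(String name) {
    this.name = name;
  }

  // Store the data locally, then forward it to the next node in the chain;
  // in real HDFS this happens packet by packet, with acks flowing back.
  void writeBlock(byte[] data, List<PipelineNode> downstream) {
    System.out.println(name + ": storing " + data.length + " bytes");
    if (!downstream.isEmpty()) {
      downstream.get(0).writeBlock(
          data, downstream.subList(1, downstream.size()));
    }
  }

  public static void main(String[] args) {
    // Dfsclient --> A --> B --> C
    new PipelineNode("datanode A").writeBlock(new byte[64 * 1024],
        List.of(new PipelineNode("datanode B"),
                new PipelineNode("datanode C")));
  }
}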
Second, a dfsclient is not like the clients a Web server faces. A Web server's clients are end-user browsers, of which there may be thousands or more, and their number is uncontrollable. A dfsclient is a client inside the system, and its numbers stay small (much like database connections, whose count is controlled by developers). Since there are never many dfsclients, there are never many connections from dfsclients, and consequently not many connections between datanodes either.
Putting these two points together, the HDFS cluster as a whole can hardly generate high concurrency, so the one-thread-per-request architecture is good enough.

 

-- End --

 
