Hadoop in-depth research: (iii)--HDFS data flow

The following subsections build on one another; read together, they give a complete picture of how data flows through HDFS. Please credit the source when reprinting: http://blog.csdn.net/lastsweetop/article/details/9065667
1. Topological distances

This section briefly explains how Hadoop measures distance in its network topology. In a large cluster, bandwidth is a scarce resource, and making full use of it would ideally require knowing the real transfer cost between every pair of nodes; there are far too many factors to model that precisely, so Hadoop settles on a simple approximation:
Compute the distance between two nodes and prefer the closest node for an operation. If you are familiar with data structures, you will recognize this as a distance measure on a tree: the network is modeled as a tree, and the distance between two nodes is found by walking up to their closest common ancestor. In the typical layout, the levels of the tree are the data center (here D1, D2), the rack (here R1, R2, R3), and the server node (here N1, N2, N3, N4):

1. distance(/d1/r1/n1, /d1/r1/n1) = 0 (same node)
2. distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
3. distance(/d1/r1/n1, /d1/r2/n3) = 4 (different racks in the same data center)
4. distance(/d1/r1/n1, /d2/r3/n4) = 6 (different data centers)
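To make the rule concrete, here is a minimal sketch that computes this distance from path strings such as "/d1/r1/n1". It is not Hadoop's actual NetworkTopology implementation; the path format and the class name are assumptions made up for illustration.

```java
/**
 * Minimal illustration of tree distance between two nodes identified by
 * paths like "/d1/r1/n1". Each hop up to the closest common ancestor adds 1
 * on each side, which reproduces the 0/2/4/6 values above.
 * This is NOT Hadoop's NetworkTopology class, just a sketch of the idea.
 */
public class TopologyDistance {

    public static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        // Length of the common prefix = depth of the closest common ancestor.
        int common = 0;
        int minLen = Math.min(pa.length, pb.length);
        while (common < minLen && pa[common].equals(pb[common])) {
            common++;
        }
        // Distance = hops from a up to the ancestor + hops down from the ancestor to b.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0 (same node)
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2 (same rack)
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4 (same data center)
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6 (different data centers)
    }
}
```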
2. Replica placement

First, a definition: the process by which the NameNode chooses DataNodes to store the replicas of a block is called replica placement, and the placement strategy is essentially a trade-off between reliability and read/write bandwidth. Consider the two extremes:

1. Keep all replicas on the same node. Write bandwidth is excellent, but reliability is essentially zero: once that node dies, all copies of the data are gone, and cross-rack read bandwidth is also very low.
2. Scatter all replicas across different nodes. Reliability improves, but write bandwidth becomes the problem.

Even within a single data center there are many possible placement schemes. Release 0.17.0 introduced a reasonably balanced default, and from the 1.x releases onward the placement policy is pluggable. Hadoop's default scheme is:

1. Place the first replica on the same node as the client; if the client is not inside the cluster, pick a node to hold it.
2. Place the second replica on a randomly chosen node on a different rack from the first.
3. Place the third replica on a different, randomly chosen node on the same rack as the second.
4. Place any remaining replicas on randomly chosen nodes.

With a replication factor of 3, this produces a network topology like the one shown in the figure.
It is easy to see why this scheme is reasonable:

1. Reliability: the block is stored on two different racks.
2. Write bandwidth: writes only traverse a single network switch (one cross-rack hop).
3. Read performance: reads can choose between two racks.
4. Block distribution: blocks are spread across the cluster.
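To make the default rules more concrete, here is a rough sketch of them for a replication factor of 3. It is not the actual HDFS placement code (that lives in BlockPlacementPolicyDefault, which also weighs node load, free space, decommissioning status, and so on); the class, the rack map, and the node names are assumptions made up for the example.

```java
import java.util.*;

/**
 * Simplified sketch of the default placement rules described above, for a
 * replication factor of 3. Illustration only: assumes at least two racks and
 * at least two nodes per rack, and ignores load, space, and health checks.
 */
public class ReplicaPlacementSketch {

    private final Map<String, List<String>> nodesByRack; // hypothetical rack -> nodes map
    private final Random random = new Random();

    public ReplicaPlacementSketch(Map<String, List<String>> nodesByRack) {
        this.nodesByRack = nodesByRack;
    }

    /** clientNode and clientRack are null when the client runs outside the cluster. */
    public List<String> chooseTargets(String clientNode, String clientRack) {
        // 1. First replica: on the client's node if it is in the cluster,
        //    otherwise on a node picked from a random rack.
        String firstRack = (clientRack != null) ? clientRack : randomRack(null);
        String first = (clientNode != null) ? clientNode : randomNode(firstRack);

        // 2. Second replica: a random node on a different rack from the first.
        String secondRack = randomRack(firstRack);
        String second = randomNode(secondRack);

        // 3. Third replica: a different node on the same rack as the second.
        String third;
        do {
            third = randomNode(secondRack);
        } while (third.equals(second));

        return Arrays.asList(first, second, third);
    }

    private String randomRack(String excludedRack) {
        List<String> racks = new ArrayList<>(nodesByRack.keySet());
        racks.remove(excludedRack);
        return racks.get(random.nextInt(racks.size()));
    }

    private String randomNode(String rack) {
        List<String> nodes = nodesByRack.get(rack);
        return nodes.get(random.nextInt(nodes.size()));
    }
}
```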
3. Anatomy of a file read

[Figure: the data flow when a client reads a file]
1. The client calls the open method on a FileSystem object, which in practice is a DistributedFileSystem instance.
2. DistributedFileSystem asks the NameNode, over RPC, for the locations of the first batch of blocks of the file. For each block the NameNode returns the addresses of all DataNodes holding a replica (as many locations as the replication factor), sorted by the Hadoop topology so that DataNodes close to the client come first.
3. These two steps return an FSDataInputStream object, which wraps a DFSInputStream; DFSInputStream manages the traffic to the DataNodes and the NameNode. When the client calls read, DFSInputStream connects to the closest DataNode holding the first block (see section 1).
4. Data is streamed from that DataNode back to the client.
5. When the end of the first block is reached, DFSInputStream closes the connection to that DataNode and opens a connection for the next block. All of this is transparent to the client, which simply sees a continuous stream.
6. When the blocks returned so far have been read, DFSInputStream asks the NameNode for the locations of the next batch of blocks and keeps going; once the whole file has been read, it closes all of the streams.
If DFSInputStream hits a communication error while reading from a DataNode, it tries the next-closest DataNode holding that block and remembers the failed DataNode so that the remaining blocks skip it. DFSInputStream also verifies the checksum of the block data; if it finds a corrupt block, it reports it to the NameNode and then reads a replica of that block from another DataNode.
The point of this design is that clients fetch data directly from the DataNodes, while the NameNode only tells each client the best DataNode for each block. Because the NameNode handles nothing but block-location requests, which it answers entirely from in-memory metadata, and the data itself flows through the DataNode cluster, HDFS can sustain concurrent access from a large number of clients.
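From the client's point of view, all of this machinery sits behind the ordinary FileSystem API. A minimal read sketch follows; the NameNode URI and the file path are placeholders made up for the example.

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URI and path for illustration only.
        String uri = "hdfs://namenode:8020/user/example/input.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // open() returns an FSDataInputStream (wrapping a DFSInputStream),
            // which handles block locations and DataNode connections internally.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```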
4. Anatomy of a file write

[Figure: the data flow when a client writes a file]
1. The client creates a new file by calling create on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the NameNode to create the new file, which has no blocks associated with it yet. Before creating it, the NameNode performs various checks, for example that the file does not already exist and that the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise an IO exception is thrown.
3. As with reading, these two steps return an FSDataOutputStream object, which wraps a DFSOutputStream; DFSOutputStream coordinates with the NameNode and the DataNodes. As the client writes data to it, DFSOutputStream splits the data into small packets and appends them to an internal data queue.
4. The DataStreamer consumes the data queue. It first asks the NameNode which DataNodes are the most suitable to store the new block (see section 2); with a replication factor of 3 it gets three suitable DataNodes and arranges them into a pipeline. The DataStreamer sends each packet to the first DataNode in the pipeline, the first DataNode forwards it to the second, and so on.
5. DFSOutputStream also maintains an ack queue of packets waiting to be acknowledged by the DataNodes; a packet is removed from the ack queue only when every DataNode in the pipeline has acknowledged it. If a DataNode fails while data is being written, the following steps are taken, all invisible to the client: 1) the pipeline is closed; 2) to avoid losing packets, the packets in the ack queue are moved back to the data queue; 3) the current, incomplete block on the failed DataNode is deleted; 4) the rest of the block is written to the two remaining healthy DataNodes; 5) the NameNode arranges for another DataNode to hold a further replica of the block.
6. When the client has finished writing data, it calls close on the stream.
7. The DataStreamer flushes the remaining packets into the pipeline and waits for their acks; after the last ack is received, it notifies the NameNode that the file is complete.
One more thing to note: after the client writes data, blocks that have been fully written are visible to other readers, but the block currently being written is not. Only by calling the sync method (superseded by hflush/hsync in later releases) can the client guarantee that the data written so far has reached the DataNodes; close calls sync implicitly. Whether you also need to call it manually is a trade-off between data robustness and throughput, and depends on what your application needs.
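Again, the client sees none of this pipeline machinery directly; it just writes through the FileSystem API. A minimal write sketch follows, with placeholder URI and path; hflush is the successor of the sync method mentioned above.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URI and path for illustration only.
        String uri = "hdfs://namenode:8020/user/example/output.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // create() returns an FSDataOutputStream (wrapping a DFSOutputStream),
        // which splits the data into packets and pushes them down the DataNode pipeline.
        FSDataOutputStream out = fs.create(new Path(uri));
        try {
            out.write("hello hdfs".getBytes("UTF-8"));
            // Force the data written so far out to the DataNodes so new readers
            // can see it (the old sync(); hsync() additionally forces it to disk).
            out.hflush();
        } finally {
            // close() flushes the remaining packets, waits for the acks, and
            // tells the NameNode the file is complete.
            out.close();
        }
    }
}
```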

Thanks to Tom White: most of this article comes from his Hadoop: The Definitive Guide. The Chinese translation of the book is rather poor, so this is based on the English original and some official documentation, with some of my own understanding added. These are essentially reading notes.