HDFS replica placement policy and rack awareness

Source: Internet
Author: User

Replica placement policy

The basic idea of the replica placement policy is:
The first replica of a block is placed on the node where the client is running. (If the client is outside the cluster, the first node is chosen at random, although the system tries to avoid nodes that are too full or too busy.)
The second replica is placed on a randomly selected node in a different rack from the first.
The third replica is placed in the same rack as the second, on a different, randomly selected node.
Any additional replicas are placed on randomly selected nodes in the cluster.
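As a rough illustration, the rules above can be sketched as a small selection function. This is a simplified, hypothetical model, not the actual NameNode placement code, and it assumes the cluster has at least two racks:

```python
import random

def choose_replica_nodes(nodes_by_rack, client_node=None, replication=3):
    """Sketch of the default replica placement rules (simplified).

    nodes_by_rack: dict mapping rack id -> list of node names.
    Assumes at least two racks and at least two nodes per rack.
    """
    all_nodes = [(rack, n) for rack, ns in nodes_by_rack.items() for n in ns]

    # Replica 1: the client's node if it is in the cluster, else a random node.
    first = None
    if client_node is not None:
        first = next(((r, n) for r, n in all_nodes if n == client_node), None)
    if first is None:
        first = random.choice(all_nodes)
    chosen = [first]

    # Replica 2: a random node on a different rack than the first.
    remote = [(r, n) for r, n in all_nodes if r != first[0]]
    second = random.choice(remote)
    chosen.append(second)

    # Replica 3: a different node on the same rack as replica 2.
    same_rack = [(r, n) for r, n in all_nodes
                 if r == second[0] and (r, n) != second]
    chosen.append(random.choice(same_rack))

    # Any further replicas: random nodes anywhere in the cluster.
    remaining = [p for p in all_nodes if p not in chosen]
    while len(chosen) < replication and remaining:
        pick = random.choice(remaining)
        chosen.append(pick)
        remaining.remove(pick)

    return [n for _, n in chosen]
```

The node and rack names are placeholders; the point is only the rack-level constraints on the first three replicas.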

Hadoop's replica placement policy balances reliability (replicas on different racks) against bandwidth (the write pipeline only needs to cross a single inter-rack network link). With a replication factor of 3, this is how the three DataNodes in a write pipeline are distributed.

Pipeline replication

When a client writes data to an HDFS file, it first writes the data to a local temporary file.
Assume the replication factor of the file is 3. When the local temporary file accumulates a full data block, the client obtains a list of DataNodes from the NameNode to store the replicas. The client then starts streaming to the first DataNode, which receives the data in small packets (4 KB), writes each packet to its local store, and simultaneously forwards it to the second DataNode in the list. The second DataNode behaves the same way: it receives each small packet, writes it locally, and forwards it to the third DataNode at the same time. Finally, the third DataNode receives the data and stores it locally. A DataNode can therefore receive data from the previous node and forward it to the next node simultaneously; the data is replicated down a pipeline from one DataNode to the next.
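The pipelined write described above can be sketched as follows. This is an illustrative simulation, not the real HDFS data-transfer protocol; packet handling here is sequential for clarity, whereas real DataNodes receive and forward concurrently:

```python
def pipeline_write(block, datanodes, packet_size=4096):
    """Simulate pipelined replication of one block.

    `datanodes` is a list of dicts acting as local stores. The client sends
    the block in packet_size chunks; each node appends the packet to its own
    store and passes it on to the next node in the pipeline.
    """
    for offset in range(0, len(block), packet_size):
        packet = block[offset:offset + packet_size]
        # The packet flows down the pipeline: DN1 -> DN2 -> DN3 ...
        for dn in datanodes:
            dn.setdefault("data", bytearray()).extend(packet)
    return [bytes(dn["data"]) for dn in datanodes]
```

After the write completes, every DataNode in the pipeline holds an identical copy of the block.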

Rack awareness

Large Hadoop clusters are organized into racks. Network conditions between nodes in the same rack are better than between nodes in different racks. The NameNode therefore tries to save block replicas on different racks to improve fault tolerance.

Network Topology

With rack awareness, the NameNode can map out the DataNode network topology. D1 and R1 are switches (for example, D1 a data-center core switch and R1 a rack switch), and the leaves are DataNodes.
Then the rackid of H1 is /D1/R1/H1: the parent of H1 is R1, and the parent of R1 is D1. The rackid mapping is supplied through the topology.script.file.name configuration option. With rackid information, the distance between any two DataNodes can be calculated.

distance(/D1/R1/H1, /D1/R1/H1) = 0  (same DataNode)
distance(/D1/R1/H1, /D1/R1/H2) = 2  (different DataNodes in the same rack)
distance(/D1/R1/H1, /D1/R2/H4) = 4  (DataNodes in different racks in the same IDC)
distance(/D1/R1/H1, /D2/R3/H7) = 6  (DataNodes in different IDCs)
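These distance values follow from counting hops from each node up to the lowest common ancestor of the two topology paths. A minimal sketch:

```python
def distance(path_a, path_b):
    """Distance between two nodes given their topology paths, e.g. '/D1/R1/H1'.

    Each hop up to the lowest common ancestor counts as 1, so the distance is
    (depth_a - common_depth) + (depth_b - common_depth).
    """
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)
```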

Note:
1) When no rack information is configured, Hadoop places all machines in a single default rack named "/default-rack". In this case, any two DataNodes are considered to be in the same rack, regardless of whether they physically are.
2) Once topology.script.file.name is configured, Hadoop resolves each DataNode's position in the network topology by calling the configured program. The value of this option is the path to an executable, usually a script, that maps DataNode addresses to rack IDs.
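A topology script of the kind referenced by topology.script.file.name receives one or more DataNode addresses as command-line arguments and prints one rack path per address on standard output. The script below is a made-up example; the IP-to-rack mapping is entirely hypothetical:

```python
#!/usr/bin/env python3
"""Hypothetical topology script for topology.script.file.name.

Hadoop invokes the script with DataNode IPs or hostnames as arguments
and expects one rack path per argument on stdout.
"""
import sys

# Example mapping only; a real deployment would derive this from
# its own addressing scheme or a configuration file.
RACK_MAP = {
    "10.0.1.11": "/D1/R1",
    "10.0.1.12": "/D1/R1",
    "10.0.2.21": "/D1/R2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(host):
    """Return the rack path for a host, falling back to the default rack."""
    return RACK_MAP.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(rack_for(arg))
```

Unknown hosts fall back to "/default-rack", matching Hadoop's behavior when no rack information is available.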

