HDFS copy placement policy and rack awareness

Last Update: 2015-06-10 Source: Internet

Author: User

Copy placement policy

The basic idea of the copy placement policy is:
The first copy of a block is placed on the node where the client is running. (If the client is outside the cluster, the first node is chosen at random, although the system tries to avoid nodes that are too full or too busy.)
The second copy is placed on a randomly chosen node in a different rack from the first node.
The third copy is placed in the same rack as the second copy, on a different, randomly chosen node.
Any further copies are placed on randomly chosen nodes in the cluster.
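
The placement rules above can be sketched in a few lines of Python. This is only an illustrative simulation, not Hadoop's actual implementation: the function name `choose_replica_nodes`, the `cluster` dictionary layout, and the host and rack names are all assumptions made for the example, and the "too full or too busy" check is left out.

```python
import random


def choose_replica_nodes(cluster, client_node=None, replication=3, rng=random):
    """Pick target nodes for one block (simplified sketch of the policy).

    cluster: dict mapping a rack name to the list of node names on that rack.
    client_node: the node the client runs on, or None if it is outside the cluster.
    """
    rack_of = {node: rack for rack, nodes in cluster.items() for node in nodes}
    all_nodes = list(rack_of)
    # A node holds at most one copy of a block, so cap the copy count.
    replication = min(replication, len(all_nodes))
    chosen = []

    def pick(candidates):
        # Prefer unused nodes among the candidates; fall back to any unused node
        # (for example, when the cluster has only one rack).
        pool = [n for n in candidates if n not in chosen]
        if not pool:
            pool = [n for n in all_nodes if n not in chosen]
        return rng.choice(pool)

    if replication >= 1:
        # First copy: the client's own node, or a random node for outside clients.
        chosen.append(client_node if client_node in rack_of else rng.choice(all_nodes))
    if replication >= 2:
        # Second copy: a random node on a different rack from the first.
        first_rack = rack_of[chosen[0]]
        chosen.append(pick(n for n in all_nodes if rack_of[n] != first_rack))
    if replication >= 3:
        # Third copy: same rack as the second copy, but a different node.
        second_rack = rack_of[chosen[1]]
        chosen.append(pick(n for n in all_nodes if rack_of[n] == second_rack))
    while len(chosen) < replication:
        # Any further copies: random unused nodes anywhere in the cluster.
        chosen.append(pick(all_nodes))
    return chosen


if __name__ == "__main__":
    demo_cluster = {"/rack1": ["h1", "h2", "h3"], "/rack2": ["h4", "h5", "h6"]}
    print(choose_replica_nodes(demo_cluster, client_node="h1"))
```

With a client on `h1`, the first entry of the result is always `h1`, and the other two entries are two distinct nodes from `/rack2`.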

Hadoop's copy placement policy balances reliability (blocks are stored in different racks) against bandwidth (a write pipeline only needs to cross one network switch).
[Figure: distribution of the three datanodes in a pipeline when the replication factor is 3.]

Pipeline replication

When a client writes data to an HDFS file, it first writes the data to a local temporary file.
Assume that the file's replication factor is set to 3. When the local temporary file has accumulated a full data block, the client obtains from the Namenode a list of Datanodes that will store the copies. The client then starts transmitting data to the first Datanode. The first Datanode receives the data in small portions (4 KB), writes each portion to its local storage, and at the same time forwards that portion to the second Datanode in the list. The second Datanode does the same: it receives the data portion by portion, writes each portion to its local storage, and forwards it to the third Datanode. Finally, the third Datanode receives the data and stores it locally. In this way, a Datanode can receive data from the previous node in the pipeline while forwarding it to the next one, so the data is copied from one Datanode to the next in a pipeline.
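
The forwarding chain described above can be modelled with a small Python sketch. It is a single-threaded toy model, so a portion is stored and then forwarded in sequence rather than truly in parallel, and the `DataNode` class, `build_pipeline`, and `write_block` are hypothetical names invented for this illustration, not HDFS classes.

```python
class DataNode:
    """Toy datanode: stores each portion locally, then forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.downstream = None      # next datanode in the pipeline, if any
        self.stored = bytearray()   # stands in for local storage

    def receive(self, portion):
        self.stored.extend(portion)           # write the portion locally
        if self.downstream is not None:
            self.downstream.receive(portion)  # pass the same portion along


def build_pipeline(names):
    """Create datanodes in list order and link each one to its successor."""
    nodes = [DataNode(name) for name in names]
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.downstream = downstream
    return nodes


def write_block(block, nodes, portion_size=4096):
    """Client side: send the block to the first datanode in 4 KB portions."""
    head = nodes[0]
    for offset in range(0, len(block), portion_size):
        head.receive(block[offset:offset + portion_size])


if __name__ == "__main__":
    pipeline = build_pipeline(["dn1", "dn2", "dn3"])
    write_block(b"x" * 10000, pipeline)
    print([len(node.stored) for node in pipeline])
```

The client only ever talks to the first datanode; the second and third copies are produced by the datanodes forwarding portions to each other.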

Rack awareness

Large Hadoop clusters are organized into racks. Network conditions between nodes on the same rack are better than those between nodes on different racks. In addition, the NameNode tries to store block copies on different racks to improve fault tolerance.

Network Topology

With rack awareness, the NameNode can map out the datanode network topology. In this topology, D1 and R1 are both switches, and the bottom layer consists of datanodes.
For example, the rackid of H1 is /D1/R1/H1; the parent of H1 is R1, and the parent of R1 is D1. This rack ID information can be configured through topology.script.file.name. With the rackid information, the distance between two datanodes can be calculated.

Distance(/D1/R1/H1, /D1/R1/H1) = 0 (the same datanode)
Distance(/D1/R1/H1, /D1/R1/H2) = 2 (different datanodes on the same rack)
Distance(/D1/R1/H1, /D1/R2/H4) = 4 (datanodes on different racks in the same IDC)
Distance(/D1/R1/H1, /D2/R3/H7) = 6 (datanodes in different IDCs)
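
One way to compute these values is to count the hops from each node up to their closest common ancestor in the tree. The sketch below does this on the path strings directly; the function name `distance` is chosen for the example, and the path /D1/R2/H4 assumes that H4 sits on a second rack, R2, in the same data center D1.

```python
def distance(path_a, path_b):
    """Hops from each node up to the closest common ancestor, added together."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Count how many leading path components the two nodes share.
    common = 0
    for part_a, part_b in zip(a, b):
        if part_a != part_b:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    print(distance("/D1/R1/H1", "/D1/R1/H1"))  # 0: same datanode
    print(distance("/D1/R1/H1", "/D1/R1/H2"))  # 2: same rack
    print(distance("/D1/R1/H1", "/D1/R2/H4"))  # 4: same IDC, different rack
    print(distance("/D1/R1/H1", "/D2/R3/H7"))  # 6: different IDCs
```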

Note:
1) When no rack information is configured, Hadoop places all machines in the same default rack, named "/default-rack". In this case, every datanode is considered to be on the same rack, regardless of whether the machines physically share a rack.
2) Once topology.script.file.name is configured, Hadoop locates datanodes according to the network topology. The value of topology.script.file.name specifies an executable program, usually a script.
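
As a rough illustration of such a script, the sketch below assumes the usual contract: Hadoop passes one or more datanode IP addresses or host names as command-line arguments and reads one rack path per argument from standard output. The addresses and rack paths in `RACK_BY_HOST` are hypothetical placeholders, and `resolve` is a helper name made up for the example. In Hadoop 2.x and later the property is named net.topology.script.file.name, with the older name kept as a deprecated alias.

```python
#!/usr/bin/env python3
import sys

# Rack used for any host that is missing from the table below.
DEFAULT_RACK = "/default-rack"

# Hypothetical mapping from datanode address to rack path; a real cluster
# would fill this in by hand or load it from a file.
RACK_BY_HOST = {
    "192.168.1.11": "/D1/R1",
    "192.168.1.12": "/D1/R1",
    "192.168.2.21": "/D1/R2",
    "192.168.9.31": "/D2/R3",
}


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per argument, one per line.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

The script has to be executable by the user running the NameNode, and unknown hosts fall back to the default rack so that lookups never come back empty.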

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
