HDFS replica placement policy and rack awareness
Replica placement policy
The basic idea of the replica placement policy is:
The first replica is placed on the node where the client is running. (If the client is outside the cluster, the first node is chosen at random, although the system tries to avoid nodes that are too full or too busy.)
The second replica is placed on a randomly chosen node in a different rack from the first.
The third replica is placed in the same rack as the second, on a different, randomly chosen node.
Any further replicas are placed on random nodes in the cluster.
Hadoop's replica placement policy balances reliability (blocks are stored in different racks) against bandwidth (a write pipeline only needs to cross one network switch).
[Figure: distribution of the three DataNodes in a pipeline when the replication factor is 3.]
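The placement rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual HDFS implementation: the function name choose_replica_targets and the node-to-rack dictionary are hypothetical, and the sketch ignores the checks for nodes that are too full or too busy.

```python
import random


def choose_replica_targets(nodes, replication=3, client=None, rng=random):
    """Pick DataNodes for one block, following the rules described above.

    `nodes` maps a node name to its rack name (hypothetical input format).
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the node count")
    chosen = []

    def pick(candidates):
        # Choose a random node that has not been used yet, if there is one.
        candidates = [n for n in candidates if n not in chosen]
        node = rng.choice(candidates) if candidates else None
        if node is not None:
            chosen.append(node)
        return node

    # Replica 1: the client's own node if it is in the cluster, else random.
    if client in nodes:
        chosen.append(client)
    else:
        pick(nodes)
    first_rack = nodes[chosen[0]]

    # Replica 2: a random node on a different rack from the first.
    if len(chosen) < replication:
        remote = [n for n in nodes if nodes[n] != first_rack]
        if pick(remote) is None:
            pick(nodes)  # single-rack cluster: fall back to any other node

    # Replica 3: a different node on the same rack as the second.
    if len(chosen) < replication:
        second_rack = nodes[chosen[1]]
        same_rack = [n for n in nodes if nodes[n] == second_rack]
        if pick(same_rack) is None:
            pick(nodes)  # no other node on that rack: fall back to any node

    # Further replicas: random nodes anywhere in the cluster.
    while len(chosen) < replication:
        pick(nodes)
    return chosen


# Hypothetical four-node, two-rack cluster; the client runs on h1.
cluster = {"h1": "/D1/R1", "h2": "/D1/R1", "h3": "/D1/R2", "h4": "/D1/R2"}
print(choose_replica_targets(cluster, client="h1"))
```

The result always starts with the client's node, followed by two nodes from the other rack, so the block survives the loss of either rack while only one transfer crosses between racks.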
Pipeline replication
When a client writes data to an HDFS file, it first writes the data to a local temporary file.
Assume the file's replication factor is 3. When the local temporary file has accumulated one block's worth of data, the client obtains from the NameNode a list of DataNodes that will hold the replicas. The client then starts transmitting data to the first DataNode. The first DataNode receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time forwards that portion to the second DataNode in the list. The second DataNode does the same: it receives the data portion by portion, writes each portion to its local repository, and forwards it to the third DataNode. Finally, the third DataNode receives the data and stores it locally. A DataNode can thus receive data from the previous node in the pipeline while forwarding it to the next one, so the data is copied from one DataNode to the next in a pipelined fashion.
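The forwarding behaviour described above can be illustrated with a toy model. The sketch below is not HDFS code: the DataNode class, build_pipeline and write_block are hypothetical names, everything runs in one process, and the forwarding is done by a plain method call rather than concurrently over the network.

```python
PACKET_SIZE = 4 * 1024  # the 4 KB portion size mentioned above


class DataNode:
    """Toy stand-in for a DataNode: stores each portion and forwards it."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()  # plays the role of the local repository

    def receive(self, packet):
        self.stored += packet            # write the portion locally
        if self.downstream is not None:  # then pass it on to the next node
            self.downstream.receive(packet)


def build_pipeline(names):
    """Chain the nodes so each forwards to the next; return them in order."""
    nodes = []
    downstream = None
    for name in reversed(names):
        downstream = DataNode(name, downstream)
        nodes.append(downstream)
    nodes.reverse()
    return nodes


def write_block(block, pipeline):
    """Client side: stream the block to the first node in 4 KB portions."""
    for offset in range(0, len(block), PACKET_SIZE):
        pipeline[0].receive(block[offset:offset + PACKET_SIZE])


# Usage: a replication factor of 3 means a pipeline of three nodes.
pipeline = build_pipeline(["dn1", "dn2", "dn3"])
write_block(b"x" * 10000, pipeline)
print([(node.name, len(node.stored)) for node in pipeline])
```

The client only ever talks to the first node; every later replica is produced by the node before it in the list.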
Rack awareness
Large Hadoop clusters are organized into racks. Network conditions between nodes on the same rack are better than those between nodes on different racks. In addition, the NameNode tries to store block replicas on different racks to improve fault tolerance.
With rack awareness, the NameNode can map out the DataNode network topology. In this example topology, D1 and R1 are both switches, and the bottom layer consists of DataNodes.
The rackid of H1 is then /D1/R1/H1: the parent of H1 is R1, and the parent of R1 is D1. This rackid information can be supplied through the topology.script.file.name configuration option. With the rackid information, the distance between any two DataNodes can be calculated.
distance(/D1/R1/H1, /D1/R1/H1) = 0 (the same DataNode)
distance(/D1/R1/H1, /D1/R1/H2) = 2 (different DataNodes on the same rack)
distance(/D1/R1/H1, /D1/R2/H4) = 4 (DataNodes on different racks in the same IDC)
distance(/D1/R1/H1, /D2/R3/H7) = 6 (DataNodes in different IDCs)
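These distances count the hops from each node up to the closest common ancestor in the topology tree. A minimal sketch of that calculation, working directly on rackid strings (the function name distance is chosen here for illustration and is not taken from Hadoop's source):

```python
def distance(path_a, path_b):
    """Hops from each node up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")

    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Each node climbs from its own level up to the shared ancestor.
    return (len(a) - common) + (len(b) - common)


print(distance("/D1/R1/H1", "/D1/R1/H1"))  # 0: same DataNode
print(distance("/D1/R1/H1", "/D1/R1/H2"))  # 2: same rack
print(distance("/D1/R1/H1", "/D1/R2/H4"))  # 4: same IDC, different rack
print(distance("/D1/R1/H1", "/D2/R3/H7"))  # 6: different IDCs
```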
1) When no rack information is configured, Hadoop places all machines in a single default rack named "/default-rack". In this case, every DataNode is treated as being on the same rack, regardless of which rack it physically belongs to.
2) Once topology.script.file.name is configured, Hadoop locates DataNodes according to the network topology.
The value of the topology.script.file.name option is an executable program, usually a script. (Newer Hadoop releases name this property net.topology.script.file.name.)
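Such a script receives one or more host names or IP addresses as command-line arguments and prints one rack path per argument. The sketch below shows what it might look like in Python. The IP addresses, the RACKS table and the resolve helper are hypothetical and only serve the illustration; a real script would typically read the mapping from a file maintained by the cluster administrator.

```python
#!/usr/bin/env python3
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical mapping from DataNode address to rack path.
RACKS = {
    "10.0.0.1": "/D1/R1",
    "10.0.0.2": "/D1/R1",
    "10.0.0.4": "/D1/R2",
    "10.0.1.7": "/D2/R3",
}


def resolve(hosts, table=RACKS):
    """Return the rack path for each host, in the order given.

    Hosts missing from the table fall back to the default rack.
    """
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

The script must be executable by the user running the NameNode, and the configuration option should point to its full path.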