Hadoop HDFS Load Balancing


Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but it is highly fault tolerant, provides high-throughput data access, and is well suited to applications with large datasets.

HDFS replica placement policy

  • The first replica is placed on the DataNode where the uploading client runs; if the upload is submitted from outside the cluster, a node that is not too full and not too busy is chosen at random.
  • The second replica is placed on a node on a different rack from the first.
  • The third replica is placed on a different node on the same rack as the second.
  • Any additional replicas are placed on randomly chosen nodes.

Note that:

  • The replica count of a file stored in HDFS is fixed when the file is uploaded; changing the system-wide replication factor later does not change the replica count of files already stored.
  • The replica count specified on the upload command takes precedence; if none is specified, the default dfs.replication value from hdfs-site.xml is used.
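The placement rules above can be sketched in Python. This is a toy model, not HDFS's actual BlockPlacementPolicyDefault; the topology layout and node names are illustrative assumptions.

```python
import random

def choose_replica_nodes(topology, writer_node=None, replication=3):
    """Toy sketch of the default HDFS replica placement policy.
    topology maps rack name -> list of DataNode names."""
    all_nodes = [n for nodes in topology.values() for n in nodes]
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}

    # 1st replica: the writer's own DataNode, or a random node when the
    # client is outside the cluster.
    first = writer_node if writer_node in all_nodes else random.choice(all_nodes)
    chosen = [first]

    # 2nd replica: a node on a different rack from the first.
    remote = [n for n in all_nodes if rack_of[n] != rack_of[first]]
    if replication >= 2 and remote:
        chosen.append(random.choice(remote))

    # 3rd replica: a different node on the same rack as the second.
    if replication >= 3 and len(chosen) == 2:
        same_rack = [n for n in topology[rack_of[chosen[1]]] if n not in chosen]
        if same_rack:
            chosen.append(random.choice(same_rack))

    # Any further replicas: random nodes not yet chosen.
    while len(chosen) < replication:
        rest = [n for n in all_nodes if n not in chosen]
        if not rest:
            break
        chosen.append(random.choice(rest))
    return chosen
```

For example, with two racks of two nodes each and the writer on dn1, the first replica lands on dn1, and the second and third land on the other rack.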
HDFS Load Balancing

Hadoop HDFS clusters easily develop unbalanced disk utilization across machines, for example when nodes are added to or removed from the cluster, or when a node's disks fill up. When data is unbalanced, Map tasks may be scheduled on machines that do not hold the data they need, which wastes network bandwidth and defeats data-local computation.
When the HDFS load is unbalanced, the data distribution across the DataNodes must be rebalanced so that data is spread evenly, I/O load is balanced, and hot spots are avoided. The balancing process must satisfy the following principles:

  • Balancing must not reduce the number of data blocks or lose any block replicas.
  • An administrator must be able to abort the balancing process.
  • The amount of data and the network resources consumed by each move must be controllable.
  • The balancing process must not interfere with the normal operation of the NameNode.
Principles of Hadoop HDFS data load balancing

The core of the balancing process is an algorithm that repeatedly applies the balancing logic until the data in the cluster is balanced. Each iteration proceeds as follows:

  1. The Rebalancing Server asks the NameNode to generate a DataNode data-distribution report, giving the disk usage of each DataNode.
  2. The Rebalancing Server works out which data must move and computes a concrete block-migration plan that keeps network paths as short as possible.
  3. The block-migration task starts: the Rebalancing Server tells the Proxy Source DataNode which blocks to copy.
  4. The Proxy Source DataNode copies each block to the target DataNode.
  5. The original block is deleted from the source DataNode.
  6. The target DataNode confirms to the Proxy Source DataNode that the block migration is complete.
  7. The Proxy Source DataNode confirms completion to the Rebalancing Server. The process then repeats until the cluster meets the balancing threshold.
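The iterate-until-balanced loop above can be simulated with a small sketch. The numbers, units (MB), and the one-block-per-iteration simplification are assumptions; the real Balancer moves blocks over the network and throttles bandwidth.

```python
def rebalance(used, capacity, threshold=10.0, max_iters=100):
    """Toy simulation of the Balancer loop: repeatedly move one 128 MB
    block from the most over-utilized node to the most under-utilized
    node until every node is within the threshold of the cluster mean."""
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    for _ in range(max_iters):
        pct = {n: 100.0 * used[n] / capacity[n] for n in used}
        over = [n for n in pct if pct[n] > cluster_pct + threshold]
        under = [n for n in pct if pct[n] < cluster_pct - threshold]
        if not over or not under:
            break  # cluster is balanced (or no valid source/target pair)
        src = max(over, key=lambda n: pct[n])
        dst = min(under, key=lambda n: pct[n])
        # Move one "block" per iteration (128 MB, the common HDFS default).
        block = min(used[src], 128)
        used[src] -= block
        used[dst] += block
    return used
```

With two 1000 MB nodes holding 900 MB and 100 MB, the cluster mean is 50%, and three moves bring both nodes inside the 40-60% band.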



DataNode Group
In step 1, HDFS divides the DataNodes into four groups based on the threshold. When blocks are moved, blocks on nodes in the Over and Above groups move to nodes in the Below and Under groups. The four groups are defined as follows:

  • Over group: every DataNode in this group satisfies

DataNode_usedSpace_percent > Cluster_usedSpace_percent + threshold

  • Above group: every DataNode in this group satisfies

Cluster_usedSpace_percent + threshold > DataNode_usedSpace_percent > Cluster_usedSpace_percent

  • Below group: every DataNode in this group satisfies

Cluster_usedSpace_percent > DataNode_usedSpace_percent > Cluster_usedSpace_percent - threshold

  • Under group: every DataNode in this group satisfies

Cluster_usedSpace_percent - threshold > DataNode_usedSpace_percent
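The four inequalities map directly onto a classification function. This is a minimal sketch; the group names and dictionary layout are my own.

```python
def classify(datanode_pct, cluster_pct, threshold):
    """Assign each DataNode to one of the four groups defined above,
    using the same strict inequalities."""
    groups = {"over": [], "above": [], "below": [], "under": []}
    for node, pct in datanode_pct.items():
        if pct > cluster_pct + threshold:
            groups["over"].append(node)       # far above the cluster mean
        elif pct > cluster_pct:
            groups["above"].append(node)      # above mean, within threshold
        elif pct > cluster_pct - threshold:
            groups["below"].append(node)      # below mean, within threshold
        else:
            groups["under"].append(node)      # far below the cluster mean
    return groups
```

For a cluster mean of 50% and a threshold of 10, nodes at 65%, 55%, 45%, and 30% land in the Over, Above, Below, and Under groups respectively.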

How to use the Hadoop HDFS automatic data balancing script

Hadoop ships with a start-balancer.sh script that starts the HDFS data balancing service. The tool can be run at any time without restarting the machines or the Hadoop services. The script lives in the $HADOOP_HOME/bin directory, and the start command is:

$HADOOP_HOME/bin/start-balancer.sh -threshold <value>

Several parameters that affect Balancer:

  • -threshold
    • Default value: 10. Value range: 0-100.
    • Description: the threshold used to decide whether the cluster is balanced. In theory, the smaller the value, the more evenly balanced the cluster becomes, but the longer balancing takes.
  • dfs.balance.bandwidthPerSec
    • Default value: 1048576 (1 MB/s)
    • Description: the maximum bandwidth the Balancer may use while running.
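To get a feel for the default bandwidth cap, here is a back-of-the-envelope calculation, assuming a single transfer running at exactly the configured limit:

```python
# dfs.balance.bandwidthPerSec default: 1048576 bytes/s (1 MiB/s)
bandwidth = 1048576
# Suppose 10 GiB of blocks must leave one DataNode (an assumed figure).
to_move = 10 * 1024 ** 3
seconds = to_move / bandwidth   # 10240 seconds
hours = seconds / 3600
print(round(hours, 2))          # prints 2.84
```

At the default limit, draining even a modest 10 GiB takes almost three hours, which is why the bandwidth is often raised temporarily during planned rebalancing.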

Example:

# Start data balancing; the default threshold is 10%
$HADOOP_HOME/bin/start-balancer.sh
# Start data balancing with a threshold of 5%
$HADOOP_HOME/bin/start-balancer.sh -threshold 5
# Stop data balancing
$HADOOP_HOME/bin/stop-balancer.sh

You can limit the network bandwidth used by data balancing in the hdfs-site.xml file:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>
    Specifies the maximum bandwidth that each datanode can utilize
    for the balancing purpose in terms of the number of bytes per second.
  </description>
</property>

