Hadoop HDFS Load Balancing


Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but it is highly fault tolerant, provides high-throughput data access, and is well suited to applications with large datasets.

HDFS replica placement policy

  • The first replica is placed on the DataNode where the uploading client runs; if the upload is submitted from outside the cluster, a node that is not too full and not too busy is chosen at random.
  • The second replica is placed on a node on a different rack from the first.
  • The third replica is placed on a different node on the same rack as the second.
  • Any additional replicas are placed on randomly chosen nodes.

Note that:

  • The replica count of a file stored in HDFS is fixed when the file is uploaded; changing the system-wide replication factor later does not change the replica count of files already stored.
  • The replica count specified on the upload command takes precedence; if none is specified, the default dfs.replication value from hdfs-site.xml is used.
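The placement rules above can be sketched in Python. This is a toy model, not HDFS's actual BlockPlacementPolicyDefault; the topology layout and node names are illustrative assumptions.

```python
import random

def choose_replica_nodes(topology, writer_node=None, replication=3):
    """Toy sketch of the default HDFS replica placement policy.
    topology maps rack name -> list of DataNode names."""
    all_nodes = [n for nodes in topology.values() for n in nodes]
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}

    # 1st replica: the writer's own DataNode, or a random node when the
    # client is outside the cluster.
    first = writer_node if writer_node in all_nodes else random.choice(all_nodes)
    chosen = [first]

    # 2nd replica: a node on a different rack from the first.
    remote = [n for n in all_nodes if rack_of[n] != rack_of[first]]
    if replication >= 2 and remote:
        chosen.append(random.choice(remote))

    # 3rd replica: a different node on the same rack as the second.
    if replication >= 3 and len(chosen) == 2:
        same_rack = [n for n in topology[rack_of[chosen[1]]] if n not in chosen]
        if same_rack:
            chosen.append(random.choice(same_rack))

    # Any further replicas: random nodes not yet chosen.
    while len(chosen) < replication:
        rest = [n for n in all_nodes if n not in chosen]
        if not rest:
            break
        chosen.append(random.choice(rest))
    return chosen
```

For example, with two racks of two nodes each and the writer on dn1, the first replica lands on dn1, and the second and third land on the other rack.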
HDFS Load Balancing

Hadoop HDFS clusters easily develop unbalanced disk utilization across machines, for example when nodes are added to or removed from the cluster, or when a node's disks fill up. When data is unbalanced, Map tasks may be scheduled on machines that do not hold the data they need, which wastes network bandwidth and defeats data-local computation.
When the HDFS load is unbalanced, the data distribution across the DataNodes must be rebalanced so that data is spread evenly, I/O load is balanced, and hot spots are avoided. The balancing process must satisfy the following principles:

  • Balancing must not reduce the number of data blocks or lose any block replicas.
  • An administrator must be able to abort the balancing process.
  • The amount of data and the network resources consumed by each move must be controllable.
  • The balancing process must not interfere with the normal operation of the NameNode.
Principles of Hadoop HDFS data load balancing

The core of the balancing process is an algorithm that repeatedly applies the balancing logic until the data in the cluster is balanced. Each iteration proceeds as follows:

  1. The Rebalancing Server asks the NameNode to generate a DataNode data-distribution report, giving the disk usage of each DataNode.
  2. The Rebalancing Server works out which data must move and computes a concrete block-migration plan that keeps network paths as short as possible.
  3. The block-migration task starts: the Rebalancing Server tells the Proxy Source DataNode which blocks to copy.
  4. The Proxy Source DataNode copies each block to the target DataNode.
  5. The original block is deleted from the source DataNode.
  6. The target DataNode confirms to the Proxy Source DataNode that the block migration is complete.
  7. The Proxy Source DataNode confirms completion to the Rebalancing Server. The process then repeats until the cluster meets the balancing threshold.
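The iterate-until-balanced loop above can be simulated with a small sketch. The numbers, units (MB), and the one-block-per-iteration simplification are assumptions; the real Balancer moves blocks over the network and throttles bandwidth.

```python
def rebalance(used, capacity, threshold=10.0, max_iters=100):
    """Toy simulation of the Balancer loop: repeatedly move one 128 MB
    block from the most over-utilized node to the most under-utilized
    node until every node is within the threshold of the cluster mean."""
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    for _ in range(max_iters):
        pct = {n: 100.0 * used[n] / capacity[n] for n in used}
        over = [n for n in pct if pct[n] > cluster_pct + threshold]
        under = [n for n in pct if pct[n] < cluster_pct - threshold]
        if not over or not under:
            break  # cluster is balanced (or no valid source/target pair)
        src = max(over, key=lambda n: pct[n])
        dst = min(under, key=lambda n: pct[n])
        # Move one "block" per iteration (128 MB, the common HDFS default).
        block = min(used[src], 128)
        used[src] -= block
        used[dst] += block
    return used
```

With two 1000 MB nodes holding 900 MB and 100 MB, the cluster mean is 50%, and three moves bring both nodes inside the 40-60% band.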



DataNode Group
In step 1, HDFS divides the DataNodes into four groups based on the threshold. When blocks are moved, blocks on nodes in the Over and Above groups move to nodes in the Below and Under groups. The four groups are defined as follows:

  • Over group: every DataNode in this group satisfies

DataNode_usedSpace_percent > Cluster_usedSpace_percent + threshold

  • Above group: every DataNode in this group satisfies

Cluster_usedSpace_percent + threshold > DataNode_usedSpace_percent > Cluster_usedSpace_percent

  • Below group: every DataNode in this group satisfies

Cluster_usedSpace_percent > DataNode_usedSpace_percent > Cluster_usedSpace_percent - threshold

  • Under group: every DataNode in this group satisfies

Cluster_usedSpace_percent - threshold > DataNode_usedSpace_percent
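The four inequalities map directly onto a classification function. This is a minimal sketch; the group names and dictionary layout are my own.

```python
def classify(datanode_pct, cluster_pct, threshold):
    """Assign each DataNode to one of the four groups defined above,
    using the same strict inequalities."""
    groups = {"over": [], "above": [], "below": [], "under": []}
    for node, pct in datanode_pct.items():
        if pct > cluster_pct + threshold:
            groups["over"].append(node)       # far above the cluster mean
        elif pct > cluster_pct:
            groups["above"].append(node)      # above mean, within threshold
        elif pct > cluster_pct - threshold:
            groups["below"].append(node)      # below mean, within threshold
        else:
            groups["under"].append(node)      # far below the cluster mean
    return groups
```

For a cluster mean of 50% and a threshold of 10, nodes at 65%, 55%, 45%, and 30% land in the Over, Above, Below, and Under groups respectively.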

How to use the Hadoop HDFS automatic data balancing script

Hadoop ships with a start-balancer.sh script that starts the HDFS data balancing service. The tool can be run at any time without restarting the machines or the Hadoop services. The script lives in the $HADOOP_HOME/bin directory, and the start command is:

$HADOOP_HOME/bin/start-balancer.sh -threshold <value>

Several parameters that affect Balancer:

  • -threshold
    • Default value: 10. Value range: 0-100.
    • Description: the threshold used to decide whether the cluster is balanced. In theory, the smaller the value, the more evenly balanced the cluster becomes, but the longer balancing takes.
  • dfs.balance.bandwidthPerSec
    • Default value: 1048576 (1 MB/s)
    • Description: the maximum bandwidth the Balancer may use while running.
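To get a feel for the default bandwidth cap, here is a back-of-the-envelope calculation, assuming a single transfer running at exactly the configured limit:

```python
# dfs.balance.bandwidthPerSec default: 1048576 bytes/s (1 MiB/s)
bandwidth = 1048576
# Suppose 10 GiB of blocks must leave one DataNode (an assumed figure).
to_move = 10 * 1024 ** 3
seconds = to_move / bandwidth   # 10240 seconds
hours = seconds / 3600
print(round(hours, 2))          # prints 2.84
```

At the default limit, draining even a modest 10 GiB takes almost three hours, which is why the bandwidth is often raised temporarily during planned rebalancing.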

Example:

# Start data balancing; the default threshold is 10%
$HADOOP_HOME/bin/start-balancer.sh
# Start data balancing with a threshold of 5%
$HADOOP_HOME/bin/start-balancer.sh -threshold 5
# Stop data balancing
$HADOOP_HOME/bin/stop-balancer.sh

You can limit the network bandwidth used by data balancing in the hdfs-site.xml file:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>
    Specifies the maximum bandwidth that each datanode can utilize
    for the balancing purpose in terms of the number of bytes per second.
  </description>
</property>

