Introduction to the Hadoop HDFS balancer

Last Update:2018-12-07 Source: Internet

Author: User


Hadoop HDFS clusters are prone to unbalanced disk utilization between machines, for example after new data nodes are added to a cluster. When HDFS is unbalanced, several problems occur: MapReduce programs cannot take full advantage of local computation, network bandwidth between machines is not used well, and the disks of some machines go unused. Keeping data balanced in HDFS is therefore very important.

Hadoop includes a balancer program. Running this program brings the HDFS cluster into a balanced state. The command for using this program is as follows:

sh $HADOOP_HOME/bin/start-balancer.sh -t 10%

In this command, the -t parameter specifies the disk usage deviation at which HDFS is considered to have reached a balanced state. If the disk usage deviation between machines is less than 10%, the HDFS cluster is considered balanced.
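The threshold check can be sketched in a few lines of Python. This is only an illustration, not the balancer's actual code: it assumes that the deviation is measured between each machine's utilization and the utilization of the cluster as a whole, and the function names and node figures are made up for the example.

```python
def cluster_utilization(nodes):
    """Percent of total capacity in use; nodes maps name -> (used, capacity)."""
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    return 100.0 * total_used / total_capacity


def is_balanced(nodes, threshold=10.0):
    """True if every node deviates from the cluster utilization by less than threshold."""
    average = cluster_utilization(nodes)
    return all(
        abs(100.0 * used / capacity - average) < threshold
        for used, capacity in nodes.values()
    )


# Hypothetical cluster: usage and capacity in GB.
skewed = {"dn1": (800, 1000), "dn2": (750, 1000), "dn3": (100, 1000)}
even = {"dn1": (600, 1000), "dn2": (550, 1000), "dn3": (500, 1000)}

print(is_balanced(skewed))  # dn3 is far below the cluster utilization
print(is_balanced(even))    # every node is within 10 points
```

With a larger threshold the cluster counts as balanced sooner and less data has to move; with a smaller one the balancer has more work to do.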

Hadoop developers followed these principles when developing the balancer program:

1. During data redistribution, no data may be lost, the number of replicas of the data may not change, and the number of blocks in each rack may not change.

2. The system administrator can start or stop the data redistribution program with a single command.

3. Moving a block must not consume too many resources, such as network bandwidth.

4. The data redistribution program must not affect the normal operation of the NameNode while it runs.

Based on these principles, the logic flow currently implemented by the Hadoop data redistribution program is as follows:

The rebalance program runs as an independent process, separate from the NameNode.

1. The rebalance server obtains information about all data nodes from the NameNode: the disk usage of each data node.

2. The rebalance server calculates which machines need to move data away and which machines can accept the moved data. It then obtains the distribution of the data to be moved from the NameNode.

3. The rebalance server calculates which blocks can be moved from which machine to which other machine.

4, 5, 6. The machines that hold blocks to be moved transfer the data to the target machines and delete the block data from their own disks.

7. The rebalance server obtains the result of this round of data movement and repeats the process until no more data can be moved or the HDFS cluster meets the balance standard.
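The loop described in these steps can be illustrated with a small simulation. It is a deliberately simplified sketch, not the Hadoop implementation: it assumes all machines have the same capacity, ignores racks and replicas, and moves one block per round from the fullest machine to the emptiest one. The function name and the sample cluster are hypothetical.

```python
def rebalance(nodes, capacity, threshold=10.0):
    """Move blocks between nodes until balanced or until nothing can be moved.

    nodes maps a node name to a list of block sizes and is modified in place.
    Returns the list of moves as (block_size, source, target) tuples.
    """
    moves = []
    limit = capacity * threshold / 100.0
    while True:
        # Collect the current disk usage of every node.
        used = {name: sum(blocks) for name, blocks in nodes.items()}
        average = sum(used.values()) / len(nodes)
        # Stop once every node is within the threshold.
        if all(abs(u - average) < limit for u in used.values()):
            break
        # Pick the fullest node as source and the emptiest as target.
        source = max(used, key=used.get)
        target = min(used, key=used.get)
        gap = used[source] - used[target]
        # Only blocks smaller than the gap bring the two nodes closer together.
        movable = [b for b in nodes[source] if b < gap]
        if not movable:
            break  # no data can be moved
        # Copy the block to the target and delete it on the source.
        block = max(movable)
        nodes[source].remove(block)
        nodes[target].append(block)
        moves.append((block, source, target))
    return moves


# Hypothetical cluster: three nodes of capacity 100, blocks of size 10.
cluster = {"dn1": [10] * 8, "dn2": [10] * 2, "dn3": [10] * 2}
history = rebalance(cluster, capacity=100)
print(len(history), {name: sum(blocks) for name, blocks in cluster.items()})
```

The total amount of data is the same before and after the run; only its placement changes.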

Hadoop's existing balancer program works in most cases.

Now imagine the following situation:

1. Data is stored in three replicas.

2. HDFS consists of two racks.

3. The disk configurations of the machines in the two racks are different. Each machine in the first rack has 1 TB of disk space, and each machine in the second rack has 10 TB.

4. Currently, two replicas of most data are stored in the first rack.

Under such circumstances, the data in the HDFS cluster is definitely unbalanced. If we now run the balancer program, we find that the data in the entire HDFS cluster is still unbalanced after it finishes: the free disk space in rack 1 is much smaller than in rack 2.

This is due to development principle 1 of the balancer program.

Simply put, the balancer program does not move data from one rack to another, so it will never be able to balance this HDFS cluster.
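A quick calculation shows how large the gap can become. The numbers below are assumptions chosen for illustration (10 machines per rack and 4 TB of unique data); only the disk sizes and the replica placement come from the scenario above.

```python
def rack_utilization(data_tb, replicas_in_rack, machines, disk_tb_per_machine):
    """Percent of a rack's total disk space occupied by the replicas it holds."""
    stored_tb = data_tb * replicas_in_rack
    capacity_tb = machines * disk_tb_per_machine
    return 100.0 * stored_tb / capacity_tb


data_tb = 4    # assumed amount of unique data, in TB
machines = 10  # assumed number of machines in each rack

# Rack 1: 1 TB disks, holds two of the three replicas.
rack1 = rack_utilization(data_tb, replicas_in_rack=2, machines=machines, disk_tb_per_machine=1)
# Rack 2: 10 TB disks, holds the remaining replica.
rack2 = rack_utilization(data_tb, replicas_in_rack=1, machines=machines, disk_tb_per_machine=10)

print(rack1, rack2)
```

Because the number of blocks in each rack stays fixed under principle 1, these rack-level figures are the same before and after the balancer runs; it can only even out usage among the machines inside each rack.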

In this case, you can adopt one of the following solutions:

1. Continue to use the existing balancer program, but change how machines are assigned to racks: spread the machines with small disks across different racks.

2. Modify the balancer program so that it is allowed to change the number of blocks in each rack: reduce the number of blocks in the rack that is short on disk space by moving them to another rack with sufficient disk space.
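For the first solution, rack membership in Hadoop is commonly defined by a topology script that receives host names or IP addresses as arguments and prints one rack path per argument (in Hadoop 2 and later the script is configured with net.topology.script.file.name). The following is a minimal sketch of such a script; the addresses and rack names are hypothetical, and the mapping simply places small-disk and large-disk machines side by side in each rack.

```python
import sys

# Hypothetical mapping: each rack mixes small-disk and large-disk machines.
RACK_OF = {
    "10.0.1.11": "/rack-a",  # 1 TB machine
    "10.0.2.21": "/rack-a",  # 10 TB machine
    "10.0.1.12": "/rack-b",  # 1 TB machine
    "10.0.2.22": "/rack-b",  # 10 TB machine
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return the rack path for each host, in the order the hosts were given."""
    return [RACK_OF.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

Changing the rack mapping of a running cluster affects block placement, so such a change should be tried on a test cluster first.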

For more articles about Hadoop, refer to: http://www.cnblogs.com/gpcuster/tag/Hadoop/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
