Introduction to the Hadoop HDFS balancer

Source: Internet
Author: User

Hadoop HDFS clusters are prone to unbalanced disk utilization across machines, for example after new data nodes are added to a cluster. An unbalanced HDFS causes many problems: MapReduce programs cannot take full advantage of data-local computation, network bandwidth between machines is used less efficiently, and the disks of some machines go unused. Keeping data balanced in HDFS is therefore very important.

Hadoop includes a balancer program. Running it brings the HDFS cluster into a balanced state. The command is as follows:

sh $HADOOP_HOME/bin/start-balancer.sh -threshold 10

In this command, the -threshold parameter is the allowed deviation in disk usage, as a percentage, at which HDFS is considered balanced. With a value of 10, we consider the HDFS cluster balanced once the disk usage of every machine is within 10 percentage points of the disk usage of the cluster as a whole.
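The balance test can be sketched in a few lines of Python. This is only an illustration, assuming that each machine is compared against the cluster-wide utilization; the function name `is_balanced` and the node figures are hypothetical and are not part of Hadoop.

```python
def is_balanced(utilization, threshold_pct=10.0):
    """Return True if every node is within threshold_pct points of the cluster figure.

    utilization maps a node name to a (used, capacity) pair in the same unit.
    """
    total_used = sum(used for used, _ in utilization.values())
    total_capacity = sum(cap for _, cap in utilization.values())
    cluster_pct = 100.0 * total_used / total_capacity
    return all(
        abs(100.0 * used / cap - cluster_pct) <= threshold_pct
        for used, cap in utilization.values()
    )


# Hypothetical three-node cluster; the numbers are illustrative only.
nodes = {"dn1": (80, 100), "dn2": (75, 100), "dn3": (10, 100)}
print(is_balanced(nodes))
```

Here the cluster as a whole is 55% full while dn1 is 80% full, so the check fails for a threshold of 10.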

The Hadoop developers followed these principles when developing the balancer program:

1. During data redistribution, no data may be lost, the number of replicas may not change, and the number of blocks in each rack may not change.

2. The system administrator can start or stop the data redistribution program with a single command.

3. Moving blocks must not consume too many resources, such as network bandwidth.

4. The normal operation of the Name node must not be affected while the data redistribution program is running.
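Principle 1 can be expressed as a check on a single block's replica placement before and after a move. The sketch below follows the principle as it is stated in this article; the function name `preserves_principle_1` and the node and rack names are hypothetical.

```python
from collections import Counter


def preserves_principle_1(replicas_before, replicas_after, rack_of):
    """Check one block's replica placement before and after a move.

    replicas_before / replicas_after: lists of node names holding the block.
    rack_of: dict mapping a node name to its rack name.
    """
    # No replica may be lost or added.
    if len(replicas_after) != len(replicas_before):
        return False
    # No node may hold two replicas of the same block.
    if len(set(replicas_after)) != len(replicas_after):
        return False
    # The number of replicas in each rack must stay the same.
    racks_before = Counter(rack_of[n] for n in replicas_before)
    racks_after = Counter(rack_of[n] for n in replicas_after)
    return racks_before == racks_after


# Hypothetical layout: three nodes in rack1, two nodes in rack2.
rack_of = {"a1": "rack1", "a2": "rack1", "a3": "rack1", "b1": "rack2", "b2": "rack2"}
before = ["a1", "a2", "b1"]
print(preserves_principle_1(before, ["a3", "a2", "b1"], rack_of))  # a1 -> a3, same rack: True
print(preserves_principle_1(before, ["b2", "a2", "b1"], rack_of))  # a1 -> b2, other rack: False
```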

Based on these principles, the logic currently implemented by the Hadoop data redistribution program is as follows.

The rebalance program (the rebalance server) runs as an independent process, separate from the Name node.

1. The rebalance server obtains the status of all data nodes from the Name node: the disk usage of each data node.

2. The rebalance server calculates which machines need to move data out and which machines can accept the migrated data, and obtains from the Name node the distribution of the data to be moved.

3. The rebalance server calculates which blocks can be moved from which machine to which other machine.

4-6. The machines holding the blocks to be moved copy the data to the target machines and then delete the block data from their own disks.

7. The rebalance server obtains the result of this round of data movement and repeats the process until no more data can be moved or the HDFS cluster has reached the balance standard.
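The loop described by these steps can be simulated in a few lines. This is a simplified sketch, not the actual Hadoop implementation: it assumes equally sized blocks, moves a single block per round from the fullest node to the emptiest one, ignores racks, and uses a `max_rounds` guard so that it always terminates. The function name `rebalance` and all parameters are hypothetical.

```python
def rebalance(used, capacity, threshold_pct=10.0, max_rounds=10_000):
    """Simulate the rebalance loop on block counts.

    used / capacity: dicts mapping a node name to a number of blocks.
    Returns (final_used, moves), where moves is a list of (source, target).
    """
    used = dict(used)  # work on a copy of the reported usage
    moves = []
    for _ in range(max_rounds):
        # Step 1: collect the disk usage of every data node.
        pct = {n: 100.0 * used[n] / capacity[n] for n in used}
        cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
        # Exit test of step 7: stop once every node is within the threshold.
        if all(abs(p - cluster_pct) <= threshold_pct for p in pct.values()):
            break
        # Steps 2-3: pick the machine to move data from and the one to move it to.
        source = max(pct, key=pct.get)
        target = min(pct, key=pct.get)
        # Exit test of step 7: stop if no data can be moved.
        if used[source] == 0 or used[target] >= capacity[target]:
            break
        # Steps 4-6: copy one block to the target, then delete it at the source.
        used[target] += 1
        used[source] -= 1
        moves.append((source, target))
    return used, moves


# Hypothetical cluster of three equally sized nodes.
final, moves = rebalance({"dn1": 90, "dn2": 80, "dn3": 10},
                         {"dn1": 100, "dn2": 100, "dn3": 100})
print(final, len(moves))
```

The total number of blocks is the same before and after the run; only their placement changes.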

Hadoop's existing balancer program works well in most cases.

Now imagine the following situation:

1. Data is stored in three replicas.

2. HDFS consists of two racks.

3. The disk configurations of the machines in the two racks differ. Each machine in the first rack has 1 TB of disk space, and each machine in the second rack has 10 TB.

4. Currently, two replicas of most data are stored in the first rack.

Under these circumstances, the data in the HDFS cluster is certainly unbalanced. If we now run the balancer program, we find that the data in the HDFS cluster is still unbalanced after it finishes: the remaining disk space in rack 1 is far smaller than in rack 2.

This is caused by development principle 1 of the balancer program.

Simply put, the balancer never moves a block from one rack to another, so it can never bring this HDFS cluster into balance.
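A small calculation makes the problem concrete. It assumes, as this article states, that data can only move within a rack, so the best the balancer can do is bring every machine to its own rack's average. The machine counts and data volume below are hypothetical, and `best_case_utilization` is an illustrative helper, not a Hadoop function.

```python
def best_case_utilization(rack_used_tb, rack_capacity_tb):
    """Utilization (%) of each rack when blocks may only move within a rack.

    Moves inside a rack cannot change how much data the rack holds, so the
    rack's own average is the best value its machines can reach.
    """
    return {
        rack: 100.0 * rack_used_tb[rack] / rack_capacity_tb[rack]
        for rack in rack_used_tb
    }


# Hypothetical sizes: 10 machines per rack, 4 TB of data with three replicas,
# two replicas in rack1 (1 TB machines) and one in rack2 (10 TB machines).
capacity = {"rack1": 10 * 1.0, "rack2": 10 * 10.0}
used = {"rack1": 2 * 4.0, "rack2": 1 * 4.0}

per_rack = best_case_utilization(used, capacity)
cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
for rack, pct in per_rack.items():
    print(f"{rack}: {pct:.1f}% used, {pct - cluster_pct:+.1f} points from the cluster figure")
```

With these numbers rack1 is 80% full and rack2 is 4% full, while the cluster as a whole is about 10.9% full, so the machines in rack1 stay far outside a 10% threshold no matter how long the balancer runs.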

In this case, you can adopt one of the following solutions:

1. Continue to use the existing balancer program, but change how machines are assigned to racks: spread the machines with small disks across different racks.

2. Modify the balancer program so that it is allowed to change the number of blocks in each rack, reducing the number of blocks in the rack that is short of disk space and moving them to a rack with sufficient disk space.

For more articles about Hadoop, see: http://www.cnblogs.com/gpcuster/tag/Hadoop/
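For the first solution, Hadoop decides which rack a machine belongs to by calling an administrator-supplied topology script (configured through `net.topology.script.file.name` in Hadoop 2 and later, `topology.script.file.name` in older releases). The script receives host names or IP addresses as arguments and prints one rack path per argument. The sketch below is one possible shape for such a script; the host names and rack paths are made up for illustration, and mixing small and large machines in each rack is only an assumption about how the redistribution might be done.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack table: small-disk and large-disk machines are
# mixed so that no rack consists of small machines only.
RACK_OF = {
    "small-01": "/rack-a",
    "small-02": "/rack-b",
    "big-01": "/rack-a",
    "big-02": "/rack-b",
}
DEFAULT_RACK = "/default-rack"  # used for hosts missing from the table


def resolve(hosts):
    """Return the rack path for each host, in the order given."""
    return [RACK_OF.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads the racks from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

After a change like this, blocks that were already written keep their old placement until they are moved, so the effect on balance is gradual.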
