"Reprint" Ramble about Hadoop HDFS BALANCER

Source: Internet
Author: User
Tags disk usage

Hadoop HDFS clusters are prone to unbalanced disk utilization between machines, for example after new DataNodes are added to a cluster. An unbalanced HDFS causes many problems: MapReduce programs cannot take full advantage of data-local computation, network bandwidth between machines is used less efficiently, the disks of some machines go underused, and so on. It is therefore important to keep the data in HDFS balanced.

Hadoop ships with a balancer program that brings an HDFS cluster to a balanced state. It is started with the following command:

sh $HADOOP_HOME/bin/start-balancer.sh -threshold 10

The -threshold parameter in this command is the disk usage deviation, in percent, at which HDFS is considered balanced. With a value of 10, the cluster is considered balanced once the disk usage of each machine deviates from the overall cluster usage by no more than 10%.
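The balance criterion can be illustrated with a small sketch. It assumes that each machine's disk usage is compared against the overall usage of the cluster; the function name and the sample figures are hypothetical and are not taken from the Hadoop source.

```python
def is_balanced(used, capacity, threshold_pct=10.0):
    """Return True if every node is within threshold_pct points of the cluster usage."""
    cluster_util = 100.0 * sum(used) / sum(capacity)
    for node_used, node_cap in zip(used, capacity):
        node_util = 100.0 * node_used / node_cap
        if abs(node_util - cluster_util) > threshold_pct:
            return False
    return True


# Hypothetical cluster of three 1000 GB nodes; the third was just added and is empty.
capacity = [1000, 1000, 1000]
print(is_balanced([800, 700, 0], capacity))      # False: the empty node is 50 points below average
print(is_balanced([520, 480, 500], capacity))    # True: all nodes are within 2 points of 50%
```

A smaller threshold gives a more evenly filled cluster but requires more data to be moved before the criterion is met.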

The Hadoop developers followed these principles when designing the balancer program:

1. During data redistribution, no data may be lost, the number of replicas of the data must not change, and the number of blocks in each rack must not change.

2. The system administrator can start or stop the data redistribution program with a single command.

3. Moving blocks must not take up too many resources, such as network bandwidth.

4. The data redistribution program must not affect the normal operation of the NameNode while it runs.

Based on these principles, the logical flow of the current Hadoop data redistribution program is as follows:

The rebalance program runs as a separate process, apart from the NameNode.

1 The rebalance server obtains the state of all DataNodes from the NameNode: the disk usage of each DataNode.

2 The rebalance server calculates which machines need to move data out and which machines can accept data, and obtains from the NameNode the distribution of the data that needs to be moved.

3 The rebalance server calculates which blocks on a machine can be moved to another machine.

4, 5, 6 The machine that needs to move a block copies the data to the destination machine and deletes the block data from its own disk.

7 The rebalance server obtains the result of this data movement and repeats the process until no more data can be moved or the HDFS cluster meets the balance criterion.
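One pass of this flow can be sketched roughly as below. This is a simplified illustration, not the actual Hadoop implementation: it works with amounts of data rather than individual blocks, it ignores the rack constraint and bandwidth limits, and the function name, node names, and figures are all hypothetical.

```python
def plan_moves(nodes, threshold_pct=10.0):
    """nodes maps a node name to (used, capacity); returns (source, target, amount) tuples."""
    # Step 1: collect the disk usage of every node and derive the cluster average.
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    avg = total_used / total_cap
    t = threshold_pct / 100.0

    # Step 2: nodes above the allowed band must give data away;
    # nodes below the average can accept data.
    sources = {n: used - avg * cap for n, (used, cap) in nodes.items() if used / cap > avg + t}
    targets = {n: avg * cap - used for n, (used, cap) in nodes.items() if used / cap < avg}

    # Steps 3 to 6: pair each source with targets until its surplus is placed.
    moves = []
    for src, surplus in sources.items():
        for dst in targets:
            if surplus <= 0:
                break
            amount = min(surplus, targets[dst])
            if amount > 0:
                moves.append((src, dst, amount))
                targets[dst] -= amount
                surplus -= amount
    return moves


# Hypothetical nodes, sizes in GB.
nodes = {"dn1": (900, 1000), "dn2": (450, 1000), "dn3": (150, 1000)}
for move in plan_moves(nodes):
    print(move)
```

Step 7 would correspond to calling such a planning function repeatedly with fresh usage figures until it returns no moves.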

This balancer program works well in most cases.

Now consider the following situation:

1 Data is stored with 3 replicas.

2 HDFS consists of 2 racks.

3 The machines in the 2 racks have different disk configurations: each machine in the first rack has 1 TB of disk space, and each machine in the second rack has 10 TB.

4 For most data, 2 of the replicas are stored in the first rack.

In this case, the data in the HDFS cluster is certainly unbalanced. If we now run the balancer program, we find that the data in the HDFS cluster is still unbalanced afterwards: the free disk space in rack1 remains far smaller than in rack2.

This is a consequence of principle 1 of the balancer program.

Simply put, the balancer program never moves data from one rack to another while it runs, so it will never be able to balance this HDFS cluster.
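A short calculation makes the problem concrete. The number of machines per rack and the total data size below are hypothetical values chosen only for illustration; the disk sizes and the replica placement follow the scenario above.

```python
def rack_utilization(stored_tb, machines, disk_tb):
    """Percentage of a rack's total disk space that is in use."""
    return 100.0 * stored_tb / (machines * disk_tb)


data_tb = 3.0    # hypothetical size of the data before replication
machines = 10    # hypothetical number of machines in each rack

rack1 = rack_utilization(2 * data_tb, machines, 1.0)    # two replicas on 1 TB machines
rack2 = rack_utilization(1 * data_tb, machines, 10.0)   # one replica on 10 TB machines
cluster = 100.0 * 3 * data_tb / (machines * 1.0 + machines * 10.0)

print(f"rack1: {rack1:.1f}%  rack2: {rack2:.1f}%  cluster: {cluster:.1f}%")
# prints: rack1: 60.0%  rack2: 3.0%  cluster: 8.2%
```

If the amount of data in each rack is fixed, moving blocks between machines of the same rack can even out usage inside that rack, but rack1 as a whole stays at about 60% while the cluster average is about 8%, far outside a 10% threshold.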

For this situation, there are 2 options:

1 Keep using the existing balancer program, but change how machines are distributed across racks: spread the machines with small disks over different racks.

2 Modify the balancer program so that it may change the number of blocks in each rack, reducing the number of blocks stored in the rack with little disk space by moving them to the rack with more disk space.
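For the first option, one possible approach is to change the rack assignment that Hadoop's rack awareness reads from a topology script (configured through `net.topology.script.file.name` in recent releases). The sketch below assumes such a script is in use; the host addresses, rack names, and the assignment itself are hypothetical. The script receives host addresses as arguments and prints one rack per address.

```python
#!/usr/bin/env python3
import sys

# Hypothetical assignment: 10.0.1.x are 1 TB machines, 10.0.2.x are 10 TB machines.
# Each logical rack now contains a mix of small-disk and large-disk machines.
RACK_BY_HOST = {
    "10.0.1.1": "/rack-a",
    "10.0.2.1": "/rack-a",
    "10.0.1.2": "/rack-b",
    "10.0.2.2": "/rack-b",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Map each host address to its rack, falling back to a default rack."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per host address passed on the command line.
    print("\n".join(resolve(sys.argv[1:])))
```

Note that the rack assignment also drives replica placement, so logical racks that no longer match the physical racks weaken the protection against the failure of a physical rack.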

For more articles on Hadoop, refer to: http://www.cnblogs.com/gpcuster/tag/Hadoop/
