Block data balancer redistribution in HDFS

Source: Internet
Author: User
Tags: disk usage

After a Hadoop HDFS cluster has been running for a while, the disk usage of the DataNodes inevitably becomes unbalanced, i.e. the data volume is skewed across nodes.

There are many possible causes, for example:

1. A new DataNode is added to the cluster

2. The number of data replicas is manually decreased or increased

As we all know, when data imbalance occurs in HDFS, applications such as MapReduce or Spark cannot take full advantage of local computation, network bandwidth between DataNodes is not used efficiently, and some DataNodes may even run out of disk space.

Hadoop provides the HDFS balancer program to keep HDFS data balanced, so let's take a look at the parameters of this program:

hdfs balancer --help

Usage: hdfs balancer

    [-policy <policy>]    the balancing policy: datanode or blockpool

    [-threshold <threshold>]    Percentage of disk capacity

    [-exclude [-f <hosts-file> | <comma-separated list of hosts>]]    Excludes the specified datanodes.

    [-include [-f <hosts-file> | <comma-separated list of hosts>]]    Includes only the specified datanodes.

    [-idleiterations <idleiterations>]    Number of consecutive idle iterations (-1 for Infinite) before exit.

    [-runDuringUpgrade]    Whether to run the balancer during an ongoing HDFS upgrade. This is usually not desired since it won't affect used space on over-utilized machines.

Generic options supported are

    -conf <configuration file>    specify an application configuration file

    -D <property=value>    use value for given property

    -fs <local|namenode:port>    specify a namenode

    -jt <local|resourcemanager:port>    specify a ResourceManager

    -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

    -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath

    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

The meaning of each option should be clear from its description. The -threshold parameter is the criterion for judging whether the data is balanced; its value ranges from 0 to 100, and the default is 10. This means that HDFS tolerates a disk usage deviation of up to 10%: if the difference in disk usage between machines is less than 10%, the HDFS cluster is considered to have reached a balanced state.
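For example, if you want the balancer to aim for a tighter 5% deviation, you could run something like the following (5 is only an illustrative value; pick a threshold that suits your cluster):

hdfs balancer -threshold 5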

On the CDH platform we can see the default value and description of this parameter in Cloudera Manager (CM):

The specific meaning of this parameter is: the cluster is judged balanced when, for every DataNode, the difference between its storage utilization and the overall cluster storage utilization is smaller than this threshold. For example, if the overall cluster utilization is 50% and the threshold is 10, the balancer tries to bring every DataNode's utilization into the 40%-60% range. In theory, the smaller the value, the more balanced the cluster becomes; but in a production environment data is written and deleted concurrently while the cluster is being balanced, so the configured threshold may never actually be reached.

The -policy parameter specifies the balancing policy and defaults to datanode.

The specific meaning of this parameter is: the policy applied when rebalancing HDFS storage. The default datanode policy balances storage at the DataNode level and is similar to the balancing strategy of previous releases. The blockpool policy balances storage at the block pool level as well as at the DataNode level, and it applies only to a Federated HDFS service.
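On a federated cluster, for instance, the policy could be switched like this (illustrative only; a non-federated cluster should simply keep the default datanode policy):

hdfs balancer -policy blockpool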

The -exclude and -include parameters are used to select the DataNodes the balancer works on: you can specify which DataNodes to redistribute data between, or exclude from the rebalancing those nodes in the HDFS cluster that do not need it, for example:

hdfs balancer -include cdhd,cdha,cdhm,cdht,cdho
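The exclude form works the same way, and both options also accept a hosts file via -f. As a sketch (excluded-hosts.txt is a hypothetical file listing one hostname per line):

hdfs balancer -exclude -f /tmp/excluded-hosts.txt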

Besides the options above, the following configuration parameters also affect HDFS data redistribution:

dfs.datanode.balance.bandwidthPerSec / dfs.balance.bandwidthPerSec

The default setting is 1048576 (1 MB/s). My personal advice: if the machine's network cards or switch bandwidth are limited, you can lower this value appropriately; otherwise the default is generally fine.

This parameter has the following meaning:

The HDFS balancer detects over- or under-utilized DataNodes in the cluster and moves data blocks between them to keep the load balanced. If the balancing operation is not bandwidth-constrained, it will quickly grab all available network resources and leave nothing for MapReduce jobs or data ingestion. The parameter dfs.balance.bandwidthPerSec defines the maximum bandwidth each DataNode may use for balancing operations. Its unit is bytes, which is not very intuitive because network bandwidth is usually described in bits, so do the conversion before setting it. DataNodes use this parameter to limit their network bandwidth usage, but unfortunately it is only read when the daemon starts, so an administrator cannot change the configured value while balancing is running; adjusting it requires a restart.
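As a persistent change, the property is set in hdfs-site.xml on the DataNodes and takes effect after they are restarted; a limit of roughly 10 MB/s corresponds to the value 10485760 (the figure is only an illustration). On Hadoop 2.x you can also adjust the limit at runtime, without a restart, using the dfsadmin command below; note that a value set this way is not persisted across DataNode restarts:

hdfs dfsadmin -setBalancerBandwidth 10485760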

Here is a quick introduction to how the balancer works:

The rebalance program runs as a separate process, independent of the NameNode.

Step 1:

The Rebalance Server obtains the status of all DataNodes from the NameNode: the disk usage of every DataNode.

Step 2:

The Rebalance Server calculates which machines need data moved off them and which machines can accept data, and obtains from the NameNode the distribution of the data that needs to be moved.

Step 3:

The Rebalance Server works out which blocks on which machines can be moved to which other machines.

Step 4,5,6:

Each machine that needs to move a block copies the data to the target machine and then deletes the block from its own disks.

Step 7:

The Rebalance Server collects the results of this round of data movement and repeats the process until there is no more data to move or the HDFS cluster has reached the balance criterion.
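To see the per-DataNode disk usage figures that the balancer works from (step 1 above), the standard report command is enough; the grep filter is just one convenient way to trim its output:

hdfs dfsadmin -report | grep "DFS Used%"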

Hands-on example:

Find a relatively idle DataNode to run the balancer on; it is recommended not to run it on the NameNode:

hdfs balancer -include cdhd,cdha,cdhm,cdht,cdho

Part of the output is shown below; reading the log against the steps described above should make the process clearer:

16/07/11 09:35:12 INFO balancer.Balancer: namenodes = [hdfs://cdhb:8022]
16/07/11 09:35:12 INFO balancer.Balancer: parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 5, run during upgrade = false]
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/07/11 09:35:14 INFO net.NetworkTopology: Adding a new node: /default/192.168.1.130:50010
16/07/11 09:35:14 INFO net.NetworkTopology: Adding a new node: /default/192.168.1.131:50010
16/07/11 09:35:14 INFO net.NetworkTopology: Adding a new node: /default/192.168.1.135:50010
16/07/11 09:35:14 INFO net.NetworkTopology: Adding a new node: /default/192.168.1.138:50010
16/07/11 09:35:14 INFO net.NetworkTopology: Adding a new node: /default/192.168.1.139:50010
16/07/11 09:35:14 INFO balancer.Balancer: 2 over-utilized: [192.168.1.130:50010:DISK, 192.168.1.135:50010:DISK]
16/07/11 09:35:14 INFO balancer.Balancer: 1 underutilized: [192.168.1.131:50010:DISK]
16/07/11 09:35:14 INFO balancer.Balancer: Need to move 203.48 GB to make the cluster balanced.
16/07/11 09:35:14 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.130:50010:DISK to 192.168.1.131:50010:DISK
16/07/11 09:35:14 INFO balancer.Balancer: Decided to move 10 GB bytes from 192.168.1.135:50010:DISK to 192.168.1.138:50010:DISK
16/07/11 09:35:14 INFO balancer.Balancer: Will move 20 GB in this iteration
16/07/11 09:36:00 INFO balancer.Dispatcher: Successfully moved blk_1074048042_307309 with size=134217728 from 192.168.1.130:50010:DISK to 192.168.1.131:50010:DISK through 192.168.1.130:50010
16/07/11 09:36:07 INFO balancer.Dispatcher: Successfully moved blk_1074049886_309153 with size=134217728 from 192.168.1.135:50010:DISK to 192.168.1.138:50010:DISK through 192.168.1.135:50010
16/07/11 09:36:09 INFO balancer.Dispatcher: Successfully moved blk_1074048046_307313 with size=134217728 from 192.168.1.130:50010:DISK to 192.168.1.131:50010:DISK through 192.168.1.130:50010
16/07/11 09:36:10 INFO balancer.Dispatcher: Successfully moved blk_1074049900_309167 with size=134217728 from 192.168.1.135:50010:DISK to 192.168.1.138:50010:DISK through 192.168.1.135:50010
16/07/11 09:36:16 INFO balancer.Dispatcher: Successfully moved blk_1074048061_307328 with size=134217728 from 192.168.1.130:50010:DISK to 192.168.1.131:50010:DISK through 192.168.1.130:50010
16/07/11 09:36:17 INFO balancer.Dispatcher: Successfully moved blk_1074049877_309144 with size=134217728 from 192.168.1.135:50010:DISK to 192.168.1.138:50010:DISK through 192.168.1.135:50010
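Since a full rebalance can take hours, it is common to run the balancer in the background on the chosen DataNode and follow its log; this is just one way to do it, and the log file path is arbitrary:

nohup hdfs balancer -include cdhd,cdha,cdhm,cdht,cdho > /tmp/balancer.log 2>&1 &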

If you are using the CDH platform, you can also perform data redistribution through Cloudera Manager (CM):

Step 1: Open the HDFS service page, as shown below:

Step 2: Find the Actions menu on the right side of the page and choose the Rebalance option from the drop-down list.

Step 3: Confirm the rebalance to start redistributing DataNode block data according to the default settings; the specific execution process can be followed in the CM logs.
