DistCp Command for HDFS

Last Update: 2016-01-19 Source: Internet

Author: User

Most HDFS interfaces, such as the Java API, are built around single-threaded access to files, so if you want to operate on a set of files you have to write your own program to do the work in parallel. Hadoop ships with a very useful program, distcp, for copying large amounts of data in parallel between Hadoop filesystems. A typical use of distcp is transferring data between two HDFS clusters. If both clusters run the same version of Hadoop, you can use the hdfs scheme:

hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

This command copies the /foo directory and its contents from the first cluster into the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo. If /bar does not exist, it is created first. You can also specify multiple source paths, and all of them are copied to the destination. Note that source paths must be absolute, for example hdfs://namenode1/foo.

By default, distcp skips files that already exist at the destination. You can overwrite them with the -overwrite option, or use the -update option to copy only the files that have changed.
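The choices above (one or more absolute sources, and skip, overwrite, or update behaviour) can be sketched as a small shell helper that only prints the command line for review instead of running it. The function name distcp_cmd, its mode keywords, and the /baz path are hypothetical names introduced for this illustration; only the -overwrite and -update flags come from distcp itself.

```shell
# Print (without running) a distcp command line.
# Usage: distcp_cmd MODE DEST SRC [SRC...]
#   MODE: skip (the default distcp behaviour), overwrite, or update
distcp_cmd() {
  if [ "$#" -lt 3 ]; then
    echo "usage: distcp_cmd MODE DEST SRC [SRC...]" >&2
    return 1
  fi
  local mode="$1" dest="$2" flag="" src
  shift 2
  case "$mode" in
    skip)      flag="" ;;
    overwrite) flag="-overwrite" ;;
    update)    flag="-update" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # Source paths must be absolute: accept a full URI or a leading slash only.
  for src in "$@"; do
    case "$src" in
      /*|*://*) ;;
      *) echo "source path is not absolute: $src" >&2; return 1 ;;
    esac
  done
  # $flag is left unquoted on purpose so that an empty flag disappears.
  echo hadoop distcp $flag "$@" "$dest"
}

# Two sources copied into /bar, skipping files that already exist (/baz is a made-up path).
distcp_cmd skip hdfs://namenode2/bar hdfs://namenode1/foo hdfs://namenode1/baz
# Copy only changed files from /foo.
distcp_cmd update hdfs://namenode2/bar/foo hdfs://namenode1/foo
```

The update example targets /bar/foo rather than /bar because the DistCp documentation describes -update and -overwrite as copying the contents of each source directory rather than the directory itself. It is worth checking that behaviour against the Hadoop version you run before relying on it.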

distcp has a number of other options, such as ignoring failures or limiting the number of files or the total amount of data copied. Run distcp with no arguments to see its usage instructions.

distcp is implemented as a MapReduce job in which the copying is done by map tasks running in parallel across the cluster; there are no reducers. Each file is copied by a single map, and distcp tries to give every map roughly the same amount of data by bucketing files into approximately equal allocations.

So how is the number of maps decided? Each map should copy a reasonable amount of data so that task setup overhead stays small, so as a rule each map copies at least 256 MB (unless the total input is smaller than that, in which case one map handles it all). For example, 1 GB of data is given 4 map tasks. When the amount of data is very large, the number of maps has to be limited in order to limit network bandwidth and cluster usage. By default, the maximum is 20 maps per cluster node. For example, copying 1,000 GB of data to a 100-node cluster allocates 2,000 maps (20 per node), so each map copies 512 MB on average. You can reduce the number of maps with the -m option of distcp: for example, -m 1000 allocates 1,000 maps in total, each copying 1 GB on average.
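The arithmetic behind these numbers can be checked with a short shell sketch. It follows only the rule as stated in the text (one map per full 256 MB, at most 20 maps per node, never fewer than one map) and is not taken from the DistCp source. The function name distcp_maps is hypothetical, and sizes are given in MB with 1 GB counted as 1,024 MB.

```shell
# Estimate the number of map tasks for a copy, per the rule described in the text.
# Usage: distcp_maps TOTAL_MB NODES
distcp_maps() {
  local total_mb="$1" nodes="$2"
  local per_map_mb=256 max_per_node=20
  # One map per full 256 MB, so that every map copies at least 256 MB.
  local maps=$(( total_mb / per_map_mb ))
  local cap=$(( nodes * max_per_node ))
  if [ "$maps" -gt "$cap" ]; then maps=$cap; fi   # limit bandwidth and cluster usage
  if [ "$maps" -lt 1 ]; then maps=1; fi           # small inputs still need one map
  echo "$maps"
}

distcp_maps 1024 100                               # 1 GB: prints 4
distcp_maps 1024000 100                            # 1,000 GB on 100 nodes: prints 2000
echo $(( 1024000 / $(distcp_maps 1024000 100) ))   # average MB per map: prints 512
echo $(( 1024000 / 1000 ))                         # with -m 1000: prints 1024 (1 GB per map)
```

Newer DistCp releases may size and schedule maps differently, so treat this as a way to follow the example in the text rather than a prediction for a particular cluster.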

If you try to use distcp between two HDFS clusters that run different versions of Hadoop, the copy fails when you use the hdfs scheme, because the RPC systems are incompatible. To get around this, you can read from the source through HFTP, the read-only, HTTP-based filesystem. The job must run on the destination cluster so that the HDFS RPC versions match. The same copy using HFTP looks like this:

hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar

Note that the source URI must include the namenode's web port. This port is set by the dfs.http.address property, and its default value is 50070.
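As a small illustration of the port requirement, the following sketch builds the HFTP source URI from a host and path, defaulting the port to 50070 as described above. The function name hftp_uri and the port 8080 used in the second call are hypothetical; the script prints the resulting distcp command instead of running it.

```shell
# Build an hftp:// source URI.
# Usage: hftp_uri HOST PATH [PORT]
# PORT defaults to 50070, the default the text gives for dfs.http.address.
hftp_uri() {
  local host="$1" path="$2" port="${3:-50070}"
  echo "hftp://${host}:${port}${path}"
}

# Default port.
echo hadoop distcp "$(hftp_uri namenode1 /foo)" hdfs://namenode2/bar
# A cluster whose dfs.http.address uses another port (8080 is a made-up value).
echo hadoop distcp "$(hftp_uri namenode1 /foo 8080)" hdfs://namenode2/bar
```

The first call prints the same command shown in the text. If the cluster overrides dfs.http.address, pass the configured port as the third argument.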

This article is an English version of an article originally published in Chinese on aliyun.com.
