Hadoop distcp

Overview

DistCp (distributed copy) is a tool for copying large amounts of data within and between clusters. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into the input of map tasks, each of which copies a partition of the files in the source list. Because it runs as a Map/Reduce job, the tool has some peculiarities in both semantics and execution. This document describes common DistCp operations and its working model.

Basic usage

The most common use of DistCp is an inter-cluster copy:

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
    hdfs://nn2:8020/bar/foo

This command expands the namespace under /foo/bar on the nn1 cluster into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each TaskTracker from nn1 to nn2. Note that DistCp operates on absolute paths.

Multiple source directories can be specified on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
    hdfs://nn1:8020/foo/b \
    hdfs://nn2:8020/bar/foo

Or use the -f option to read multiple sources from a file:

bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
    hdfs://nn2:8020/bar/foo

Where srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
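As a sketch, such a srclist can be assembled locally and then staged into HDFS before running the copy. The URIs and paths here are assumptions, and the hadoop steps require a running cluster:

```shell
# Build the source list locally (URIs assumed for illustration).
cat > /tmp/srclist <<'EOF'
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
EOF

# On a live cluster, stage the list and run the copy:
#   hadoop fs -put /tmp/srclist hdfs://nn1:8020/srclist
#   hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo

# Each line must be a fully qualified URI; count the entries.
wc -l < /tmp/srclist
```
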

When copying from multiple sources, DistCp stops the copy with an error message if two sources collide; collisions at the destination are resolved according to the option settings. By default, files already existing at the destination are skipped (i.e. they are not replaced by the source file). The number of skipped files is reported at the end of each job, but the count may be inaccurate if some copies failed on earlier attempts and succeeded later (see the appendix).

Each TaskTracker must be able to reach and communicate with both the source and destination file systems. For HDFS, the source and destination must run the same protocol version or use a backwards-compatible protocol (see Copying between different HDFS versions).

After a copy, it is recommended to generate and cross-check listings of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in or between any of the three can silently affect the copy. Some copies have been completed successfully by re-running the command with the -update flag, but users should understand its semantics before doing so.
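One way to do such a cross-check is sketched below, assuming the cluster URIs used above and the recursive `hadoop fs -lsr` listing of these Hadoop versions (file size in column 5, path in column 8 of its output). This requires a running cluster:

```shell
# Sketch: list (size, relative path) pairs on both sides and diff them.
# URIs and paths are assumptions.
hadoop fs -lsr hdfs://nn1:8020/foo/bar \
  | awk '{print $5, $8}' | sed 's|/foo/bar||' | sort > /tmp/src.list
hadoop fs -lsr hdfs://nn2:8020/bar/foo \
  | awk '{print $5, $8}' | sed 's|/bar/foo||' | sort > /tmp/dst.list

# No output from diff means sizes and relative paths match.
diff /tmp/src.list /tmp/dst.list
```
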

It is also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written to on HDFS should also fail. If a source file is moved or deleted before it is copied, the copy fails with a FileNotFoundException.
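Several of the options in the index below are commonly combined in practice. As a sketch (the cluster URIs, paths, map count, and log directory here are all assumptions):

```shell
# Preserve file status, ignore failures, cap at 20 maps,
# and keep per-file copy logs in a directory on the destination cluster.
hadoop distcp -p -i -m 20 \
    -log hdfs://nn2:8020/logs \
    hdfs://nn1:8020/foo/bar \
    hdfs://nn2:8020/bar/foo
```
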

Option index

-p[rbugp]        Preserve status:
                   r: replication number
                   b: block size
                   u: user
                   g: group
                   p: permission
                 Modification times are not preserved. When -update is
                 specified, status updates are not synchronized unless the
                 file sizes also differ (i.e. unless the file is re-created).

-i               Ignore failures. As described in the appendix, this option
                 keeps more accurate statistics about the copy than the
                 default case. It also preserves logs from failed copies,
                 which can be valuable for debugging. Finally, a failing map
                 will not cause the job to fail before all splits are
                 attempted.

-log <logdir>    Write logs to <logdir>. DistCp keeps a log of each file it
                 attempts to copy as the output of the map. If a map fails,
                 the log is not retained when it is re-executed.

-m <num_maps>    Maximum number of simultaneous copies. Specifies the number
                 of maps used to copy data. Note that more maps does not
                 necessarily mean higher throughput.

-overwrite       Overwrite the destination. If a map fails and -i is not
                 specified, all files in the split, not only those that
                 failed, are copied again. As discussed below, it also
                 changes the semantics of the generated target paths, so
                 users should use it carefully.

-update          Overwrite if the source and destination sizes differ. As
                 noted earlier, this is not a "sync" operation. The only
                 criterion examined is the file size; if it differs, the
                 source file replaces the destination file. As discussed
                 below, it also changes the semantics of the generated
                 target paths, so users should use it carefully.

-f <urilist_uri> Use the list at <urilist_uri> as the source list. This is
                 equivalent to listing each source on the command line. The
                 urilist_uri list should be a fully qualified URI.

Update and overwrite

The following examples illustrate the semantics of -update and -overwrite. Consider a copy from /foo/a and /foo/b to /bar/foo, where the sources contain:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/a/aa
hdfs://nn1:8020/foo/a/ab
hdfs://nn1:8020/foo/b
hdfs://nn1:8020/foo/b/ba
hdfs://nn1:8020/foo/b/ab

If either -update or -overwrite is set, the contents of the source directories, rather than the directories themselves, are copied to the target, so both sources map to an entry at /bar/foo/ab on the destination. DistCp aborts such a collision with an error message.

Without either option, the directories /bar/foo/a and /bar/foo/b are created at the destination, so there is no conflict.

Now consider a legal copy using -update:

bash$ hadoop distcp -update hdfs://nn1:8020/foo/a \
    hdfs://nn1:8020/foo/b \
    hdfs://nn2:8020/bar

With sources/sizes:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/a/aa 32
hdfs://nn1:8020/foo/a/ab 32
hdfs://nn1:8020/foo/b
hdfs://nn1:8020/foo/b/ba 64
hdfs://nn1:8020/foo/b/bb 32

And destination/sizes:

hdfs://nn2:8020/bar
hdfs://nn2:8020/bar/aa 32
hdfs://nn2:8020/bar/ba 32
hdfs://nn2:8020/bar/bb 64

The copy will produce:

hdfs://nn2:8020/bar
hdfs://nn2:8020/bar/aa 32
hdfs://nn2:8020/bar/ab 32
hdfs://nn2:8020/bar/ba 64
hdfs://nn2:8020/bar/bb 32

Only aa is not overwritten on nn2, since its size matches the source. If -overwrite is specified, all files are overwritten.

Appendix

Number of maps

DistCp attempts to divide the work so that each map copies roughly the same amount of data. However, since files are the smallest unit of copy granularity, increasing the number of simultaneous copiers (i.e. maps) does not always increase the number of simultaneous copies or the aggregate throughput.

If -m is not specified, DistCp attempts to schedule work with min(total_bytes / bytes.per.map, 20 * num_task_trackers) maps, where bytes.per.map defaults to 256MB.
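As a worked example of that formula, assuming the scheduler caps maps at 20 per TaskTracker (the job size and cluster size below are made-up numbers):

```shell
# Assumed job: 100 GB to copy on a 25-TaskTracker cluster.
total_bytes=$((100 * 1024 * 1024 * 1024))
bytes_per_map=$((256 * 1024 * 1024))   # bytes.per.map default: 256 MB

num_task_trackers=25
by_size=$((total_bytes / bytes_per_map))   # maps suggested by data volume
by_cluster=$((20 * num_task_trackers))     # maps suggested by cluster size

# DistCp schedules the minimum of the two.
if [ "$by_size" -lt "$by_cluster" ]; then maps=$by_size; else maps=$by_cluster; fi
echo "$maps"
```

Here the data volume is the binding constraint: 100 GB / 256 MB gives 400 maps, below the 500-map cluster cap.
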

For long-running or regularly run jobs, it is recommended to tune the number of maps based on the source and destination cluster sizes, the size of the copy, and the available bandwidth.

Copying between different HDFS versions

For copying between two different versions of Hadoop, one should use HftpFileSystem. This is a read-only file system, so DistCp must be run on the destination cluster (more specifically, on TaskTrackers that can write to the destination cluster). A source is specified in the form hftp://<dfs.http.address>/<path> (the default dfs.http.address is <namenode>:50070).
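For instance, to pull data from an older source cluster (the addresses here are assumptions), the job is launched on the destination cluster:

```shell
# Read from the source over HFTP (dfs.http.address, default port 50070);
# write to the destination HDFS. Run this on the destination cluster.
hadoop distcp hftp://nn1:50070/foo/bar hdfs://nn2:8020/bar/foo
```
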

Map/Reduce and side effects

As mentioned earlier, a map that fails to copy one of its inputs has several side effects.

Unless -i is specified, the logs generated by the task are replaced by those of the re-executed attempt. Unless -overwrite is specified, files successfully copied by a previous map are marked as "skipped" on a re-execution. If a map fails mapred.map.max.attempts times, the remaining map tasks are killed (unless -i is set). If mapred.speculative.execution is set final and true, the result of the copy is undefined.