1) distcp (distributed copy) is a tool for copying large amounts of data in parallel, both within a cluster and between clusters.
2) The distcp command is implemented as a MapReduce job with no reduce tasks; the list of files and directories is the input to the map tasks. Each file is copied by a single map task, and distcp tries to bucket files of roughly equal total size into each map task, so that every map task copies roughly the same amount of data.
3) Copying between clusters (when both run the same HDFS version):
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
This command expands the /foo/bar directory tree on the nn1 cluster and stores the list of files and directories in a temporary file. The file contents are then divided among multiple map tasks, and each TaskTracker carries out its share of the copy from nn1 to nn2. (Note that distcp works with absolute paths.)
4) Copying from multiple source directories:
bash$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
5) Updating with -update and overwriting with -overwrite
By default, files that already exist with the same name at the destination are skipped. You can force files with the same name to be overwritten with the -overwrite option, or copy only changed files with the -update option. For more distcp options, run the hadoop distcp command with no arguments to see its usage message.
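As an illustration, here is how these options might look with the same example clusters used above (nn1, nn2, and the paths are just the illustrative names from the earlier commands, not real hosts):
bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
bash$ hadoop distcp -overwrite hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
The first command copies only files that are missing or have changed at the destination; the second rewrites every file even if one with the same name already exists there.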
6) Copying between different versions of HDFS
If the two clusters are running different versions of Hadoop, you cannot use the hdfs scheme to copy files, because the RPC systems are incompatible.
- One approach is to read the source data through hftp, a read-only filesystem built on HTTP (the command must run on the destination cluster so that the RPC versions are compatible). In the command you need to specify the namenode's HTTP port, which is set by dfs.http.address and defaults to 50070.
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
- Another approach is to use the webhdfs protocol (which replaces hftp), so that HTTP can be used on both the source and destination of the copy without worrying about version incompatibility:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
The number of map tasks is determined as follows:
1) Given the overhead of starting map tasks, each map task handles at least 256 MB of data (if the total input is less than 256 MB, all of it is given to a single map task). For example, 1 GB of input data is assigned to four map tasks.
2) If the data to be copied is very large, dividing it at 256 MB per map task would require far too many map tasks, so the number is instead capped at 20 map tasks per datanode. For example, with 1000 GB of input data and 100 nodes, distcp starts 100 * 20 = 2000 map tasks, each copying 512 MB on average. We can also set the number of map tasks explicitly with the -m option, as shown below; for example, -m 1000 uses 1000 map tasks, each copying about 1 GB.
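As a sketch of the -m option using the figures from the example above (nn1 and nn2 are again the illustrative cluster names from the earlier commands, not real hosts):
bash$ hadoop distcp -m 1000 hdfs://nn1:8020/foo hdfs://nn2:8020/bar
This limits the copy to 1000 map tasks, so with 1000 GB of input each map task copies about 1 GB of data.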