HDFS Parallel replication

1) distcp (distributed copy) is a tool for copying large amounts of data in parallel, both within a single cluster and between clusters.

2) The distcp command is implemented as a MapReduce job with no reduce tasks; the list of files and directories to copy is the input to the map tasks. Each file is copied by a single map task, and distcp tries to assign roughly the same total amount of data to each map task, so every map task copies approximately the same volume.

3) Copying between clusters (when both run the same HDFS version):

bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

This command expands the list of files and directories under /foo/bar on the nn1 cluster and stores it in a temporary file, then divides the copy work for those files among multiple map tasks; each task, running on a TaskTracker, performs its copy from nn1 to nn2. Note that distcp works with absolute paths.
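As a quick sanity check (a hypothetical step, reusing the nn2 address from the example above), you can list the destination directory on the target cluster after the job finishes:

bash$ hadoop fs -ls hdfs://nn2:8020/bar/foo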

4) Copying from multiple source directories:

bash$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo

5) -update and -overwrite

By default, a file is skipped if a file with the same name already exists at the copy destination. You can force files with the same name to be overwritten with the -overwrite option, or copy only files that have changed with the -update option. For more distcp usage, run the hadoop distcp command without arguments to see its help text.
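For example (a sketch, reusing the same nn1 and nn2 clusters as above):

bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
bash$ hadoop distcp -overwrite hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo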

6) Copying between different versions of HDFS

If the two clusters run different Hadoop versions, you cannot use the hdfs scheme to copy files, because their RPC systems are incompatible.

    • One approach is to use the read-only, HTTP-based hftp file system to read the source data (the command must be run on the destination cluster so that the RPC versions are compatible). The command needs to specify nn1's HTTP port, which is set by dfs.http.address and defaults to 50070.
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
    • Another approach is to use the webhdfs protocol (which replaces hftp), so that HTTP can be used at both the source and the destination of the copy without worrying about version incompatibility:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar

The number of map tasks is determined as follows:

1) Given the overhead of creating map tasks, each map task handles at least 256 MB of data (if the total input is less than 256 MB, all of the input data goes to a single map task). For example, 1 GB of input data is assigned four map tasks to copy.

2) If the data to be copied is very large, it cannot be divided purely by the 256 MB-per-map-task criterion, because that could require creating an enormous number of map tasks. Instead the data is divided by allowing 20 map tasks per datanode. For example, with 1000 GB of input data and 100 nodes, 100 * 20 = 2000 map tasks are started, each copying 512 MB of data. You can also specify the number of map tasks explicitly with the -m option; for example, -m 1000 starts 1000 map tasks, each copying 1 GB of data.
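For example, to cap a copy at 1000 map tasks (a sketch reusing the hypothetical nn1 and nn2 addresses from earlier):

bash$ hadoop distcp -m 1000 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo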
