1) distcp (distributed copy) is a tool for copying large amounts of data in parallel, both within a cluster and between clusters.
2) The distcp command is implemented as a MapReduce job with no reduce tasks; the list of files and directories is the input to the map tasks. Each file is copied by a single map task, and distcp tries to bucket files of roughly equal total size into each map task, so that every map task copies roughly the same amount of data.
3) Copying between clusters (when both run the same HDFS version):
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
This command expands the /foo/bar directory tree on the nn1 cluster and stores the list of files and directories in a temporary file. The file contents are then divided among multiple map tasks, and each TaskTracker carries out its share of the copy from nn1 to nn2. (Note that distcp works with absolute paths.)
4) Copying from multiple source directories:
bash$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
5) Updating with -update and overwriting with -overwrite
By default, files that already exist with the same name at the destination are skipped. You can force files with the same name to be overwritten with the -overwrite option, or copy only changed files with the -update option. For more distcp options, run the hadoop distcp command with no arguments to see its usage message.
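As an illustration, here is how these options might look with the same example clusters used above (nn1, nn2, and the paths are just the illustrative names from the earlier commands, not real hosts):
bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
bash$ hadoop distcp -overwrite hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
The first command copies only files that are missing or have changed at the destination; the second rewrites every file even if one with the same name already exists there.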
6) Copying between different versions of HDFS
If the two clusters are running different versions of Hadoop, you cannot use the hdfs scheme to copy files, because the RPC systems are incompatible.
- One approach is to read the source data through hftp, a read-only filesystem built on HTTP (the command must run on the destination cluster so that the RPC versions are compatible). In the command you need to specify the namenode's HTTP port, which is set by dfs.http.address and defaults to 50070.
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
- Another approach is to use the webhdfs protocol (which replaces hftp), so that HTTP can be used on both the source and destination of the copy without worrying about version incompatibility:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
The number of map tasks is determined as follows:
1) Given the overhead of starting map tasks, each map task handles at least 256 MB of data (if the total input is less than 256 MB, all of it is given to a single map task). For example, 1 GB of input data is assigned to four map tasks.
2) If the data to be copied is very large, dividing it at 256 MB per map task would require far too many map tasks, so the number is instead capped at 20 map tasks per datanode. For example, with 1000 GB of input data and 100 nodes, distcp starts 100 * 20 = 2000 map tasks, each copying 512 MB on average. We can also set the number of map tasks explicitly with the -m option, as shown below; for example, -m 1000 uses 1000 map tasks, each copying about 1 GB.
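As a sketch of the -m option using the figures from the example above (nn1 and nn2 are again the illustrative cluster names from the earlier commands, not real hosts):
bash$ hadoop distcp -m 1000 hdfs://nn1:8020/foo hdfs://nn2:8020/bar
This limits the copy to 1000 map tasks, so with 1000 GB of input each map task copies about 1 GB of data.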