The distcp command for HDFS

Many interfaces to HDFS, such as the Java API, focus on accessing one file at a time; to operate on a set of files in parallel, you would have to write a program yourself. Hadoop provides a very useful program, distcp, for copying large volumes of data in parallel within and between Hadoop filesystems. distcp is typically used to transfer data between two HDFS clusters. If both clusters run the same version of Hadoop, you can use the hdfs scheme:
hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This command copies the /foo directory from the first cluster, together with the files under it, to the /bar directory in the second cluster, so the data ends up under /bar/foo in the second cluster. If the /bar directory does not exist, it is created. You can also specify multiple source paths, and all of their contents will be copied to the destination path. Note that source paths must be absolute, such as hdfs://namenode1/foo.
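For example, a single run can copy from several source directories at once; the paths below are illustrative:

hadoop distcp hdfs://namenode1/foo hdfs://namenode1/baz hdfs://namenode2/bar

Both /foo and /baz would then appear under /bar on the second cluster.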
By default, distcp skips files that already exist at the destination path, but you can overwrite them with the -overwrite option, or use the -update option to copy only the files that have changed.
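As a sketch of an incremental copy, assuming a first run has already created /bar/foo (paths illustrative):

hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

Note that the destination here is given as /bar/foo rather than /bar, so that only the files that have changed under /foo are copied into the existing /bar/foo directory.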
The distcp command has a number of other options, such as ignoring failures or restricting the number of files or the amount of data copied. You can view the usage instructions by running the command with no options, that is, by entering distcp on its own.

When executed, distcp is expanded into a MapReduce job: there are no reducers, and the copying is performed in parallel across the cluster nodes by the map tasks. Each file is copied by a single map, and distcp bundles files into roughly equal-sized groups so that each map handles about the same amount of data.

So how is the number of maps decided? Since the system wants each map to handle enough data to amortize the cost of starting it, each map copies at least 256 MB (unless the total amount of data to copy is smaller than that). For example, to copy 1 GB of data, the system allocates 4 map tasks. When the amount of data is very large, you need to limit the number of map tasks in order to limit the network bandwidth and cluster load. By default, each cluster node runs up to 20 map tasks. For example, copying 1000 GB of data to a 100-node cluster allocates 2,000 map tasks (20 per node), so each map copies 512 MB on average. You can also reduce the number of map tasks with distcp's -m parameter; specifying -m 1000, for instance, allocates 1,000 maps, each copying 1 GB on average.
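For instance, to cap the 1000 GB example above at 1,000 maps (cluster details illustrative):

hadoop distcp -m 1000 hdfs://namenode1/foo hdfs://namenode2/bar

Each map would then copy about 1 GB on average; fewer, larger map tasks trade some parallelism for lower network and cluster load.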
If you use distcp to copy between HDFS clusters that are running different versions of Hadoop, the copy will fail because of a mismatch between the RPC systems. To work around this, you can read the source cluster through HFTP, an HDFS interface based on HTTP. The job must run on the destination cluster, so that it matches the destination's HDFS RPC version; in HFTP mode the command looks like this:
hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
Note that you must specify the namenode's web interface port in the source URI; it is set by the dfs.http.address property, whose default value is 50070.
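For reference, a minimal sketch of how this property would appear in the source cluster's hdfs-site.xml, assuming the default port:

<property>
  <name>dfs.http.address</name>
  <value>namenode1:50070</value>
</property>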