Large files and archives in Hadoop streaming
Tasks use the -cacheFile and -cacheArchive options to distribute files and archives across the cluster; each option takes as its argument the URI of a file or archive that the user has already uploaded to HDFS. These files and archives are cached across different jobs. The user can retrieve the host and fs_port values that the files reside on from the fs.default.name config variable.
Here is an example of using the -cacheFile option:
-cacheFile hdfs://host:fs_port/user/testfile.txt#testlink
In the above example, the part of the URI after the "#" is the name of the symbolic link created in the task's current working directory. Here the task's current working directory contains a symlink named "testlink", which points to the local copy of the testfile.txt file. For multiple files, the option can be repeated:
-cacheFile hdfs://host:fs_port/user/testfile1.txt#testlink1 -cacheFile hdfs://host:fs_port/user/testfile2.txt#testlink2
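To make the symlink behavior concrete, here is a minimal local sketch (assuming a POSIX shell) of what a task's working directory looks like after -cacheFile takes effect. The file name and contents are hypothetical stand-ins; nothing here actually involves Hadoop or HDFS.

```shell
# Local simulation only: mimic the task-side result of
# -cacheFile hdfs://host:fs_port/user/testfile.txt#testlink
set -e
workdir=$(mktemp -d)            # stands in for the task's working directory
cd "$workdir"
echo "hello from testfile" > testfile.txt   # stands in for the local HDFS copy
ln -s testfile.txt testlink                 # the framework creates this symlink
cat testlink                    # the task reads the file via the symlink name
```

The task never needs to know where the framework placed the local copy; it simply opens the symlink name given after the "#".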
The -cacheArchive option copies a jar file to the task's current working directory and automatically unjars it. For example:
-cacheArchive hdfs://host:fs_port/user/testfile.jar#testlink3
In the example above, testlink3 is a symbolic link in the current working directory that points to the directory into which testfile.jar was extracted.
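The difference from -cacheFile is that the symlink now points at a directory rather than a single file. A minimal local sketch of that layout, again with hypothetical names and no actual Hadoop involved:

```shell
# Local simulation only: mimic the task-side result of
# -cacheArchive hdfs://host:fs_port/user/testfile.jar#testlink3
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir extracted                          # stands in for the unjarred contents
echo "archived data" > extracted/data.txt
ln -s extracted testlink3                # symlink points at the directory
cat testlink3/data.txt                   # archive members are reached through it
```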
The following is a fuller example of using the -cacheArchive option. Here the input.txt file has two lines of content, naming two files: testlink/cache.txt and testlink/cache2.txt. "testlink" is a symbolic link to the archive directory (the directory into which the jar file was extracted), and that directory contains the two files cache.txt and cache2.txt.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input "/user/me/samples/cachefile/input.txt" \
    -mapper "xargs cat" \
    -reducer "cat" \
    -output "/user/me/samples/cachefile/out" \
    -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
    -jobconf mapred.map.tasks=1 \
    -jobconf mapred.reduce.tasks=1 \
    -jobconf mapred.job.name="Experiment"
$ ls test_jar/
cache.txt  cache2.txt
$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt (in = 30) (out= 29) (deflated 3%)
adding: cache2.txt (in = 37) (out= 35) (deflated 5%)
$ hadoop dfs -put cachedir.jar samples/cachefile
$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
Testlink/cache.txt
Testlink/cache2.txt
$ cat test_jar/cache.txt
This is just the cache string
$ cat test_jar/cache2.txt
This is just the second cache string
$ hadoop dfs -ls /user/me/samples/cachefile/out
Found 1 items
/user/me/samples/cachefile/out/part-00000  69
$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string
This is just the second cache string
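The reason the job output matches the two cache files can be reproduced locally: the "xargs cat" mapper reads each input line as a file name and cats that file, and "testlink" resolves those names into the extracted archive. A sketch of that pipeline, assuming a POSIX shell and using a local directory in place of the Hadoop-created symlink:

```shell
# Local sketch of what the "xargs cat" mapper does with the job's input lines.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir test_jar
echo "This is just the cache string" > test_jar/cache.txt
echo "This is just the second cache string" > test_jar/cache2.txt
ln -s test_jar testlink     # as -cacheArchive would create in the task dir
# input.txt's two lines, fed to the mapper command:
printf 'testlink/cache.txt\ntestlink/cache2.txt\n' | xargs cat
```

xargs collects the file names from standard input and invokes `cat testlink/cache.txt testlink/cache2.txt`, which yields exactly the two lines seen in part-00000 above.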