1. -archives function description:
One of the DistributedCache mechanisms in Hadoop (covered in more detail in one of the reference articles below) distributes specified files to the working directory of each task. Files whose names end in ".jar", ".zip", ".tar.gz", or ".tgz" are automatically unpacked; by default the contents are extracted into a directory under the working directory named after the archive itself. For example, if the archive is dict.zip, its contents are extracted into a directory named dict.zip. You can also give the archive an alias (a soft link) with "#", e.g. dict.zip#dict, in which case the archive is unpacked into the directory dict.
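The alias behavior can be simulated locally (a sketch only: the directory names and the tar.gz archive here are made up, and unpacking by hand merely mimics what the NodeManager does for "-archives dict.tar.gz#dict"):

```shell
set -e
cd "$(mktemp -d)"
mkdir src
echo "hello" > src/words.txt
tar czf dict.tar.gz -C src .   # pack the directory *contents*, not the directory
mkdir dict                     # "dict" stands in for the #dict alias
tar xzf dict.tar.gz -C dict    # Hadoop unpacks the archive into the alias dir
cat dict/words.txt
```

After this, the task would see the file as dict/words.txt relative to its working directory.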
2. Testing a jar file (largely taken directly from the reference documentation)
$ ls test_jar/
file  file1  file2

Here file contains "this is file1" (a mistake made during the experiment: file1 should have been used, but it does not affect the result, so it is left unmodified) and file2 contains "this is file2".

$ jar cvf cache.jar -C test_jar/ .
$ hdfs dfs -put cache.jar /user/work/cachefile
# create an input.txt file, then put it to /user/work/cachefile
$ hdfs dfs -cat /user/root/samples/cachefile/input.txt
cache/file
cache/file2

(cache is the name of the directory the archive is unpacked into, i.e. the alias redefined with #; it is used again below.)

HADOOP_HOME=/home/hadoop/hadoop-2.3.0-cdh5.1.3
$HADOOP_HOME/bin/hadoop fs -rmr /cacheout/
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.3.0-cdh5.1.3.jar \
    -archives /user/work/cachefile/cache.jar#cache \
    -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.name="Experiment" \
    -input "cachefile/input2.txt" \
    -output "/cacheout/" \
    -mapper "xargs cat" \
    -reducer "cat"

$ hadoop fs -cat /cacheout/*
this is file2
this is file1
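What the mapper actually does can be checked locally without a cluster (a sketch: the cache directory below is created by hand to stand in for the unpacked cache.jar#cache archive). Each input line is a path inside the unpacked archive, and "xargs cat" prints the contents of that file:

```shell
set -e
cd "$(mktemp -d)"
mkdir cache                         # stands in for the unpacked cache.jar#cache
echo "this is file1" > cache/file
echo "this is file2" > cache/file2
printf 'cache/file\ncache/file2\n' > input.txt
cat input.txt | xargs cat           # what the streaming mapper does per split
```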
3. Testing zip & tar.gz
Package the same directory as a zip and as a tar.gz separately, put them on HDFS, and continue the test.
-archives /user/work/cachefile/cache.tar.gz#cache \

Changing only the archive suffix in the command above makes the job fail with a file-not-found error.
Troubleshooting: to confirm whether the archive was actually decompressed, change the mapper to:
-mapper "ls cache" \
Findings: for the jar file, the listing shows 4 entries: META-INF, file, file1, file2.
For the zip and tar.gz files there is only one entry: the directory name test_jar.
Inspecting the contents of the three archives shows that decompression succeeded in every case; the reason the files were not found is a directory-layout problem, which comes down to how each of the three packages was created. The three packaging methods deserve a detailed study of their own and are not elaborated further here.
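The directory problem can be reproduced locally. The difference is whether the archive was built from *inside* the directory (like "jar cvf cache.jar -C test_jar/ .") or *around* it (like zipping or tarring the test_jar directory itself). A sketch using tar for both cases (the cache_flat and cache_nested directories stand in for the #cache alias directory Hadoop would create):

```shell
set -e
cd "$(mktemp -d)"
mkdir test_jar
echo "this is file1" > test_jar/file1
echo "this is file2" > test_jar/file2

tar czf flat.tar.gz -C test_jar .   # like "jar cvf cache.jar -C test_jar/ ."
tar czf nested.tar.gz test_jar      # like tarring/zipping the directory itself

mkdir cache_flat cache_nested       # each stands in for the #cache alias dir
tar xzf flat.tar.gz   -C cache_flat
tar xzf nested.tar.gz -C cache_nested

ls cache_flat     # file1 file2  -> the path cache/file1 resolves
ls cache_nested   # test_jar     -> cache/file1 does NOT exist
```

With the nested layout the input lines would have to reference cache/test_jar/file1 instead of cache/file1, which is why the unmodified job reports file-not-found.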
Summary: -archives is a very useful parameter, but pay particular attention to the directory layout when using it.
Reference:
http://blog.javachen.com/2015/02/12/hadoop-streaming.html
http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html#Working_with_Large_Files_and_Archives
http://dongxicheng.org/mapreduce-nextgen/hadoop-distributedcache-details/