Hadoop Streaming -archives decompression of jar, zip, tar.gz: validation


1. -archives Function Description

Hadoop's DistributedCache (see the reference articles at the end) can distribute specified files to the working directory of each task. Files whose names end in ".jar", ".zip", ".tar.gz", or ".tgz" are decompressed automatically. By default the extracted contents go into a directory under the working directory named after the archive: if the package is dict.zip, its contents are extracted into a directory named dict.zip. To avoid this, you can give the file an alias (a soft link), such as dict.zip#dict, so that the package is extracted into the directory dict instead.
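The naming rule can be simulated locally. This is not Hadoop itself, only an illustration of the extraction-directory naming described above; the archive name dict.tar.gz and the alias dict are made up for the sketch, and tar stands in for zip.

```shell
# Local simulation of the -archives extraction rule (illustrative only, not Hadoop code).
mkdir -p src && echo "hello" > src/words.txt
tar czf dict.tar.gz -C src .              # stand-in for dict.zip

# No alias: Hadoop would extract into a directory named after the archive itself.
mkdir -p task1/dict.tar.gz
tar xzf dict.tar.gz -C task1/dict.tar.gz
ls task1/dict.tar.gz                      # words.txt

# With "-archives ...dict.tar.gz#dict": same contents, exposed under the alias.
mkdir -p task2/dict
tar xzf dict.tar.gz -C task2/dict
ls task2/dict                             # words.txt
```

In the real job, task code would then read `dict.tar.gz/words.txt` in the first case and `dict/words.txt` in the second.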

2. Test jar file (basically a direct excerpt of the reference document)

$ ls test_jar/
file  file1  file2
$ cat test_jar/file
this is file1    (the experiment went wrong here: file1 should have been used; it does not affect the results, so it was left unchanged)
$ cat test_jar/file2
this is file2
$ jar cvf cache.jar -C test_jar/ .
$ hdfs dfs -put cache.jar /user/work/cachefile
# touch an input.txt file, then put it to /user/work/cachefile
$ hdfs dfs -cat /user/root/samples/cachefile/input.txt
cache/file      (cache is the name of the extracted directory, the alias redefined with #; see below)
cache/file2

HADOOP_HOME=/home/hadoop/hadoop-2.3.0-cdh5.1.3
$HADOOP_HOME/bin/hadoop fs -rmr /cacheout/
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.3.0-cdh5.1.3.jar \
    -archives /user/work/cachefile/cache.jar#cache \
    -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.name="Experiment" \
    -input "cachefile/input2.txt" \
    -output "/cacheout/" \
    -mapper "xargs cat" \
    -reducer "cat"

$ hadoop fs -cat /cacheout/*
this is file2
this is file1

3. Test Zip & tar.gz

Package the same files as a zip and a tar.gz archive separately, put them on HDFS, and continue testing.

-archives /user/work/cachefile/cache.tar.gz#cache \    (only the archive suffix is changed, but the job reports a file-not-found error)

Troubleshooting: to confirm whether decompression succeeds at all, change the mapper to:

-mapper "ls cache" \

Findings: for the jar file, the result lists 4 entries: META-INF, file, file1, file2.

For zip & tar.gz: only one entry, the directory name test_jar.

Then compare how the three packages were built. Decompression clearly succeeded in every case; the file-not-found error is a directory problem. The jar was packed with its contents at the archive root (jar cvf cache.jar -C test_jar/ .), while the zip and tar.gz were packed with the test_jar directory itself, so their files end up one level deeper (cache/test_jar/file1 instead of cache/file1). The three packaging methods will be examined in detail separately and are not repeated here.
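The directory problem can be reproduced locally with tar alone (zip and jar behave the same way with respect to member paths; the file contents follow the test above):

```shell
mkdir -p test_jar
echo "this is file1" > test_jar/file1
echo "this is file2" > test_jar/file2

# Packed like the jar above (-C test_jar/ .): entries sit at the archive root,
# so Hadoop extracts them to cache/file1 and cache/file2.
tar czf cache_root.tar.gz -C test_jar .
tar tzf cache_root.tar.gz               # lists ./file1 and ./file2

# Packed with the directory itself: entries are one level deeper, so they
# extract to cache/test_jar/file1 and the job's cache/file paths are not found.
tar czf cache_dir.tar.gz test_jar
tar tzf cache_dir.tar.gz                # lists test_jar/file1 and test_jar/file2
```

Repacking the zip and tar.gz from inside the directory (as cache_root.tar.gz is built here) makes them behave like the jar.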


Summary: -archives is a very useful parameter, but pay particular attention to the directory layout of the archive when using it.



Reference:

http://blog.javachen.com/2015/02/12/hadoop-streaming.html

http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html#Working_with_Large_Files_and_Archives

http://dongxicheng.org/mapreduce-nextgen/hadoop-distributedcache-details/


