Hadoop streaming -cacheFile and -cacheArchive options

Large files and archives in Hadoop streaming

Tasks can use the -cacheFile and -cacheArchive options to distribute files and archives across the cluster. The argument to each option is the URI of a file or archive that the user has already uploaded to HDFS. These files and archives are cached across jobs. The host and fs_port values in the URI come from the fs.default.name configuration variable.
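What -cacheFile arranges for each task can be sketched locally: Hadoop copies the HDFS file to the node and creates a symbolic link (named by the part after "#") in the task's working directory. The sketch below simulates that with a temporary directory; the file name and contents are made up for illustration.

```python
import os
import tempfile

# Simulated task working directory and local copy of the cached file.
workdir = tempfile.mkdtemp()
local_copy = os.path.join(workdir, "testfile.txt")  # stands in for the localized HDFS file
with open(local_copy, "w") as f:
    f.write("lookup-data\n")

# This is what -cacheFile hdfs://.../testfile.txt#testlink arranges:
# a symlink named "testlink" in the working directory.
symlink = os.path.join(workdir, "testlink")
os.symlink(local_copy, symlink)

# A mapper or reducer simply opens the symlink as an ordinary local file.
with open(symlink) as f:
    print(f.read().strip())  # prints: lookup-data
```

The task code never needs to know where the localized copy actually lives; it only refers to the stable symlink name.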

Here is an example of using the -cacheFile option:

-cacheFile hdfs://host:fs_port/user/testfile.txt#testlink

In the example above, the part of the URI after the # is the name of the symbolic link created in the task's current working directory. Here the task's current working directory contains a symbolic link named "testlink", which points to the local copy of the testfile.txt file. To distribute multiple files, repeat the option:

-cacheFile hdfs://host:fs_port/user/testfile1.txt#testlink1 -cacheFile hdfs://host:fs_port/user/testfile2.txt#testlink2

The -cacheArchive option copies a jar file to the task's current working directory and automatically unpacks it. For example:

-cacheArchive hdfs://host:fs_port/user/testfile.jar#testlink3

In the example above, testlink3 is a symbolic link in the current working directory, pointing to the directory into which testfile.jar was extracted.
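The archive case works the same way, except the symlink points to a directory. A minimal local sketch, using a zip archive to stand in for the jar (jar files are zip files) and made-up file names:

```python
import os
import tempfile
import zipfile

tmp = tempfile.mkdtemp()

# Build a small archive, standing in for testfile.jar on HDFS.
src = os.path.join(tmp, "src")
os.makedirs(src)
with open(os.path.join(src, "cache.txt"), "w") as f:
    f.write("hello\n")
archive = os.path.join(tmp, "testfile.jar")
with zipfile.ZipFile(archive, "w") as z:
    z.write(os.path.join(src, "cache.txt"), arcname="cache.txt")

# Hadoop unpacks the archive into a local directory...
extracted = os.path.join(tmp, "extracted")
with zipfile.ZipFile(archive) as z:
    z.extractall(extracted)

# ...and creates a symlink (testlink3) to it in the task's working directory.
workdir = os.path.join(tmp, "work")
os.makedirs(workdir)
os.symlink(extracted, os.path.join(workdir, "testlink3"))

# The task lists and reads files through the symlinked directory name.
print(sorted(os.listdir(os.path.join(workdir, "testlink3"))))
```

Because the symlink targets a directory, the task can address any file inside the archive as testlink3/<name>.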

The following is another example of using the -cacheArchive option. Here, the input.txt file has two lines, each naming a file: testlink/cache.txt and testlink/cache2.txt. "testlink" is a symbolic link to the archive directory (the directory where the jar file was extracted), which contains the two files cache.txt and cache2.txt.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                  -input "/user/me/samples/cachefile/input.txt" \
                  -mapper "xargs cat" \
                  -reducer "cat" \
                  -output "/user/me/samples/cachefile/out" \
                  -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                  -jobconf mapred.map.tasks=1 \
                  -jobconf mapred.reduce.tasks=1 \
                  -jobconf mapred.job.name="Experiment"
$ ls test_jar/
cache.txt  cache2.txt
$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt (in = 30) (out= 29) (deflated 3%)
adding: cache2.txt (in = 37) (out= 35) (deflated 5%)
$ hadoop dfs -put cachedir.jar samples/cachefile
$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
testlink/cache.txt
testlink/cache2.txt
$ cat test_jar/cache.txt
This is just the cache string
$ cat test_jar/cache2.txt
This is just the second cache string
$ hadoop dfs -ls /user/me/samples/cachefile/out
Found 1 items
/user/me/samples/cachefile/out/part-00000     69
$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string
This is just the second cache string
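The data flow of the session above can be replayed locally without a cluster: each input line names a path under the testlink symlink, the "xargs cat" mapper emits the contents of each named file, and the identity reducer ("cat") passes them through. A sketch, assuming the file names and contents from the example:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()

# Stand-in for the extracted cachedir.jar contents.
cachedir = os.path.join(tmp, "cachedir")
os.makedirs(cachedir)
for name, text in [("cache.txt", "This is just the cache string"),
                   ("cache2.txt", "This is just the second cache string")]:
    with open(os.path.join(cachedir, name), "w") as f:
        f.write(text + "\n")

# -cacheArchive ...cachedir.jar#testlink creates this symlink for the task.
os.symlink(cachedir, os.path.join(tmp, "testlink"))

# The job's input: each line is a path relative to the working directory.
input_lines = ["testlink/cache.txt", "testlink/cache2.txt"]

os.chdir(tmp)  # tasks resolve the symlink from their working directory
for path in input_lines:       # what `xargs cat` does with the input lines
    with open(path) as f:
        print(f.read(), end="")  # the "cat" reducer passes records through
```

This reproduces the two lines seen in part-00000, which is why the job's output is simply the concatenated contents of the cached files.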
