"Hadoop" streaming file distribution and packaging


If an executable, script, or configuration file that a program needs at run time does not already exist on the compute nodes of the Hadoop cluster, you first need to distribute those files to the cluster for the job to run successfully. Hadoop provides a mechanism for automatically distributing files and archives: you simply set the appropriate options when starting the streaming job. The options are introduced and compared below.

More: http://hadoop.apache.org/mapreduce/docs/current/streaming.html

Distributing files with -file

We generally use -file to distribute the mapper and reducer programs, configuration files, and even dictionary files. The -file /path/to/filename option uploads the local file /path/to/filename to the cluster and distributes it to each compute node, where it lands in the task's default temporary working directory. Locally executable files other than the specified mapper or reducer program may lose their execute permission after distribution, so restore it with chmod +x ./filename inside a wrapper such as mapper.sh, and then set -mapper "mapper.sh".
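A minimal sketch of such a job, assuming a hypothetical binary mytool shipped alongside a wrapper mapper.sh (both names are illustrative, not from the original):

$ cat mapper.sh
#!/bin/sh
chmod +x ./mytool    # restore the execute bit lost during distribution
./mytool

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "sh mapper.sh" \
    -file mapper.sh \
    -file /path/to/mytool \
    -jobconf mapred.job.name="file-demo"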

Distributing files with -cacheFile

If files are already stored in HDFS (large, rarely updated files such as dictionaries; distributing these with -file would re-upload them from the local machine at every job launch and hurt startup efficiency) and you want each compute node to process them as local files, you can cache them on the compute nodes with the -cacheFile hdfs://host:port/path/to/file#linkname option. The streaming program then accesses the file through ./linkname.

The dictionary file is distributed from HDFS to the compute nodes and linked into the current working directory, so multiple tasks on the same machine share a single copy.
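For example, a minimal sketch, assuming a dictionary already uploaded to hdfs://host:port/user/test/dict.txt (the path, host, and link name are illustrative):

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "perl mapper.pl" \
    -reducer "perl reducer.pl" \
    -file mapper.pl \
    -file reducer.pl \
    -cacheFile hdfs://host:port/user/test/dict.txt#dict \
    -jobconf mapred.job.name="cache-file-demo"

Inside mapper.pl and reducer.pl, the dictionary is then readable as ./dict.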

Note: never write to a file distributed with -cacheFile; the file is shared between tasks, so writing to it invites write conflicts.

Distributing a compressed package with -cacheArchive

First, package and compress all files and directories under the local app directory, then upload the archive to HDFS as /user/test/app.tar.gz. Launch the streaming job with the -cacheArchive option to distribute app.tar.gz to the compute nodes, extract it into an app directory, and create a link to that directory in the current working directory. The -mapper option specifies app/mapper.pl as the mapper program and the -reducer option specifies app/reducer.pl as the reducer program; both can read the dictionary file inside the archive through app/dict/dict.txt. When packing locally, cd into the app directory (or use tar's -C app) rather than packing from app's parent directory; otherwise the extracted link would nest the directory and you would have to access the script as app/app/mapper.pl.

Hadoop supports archives in zip, jar, and tar.gz format. Because Java's zip decompression loses file permission information and fails on Chinese (non-ASCII) file names, tar.gz archives are recommended.

$ tar zcf app.tar.gz -C app .    # pack locally

$ $HADOOP_HOME/bin/hadoop fs -put app.tar.gz /user/test/app.tar.gz    # upload the archive to HDFS

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "perl app/mapper.pl" \
    -reducer "perl app/reducer.pl" \
    -cacheArchive hdfs://namenode:port/user/test/app.tar.gz#app \
    -jobconf mapred.job.name="cache-archive-demo"

Note: as with -cacheFile, do not cd into the extracted directory to create or modify files; multiple tasks modifying the same file at the same time can interfere with one another.

To summarize the three ways to distribute files: -file packs a local file on the client into the job package, uploads it to HDFS, and distributes it to the compute nodes; -cacheFile distributes an existing HDFS file to the compute nodes; -cacheArchive distributes an HDFS archive to the compute nodes and extracts it.
