"Hadoop" streaming file distribution and packaging


If an executable, script, or configuration file that a program needs at run time does not already exist on the compute nodes of the Hadoop cluster, you first need to distribute those files to the cluster for the job to run successfully. Hadoop provides a mechanism for automatically distributing files and archives: you simply set the appropriate options when starting the streaming job. The options are introduced and compared below.

More: http://hadoop.apache.org/mapreduce/docs/current/streaming.html

Distributing files with -file

We generally use -file to distribute the mapper and reducer programs, configuration files, and even dictionary files. The -file /path/to/filename option uploads the local file /path/to/filename to the cluster and distributes it to each compute node, where it lands in the task's default temporary working directory. Locally executable files other than the specified mapper or reducer program may lose their execute permission after distribution, so restore it with chmod +x ./filename inside a wrapper such as mapper.sh, and then set -mapper "mapper.sh".
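A minimal sketch of such a job, assuming a hypothetical binary mytool shipped alongside a wrapper mapper.sh (both names are illustrative, not from the original):

$ cat mapper.sh
#!/bin/sh
chmod +x ./mytool    # restore the execute bit lost during distribution
./mytool

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "sh mapper.sh" \
    -file mapper.sh \
    -file /path/to/mytool \
    -jobconf mapred.job.name="file-demo"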

Distributing files with -cacheFile

If files are already stored in HDFS (large, rarely updated files such as dictionaries; distributing these with -file would re-upload them from the local machine at every job launch and hurt startup efficiency) and you want each compute node to process them as local files, you can cache them on the compute nodes with the -cacheFile hdfs://host:port/path/to/file#linkname option. The streaming program then accesses the file through ./linkname.

The dictionary file is distributed from HDFS to the compute nodes and linked into the current working directory, so multiple tasks on the same machine share a single copy.
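For example, a minimal sketch, assuming a dictionary already uploaded to hdfs://host:port/user/test/dict.txt (the path, host, and link name are illustrative):

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "perl mapper.pl" \
    -reducer "perl reducer.pl" \
    -file mapper.pl \
    -file reducer.pl \
    -cacheFile hdfs://host:port/user/test/dict.txt#dict \
    -jobconf mapred.job.name="cache-file-demo"

Inside mapper.pl and reducer.pl, the dictionary is then readable as ./dict.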

Note: never write to a file distributed with -cacheFile; the file is shared between tasks, so writing to it invites write conflicts.

Distributing a compressed package with -cacheArchive

First, package and compress all files and directories under the local app directory, then upload the archive to HDFS as /user/test/app.tar.gz. Launch the streaming job with the -cacheArchive option to distribute app.tar.gz to the compute nodes, extract it into an app directory, and create a link to that directory in the current working directory. The -mapper option specifies app/mapper.pl as the mapper program and the -reducer option specifies app/reducer.pl as the reducer program; both can read the dictionary file inside the archive through app/dict/dict.txt. When packing locally, cd into the app directory (or use tar's -C app) rather than packing from app's parent directory; otherwise the extracted link would nest the directory and you would have to access the script as app/app/mapper.pl.

Hadoop supports archives in zip, jar, and tar.gz format. Because Java's zip decompression loses file permission information and fails on Chinese (non-ASCII) file names, tar.gz archives are recommended.

$ tar zcf app.tar.gz -C app .    # pack locally

$ $HADOOP_HOME/bin/hadoop fs -put app.tar.gz /user/test/app.tar.gz    # upload the archive to HDFS

$ $HADOOP_HOME/bin/hadoop streaming \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "perl app/mapper.pl" \
    -reducer "perl app/reducer.pl" \
    -cacheArchive hdfs://namenode:port/user/test/app.tar.gz#app \
    -jobconf mapred.job.name="cache-archive-demo"

Note: as with -cacheFile, do not cd into the extracted directory to create or modify files; multiple tasks modifying the same file at the same time can interfere with one another.

To summarize the three ways to distribute files: -file packs a local file on the client into the job package, uploads it to HDFS, and distributes it to the compute nodes; -cacheFile distributes an existing HDFS file to the compute nodes; -cacheArchive distributes an HDFS archive to the compute nodes and extracts it.
