Hadoop Streaming programming examples


Hadoop Streaming is a multi-language programming tool provided with Hadoop: it lets users write MapReduce programs in any language. This article presents several Hadoop Streaming programming examples, focusing on the following questions:

(1) What conventions a Mapper and a Reducer written in a given language must follow

(2) How to define a custom Hadoop counter in Hadoop Streaming

(3) How to report custom status information in Hadoop Streaming, giving the user feedback on the job's current progress

(4) How to print debug logs in Hadoop Streaming, and where to find those logs

(5) How to use Hadoop Streaming to process binary files rather than just text files

This article focuses on the first four questions, using WordCount examples written in C++ and Shell for reference.

1. C++ version of WordCount

(1) Mapper implementation (mapper.cpp)

#include <iostream>
#include <string>
using namespace std;

int main() {
  string key;
  while (cin >> key) {
    cout << key << "\t" << "1" << endl;
    // Define a counter named counter_no in group counter_group
    cerr << "reporter:counter:counter_group,counter_no,1\n";
    // Display status
    cerr << "reporter:status:processing......\n";
    // Print a log line for testing; it ends up in the task's stderr file
    cerr << "This is log, will be printed in stderr file\n";
  }
  return 0;
}

(2) Reducer implementation (reducer.cpp)

#include <iostream>
#include <string>
using namespace std;

int main() { // the reducer runs as a separate process, so it needs its own main function
  string cur_key, last_key, value;
  int n = 0;
  while (cin >> cur_key) { // read the map task output
    cin >> value;
    if (last_key != cur_key) { // a new key begins
      if (n > 0) // emit the previous key with its accumulated count
        cout << last_key << "\t" << n << endl;
      last_key = cur_key;
      n = 1;
    } else {
      n++; // same key: accumulate the count
    }
  }
  cout << last_key << "\t" << n << endl; // emit the last key
  return 0;
}

(3) Compile and run

Compile the above two programs:

g++ -o mapper mapper.cpp

g++ -o reducer reducer.cpp

Run a quick local test:

echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer

Note: this local test will repeatedly print the strings below to standard error. You can comment out those lines while testing; when the job actually runs under Hadoop, the framework recognizes and consumes them:

reporter:counter:counter_group,counter_no,1

reporter:status:processing......

This is log, will be printed in stderr file
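Alternatively, rather than commenting the lines out, standard error can simply be silenced during local testing. With the programs above, the pipeline

echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper 2>/dev/null | sort | ./reducer

should print roughly the following (note that "now," keeps its trailing comma, because the mapper splits on whitespace only):

dong	2
here	1
is	1
now	1
now,	1
talk	1
to	1
xicheng	2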

After the test passes, the job can be submitted to the cluster with the following script (run_cpp_mr.sh):

#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -files mapper,reducer \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper mapper \
  -reducer reducer
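Once the job completes, Streaming writes the results as part-* files under the output directory, which can be inspected directly, for example:

$HADOOP_HOME/bin/hadoop fs -cat /test/output/part-*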

2. Shell version of WordCount

(1) Mapper implementation (mapper.sh)

#!/bin/bash
while read LINE; do
  for word in $LINE
  do
    echo "$word 1"
    # In Streaming, a counter is defined by writing a line of the form
    #   reporter:counter:<group>,<counter>,<amount>
    # to stderr. Here we define a counter named counter_no in group
    # counter_group and increase it by 1.
    echo "reporter:counter:counter_group,counter_no,1" >&2
    # Report status
    echo "reporter:status:processing......" >&2
    # A plain log line for testing; it ends up in the task's stderr file
    echo "This is log for testing, will be printed in stderr file" >&2
  done
done

(2) Reducer implementation (reducer.sh)

#!/bin/bash
count=0
started=0
word=""
while read LINE; do
  newword=`echo $LINE | cut -d ' ' -f 1`
  if [ "$word" != "$newword" ]; then
    # emit the previous word and its count (printf so \t prints as a real tab)
    [ $started -ne 0 ] && printf "%s\t%s\n" "$word" "$count"
    word=$newword
    count=1
    started=1
  else
    count=$(($count + 1))
  fi
done
printf "%s\t%s\n" "$word" "$count"
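Because reduce input always arrives sorted by key, the script only has to compare adjacent words. A quick standalone check (a sketch; the input mimics already-sorted mapper output):

printf "dong 1\ndong 1\nhere 1\n" | sh reducer.sh

which should print dong and here with counts 2 and 1, tab-separated.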

(3) Test and run

Test the above two programs:

echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh

Note: as with the C++ version, this local test will repeatedly print the strings below to standard error. You can comment out those lines while testing; under Hadoop, the framework recognizes and consumes them:

reporter:counter:counter_group,counter_no,1

reporter:status:processing......

This is log for testing, will be printed in stderr file

After the test passes, the job can be submitted to the cluster with the following script (run_shell_mr.sh):

#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -files mapper.sh,reducer.sh \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper "sh mapper.sh" \
  -reducer "sh reducer.sh"

3. Program Description

In Hadoop Streaming, standard input, standard output, and standard error each play a special role. Standard input and standard output carry the input data and the processing results; what standard error means depends on its content:

(1) If a line on standard error has the form reporter:counter:group,counter,amount, Hadoop increases the counter named counter in group group by amount. The first time such a counter is reported, Hadoop creates it; on later reports it looks the counter up in its counter table and adds amount to its value.

(2) If a line on standard error has the form reporter:status:message, the message is displayed on the web interface or terminal as the task's current status, typically a progress hint.

(3) Anything else written to standard error is treated as a debug log, and Hadoop redirects it to the task's stderr file. Note: each Task has three log files, stdout, stderr and syslog, all plain text. You can view all three on the web interface, or log in to the node where the task ran and inspect the corresponding log directory.
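Since the run scripts above use a YARN client (/opt/yarn-client), the aggregated task logs can also be fetched from the command line after the job finishes. A minimal sketch, assuming log aggregation is enabled on the cluster; the application ID below is a placeholder:

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX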

Also note that the default key/value separator for Map Task output is \t: Hadoop splits each line into key and value at the \t in both the map and reduce phases, and sorts by key. You can, of course, use stream.map.output.field.separator to specify a different separator.
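For instance, following the separator options documented for Hadoop Streaming, the run script above could be adapted to split map output on '.' and treat the first four fields as the key. A sketch reusing the variables from run_cpp_mr.sh:

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -D stream.map.output.field.separator=. \
  -D stream.num.map.output.key.fields=4 \
  -files mapper,reducer \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper mapper \
  -reducer reducer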
