Hadoop Streaming programming examples


Hadoop Streaming is a multi-language programming tool provided with Hadoop: it lets users write MapReduce programs in any language. This article presents several Hadoop Streaming programming examples, focusing on the following questions:

(1) What conventions a Mapper and a Reducer written in a given language must follow

(2) How to define a custom Hadoop counter in Hadoop Streaming

(3) How to report custom status information in Hadoop Streaming, giving the user feedback on the job's current progress

(4) How to print debug logs in Hadoop Streaming, and where to find those logs

(5) How to use Hadoop Streaming to process binary files rather than just text files

This article focuses on the first four questions, using WordCount examples written in C++ and Shell for reference.

1. C++ version of WordCount

(1) Mapper implementation (mapper.cpp)

#include <iostream>
#include <string>
using namespace std;

int main() {
  string key;
  while (cin >> key) {
    cout << key << "\t" << "1" << endl;
    // Define a counter named counter_no in group counter_group
    cerr << "reporter:counter:counter_group,counter_no,1\n";
    // Display status
    cerr << "reporter:status:processing......\n";
    // Print a log line for testing; it ends up in the task's stderr file
    cerr << "This is log, will be printed in stderr file\n";
  }
  return 0;
}

(2) Reducer implementation (reducer.cpp)

#include <iostream>
#include <string>
using namespace std;

int main() { // the reducer runs as a separate process, so it needs its own main function
  string cur_key, last_key, value;
  int n = 0;
  while (cin >> cur_key) { // read the map task output
    cin >> value;
    if (last_key != cur_key) { // a new key begins
      if (n > 0) // emit the previous key with its accumulated count
        cout << last_key << "\t" << n << endl;
      last_key = cur_key;
      n = 1;
    } else {
      n++; // same key: accumulate the count
    }
  }
  cout << last_key << "\t" << n << endl; // emit the last key
  return 0;
}

(3) Compile and run

Compile the above two programs:

g++ -o mapper mapper.cpp

g++ -o reducer reducer.cpp

Run a quick local test:

echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer

Note: this local test will repeatedly print the strings below to standard error. You can comment out those lines while testing; when the job actually runs under Hadoop, the framework recognizes and consumes them:

reporter:counter:counter_group,counter_no,1

reporter:status:processing......

This is log, will be printed in stderr file
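Alternatively, rather than commenting the lines out, standard error can simply be silenced during local testing. With the programs above, the pipeline

echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper 2>/dev/null | sort | ./reducer

should print roughly the following (note that "now," keeps its trailing comma, because the mapper splits on whitespace only):

dong	2
here	1
is	1
now	1
now,	1
talk	1
to	1
xicheng	2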

After the test passes, the job can be submitted to the cluster with the following script (run_cpp_mr.sh):

#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -files mapper,reducer \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper mapper \
  -reducer reducer
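Once the job completes, Streaming writes the results as part-* files under the output directory, which can be inspected directly, for example:

$HADOOP_HOME/bin/hadoop fs -cat /test/output/part-*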

2. Shell version of WordCount

(1) Mapper implementation (mapper.sh)

#!/bin/bash
while read LINE; do
  for word in $LINE
  do
    echo "$word 1"
    # In Streaming, a counter is defined by writing a line of the form
    #   reporter:counter:<group>,<counter>,<amount>
    # to stderr. Here we define a counter named counter_no in group
    # counter_group and increase it by 1.
    echo "reporter:counter:counter_group,counter_no,1" >&2
    # Report status
    echo "reporter:status:processing......" >&2
    # A plain log line for testing; it ends up in the task's stderr file
    echo "This is log for testing, will be printed in stderr file" >&2
  done
done

(2) Reducer implementation (reducer.sh)

#!/bin/bash
count=0
started=0
word=""
while read LINE; do
  newword=`echo $LINE | cut -d ' ' -f 1`
  if [ "$word" != "$newword" ]; then
    # emit the previous word and its count (printf so \t prints as a real tab)
    [ $started -ne 0 ] && printf "%s\t%s\n" "$word" "$count"
    word=$newword
    count=1
    started=1
  else
    count=$(($count + 1))
  fi
done
printf "%s\t%s\n" "$word" "$count"
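Because reduce input always arrives sorted by key, the script only has to compare adjacent words. A quick standalone check (a sketch; the input mimics already-sorted mapper output):

printf "dong 1\ndong 1\nhere 1\n" | sh reducer.sh

which should print dong and here with counts 2 and 1, tab-separated.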

(3) Test and run

Test the above two programs:

echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh

Note: as with the C++ version, this local test will repeatedly print the strings below to standard error. You can comment out those lines while testing; under Hadoop, the framework recognizes and consumes them:

reporter:counter:counter_group,counter_no,1

reporter:status:processing......

This is log for testing, will be printed in stderr file

After the test passes, the job can be submitted to the cluster with the following script (run_shell_mr.sh):

#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -files mapper.sh,reducer.sh \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper "sh mapper.sh" \
  -reducer "sh reducer.sh"

3. Program Description

In Hadoop Streaming, standard input, standard output, and standard error each play a special role. Standard input and standard output carry the input data and the processing results; what standard error means depends on its content:

(1) If a line on standard error has the form reporter:counter:group,counter,amount, Hadoop increases the counter named counter in group group by amount. The first time such a counter is reported, Hadoop creates it; on later reports it looks the counter up in its counter table and adds amount to its value.

(2) If a line on standard error has the form reporter:status:message, the message is displayed on the web interface or terminal as the task's current status, typically a progress hint.

(3) Anything else written to standard error is treated as a debug log, and Hadoop redirects it to the task's stderr file. Note: each Task has three log files, stdout, stderr and syslog, all plain text. You can view all three on the web interface, or log in to the node where the task ran and inspect the corresponding log directory.
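Since the run scripts above use a YARN client (/opt/yarn-client), the aggregated task logs can also be fetched from the command line after the job finishes. A minimal sketch, assuming log aggregation is enabled on the cluster; the application ID below is a placeholder:

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX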

Also note that the default key/value separator for Map Task output is \t: Hadoop splits each line into key and value at the \t in both the map and reduce phases, and sorts by key. You can, of course, use stream.map.output.field.separator to specify a different separator.
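For instance, following the separator options documented for Hadoop Streaming, the run script above could be adapted to split map output on '.' and treat the first four fields as the key. A sketch reusing the variables from run_cpp_mr.sh:

${HADOOP_HOME}/bin/hadoop jar \
  ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
  -D stream.map.output.field.separator=. \
  -D stream.num.map.output.key.fields=4 \
  -files mapper,reducer \
  -input $INPUT_PATH \
  -output $OUTPUT_PATH \
  -mapper mapper \
  -reducer reducer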
