Hadoop Streaming is a multi-language programming tool provided by Hadoop: users can write MapReduce programs in any language that can read from standard input and write to standard output. This article walks through several Hadoop Streaming programming examples, focusing on the following questions:
(1) What programming conventions must a Mapper and a Reducer written in a given language follow?
(2) How do you define custom Hadoop Counters in Hadoop Streaming?
(3) How do you report custom status information, so the user gets feedback on the current job's execution progress?
(4) How do you print debug logs in Hadoop Streaming, and where can you view them?
(5) How do you use Hadoop Streaming to process binary files rather than just text files?
This article focuses on the first four questions, providing WordCount examples written in C++ and Shell for reference.
1. C++ version of WordCount
(1) Mapper implementation (mapper.cpp)
#include <iostream>
#include <string>
using namespace std;

int main() {
    string key;
    while (cin >> key) {
        cout << key << "\t" << "1" << endl;
        // define a counter named counter_no in group counter_group, increase it by 1
        cerr << "reporter:counter:counter_group,counter_no,1\n";
        // display status
        cerr << "reporter:status:processing......\n";
        // print a log line for testing; it ends up in the task's stderr file
        cerr << "This is log, will be printed in stderr file\n";
    }
    return 0;
}
(2) Reducer implementation (reducer.cpp)
#include <iostream>
#include <string>
using namespace std;

// The reducer runs as a separate process, so it also needs its own main function
int main() {
    string cur_key, last_key, value;
    if (!(cin >> last_key >> value)) // read the first key/value pair of the map output
        return 0;                    // empty input
    int n = 1;
    while (cin >> cur_key) {         // read the remaining map task output
        cin >> value;
        if (last_key != cur_key) {   // a new key begins: emit the finished one
            cout << last_key << "\t" << n << endl;
            last_key = cur_key;
            n = 1;
        } else {                     // same key: accumulate its count
            n++;
        }
    }
    cout << last_key << "\t" << n << endl; // emit the last key
    return 0;
}
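The reducer relies on its input arriving sorted by key: in the local test below the sort command provides this, and on the cluster the shuffle phase guarantees it. Once compiled (see the next step), the reducer can be sanity-checked in isolation by feeding it pre-sorted key/value pairs:

printf "dong\t1\ndong\t1\nhere\t1\n" | ./reducer

This should print dong with count 2 and here with count 1, each as a tab-separated line.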
(3) Compile and run
Compile the above two programs:
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
Run a quick local test:
echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer
Note: the test above will repeatedly print the following strings to the terminal. They are strings that Hadoop recognizes when the job runs on the cluster, so for local testing you may simply comment the corresponding lines out:
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log, will be printed in stderr file
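Alternatively, rather than commenting these lines out, you can simply discard standard error during the local test; this is only a local convenience and does not affect how the job runs on the cluster:

echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper 2>/dev/null | sort | ./reducer 2>/dev/null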
Once the test passes, the job can be submitted to the cluster with the following script (run_cpp_mr.sh):
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper,reducer \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper mapper \
    -reducer reducer
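Once the job completes, the result can be read straight from HDFS, using the paths defined in the script above:

/opt/yarn-client/bin/hadoop fs -cat /test/output/part-*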
2. Shell version of WordCount
(1) Mapper implementation (mapper.sh)
#!/bin/bash
while read LINE; do
    for word in $LINE; do
        # emit word and count, separated by a space (the reducer splits on space)
        echo "$word 1"
        # In Streaming, we define a counter by writing
        #   reporter:counter:<group>,<counter>,<amount>
        # to stderr. Here we define a counter named counter_no in group
        # counter_group and increase it by 1. Counters must be written to stderr.
        echo "reporter:counter:counter_group,counter_no,1" >&2
        echo "reporter:status:processing......" >&2
        echo "This is log for testing, will be printed in stderr file" >&2
    done
done
(2) Reducer implementation (reducer.sh)
#!/bin/bash
count=0
started=0
word=""
while read LINE; do
    # the first field of each line is the word
    newword=`echo $LINE | cut -d ' ' -f 1`
    if [ "$word" != "$newword" ]; then
        # a new word begins: emit the finished one (skipped on the very first line)
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$newword
        count=1
        started=1
    else
        # same word: accumulate its count
        count=$(($count + 1))
    fi
done
echo -e "$word\t$count"
(3) Test and run
Test the above two scripts:
echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh
Note: as with the C++ version, this test will repeatedly print the following strings to the terminal. They are strings that Hadoop recognizes when the job runs on the cluster, so for local testing you may comment the corresponding lines out:
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log for testing, will be printed in stderr file
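If you want to keep the reporter lines active on the cluster but silent in local tests, one possible guard (an assumption based on Hadoop Streaming exporting job configuration properties such as mapred.task.id into the task environment with dots replaced by underscores) is to check for that variable first:

# Sketch, not part of the original script: emit reporter lines only when
# running inside an actual Streaming task, where mapred_task_id is set.
if [ -n "$mapred_task_id" ]; then
    echo "reporter:counter:counter_group,counter_no,1" >&2
fi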
Once the test passes, the job can be submitted to the cluster with the following script (run_shell_mr.sh):
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.sh,reducer.sh \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper "sh mapper.sh" \
    -reducer "sh reducer.sh"
3. Program Description
In Hadoop Streaming, standard input, standard output, and standard error each play a special role. Standard input and standard output carry the input data and the processing results, while the meaning of standard error output depends on its content:
(1) If a line written to standard error has the form reporter:counter:<group>,<counter>,<amount>, it tells Hadoop to increase the counter named <counter> in group <group> by <amount>. The first time Hadoop reads such a line it creates the counter and adds it to the counter table; afterwards it looks the counter up and increases its value.
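Besides appearing alongside the built-in counters on the job's web interface, a custom counter can also be queried from the command line once the job finishes; a sketch, with a placeholder job id:

mapred job -counter job_1400000000000_0001 counter_group counter_no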
(2) If a line written to standard error has the form reporter:status:<message>, the message is displayed as the task's status on the web interface or terminal, typically as a progress or status hint.
(3) Anything else written to standard error is treated as a debug log, and Hadoop redirects it to the task's stderr file. Note: each Task has three log files, namely stdout, stderr, and syslog, all of which are plain text. You can view the contents of all three on the web interface, or log in to the node where the task ran and read them in the corresponding directory.
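On YARN (Hadoop 2.x), assuming log aggregation is enabled, all task logs of a finished job can also be pulled to the terminal with the yarn logs command; the application id below is a placeholder:

/opt/yarn-client/bin/yarn logs -applicationId application_1400000000000_0001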
In addition, it is important to note that the default key/value separator in the Map Task output is \t. Hadoop splits each line into key and value at the first \t in both the Map and Reduce phases, and sorts by key. Of course, you can use stream.map.output.field.separator to specify a different separator.
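As a sketch of how that would look on the command line, the submission script above could pass the separator via -D options (stream.map.output.field.separator sets the separator; stream.num.map.output.key.fields controls how many separated fields form the key); the mapper would of course have to emit ':'-separated lines to match:

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D stream.map.output.field.separator=: \
    -D stream.num.map.output.key.fields=1 \
    -files mapper.sh,reducer.sh \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper "sh mapper.sh" \
    -reducer "sh reducer.sh"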