Writing Hadoop MapReduce streaming with bash script
Hadoop lets you write MapReduce (MR) jobs in languages other than Java through Hadoop streaming. We can write the mapper and reducer in whatever language we like and run them as a MapReduce job.
By definition, Hadoop streaming only requires that a program read its input from standard input and write its output to standard output. One thing to remember: if you want to use your favorite language, such as Python, that language's runtime and the libraries it needs must be installed on the cluster beforehand. Here is an example using shell scripts.
The input is a text file, and the job computes the average length of the words that start with each character. You could add checks to the program to ignore some characters, or use fewer pipes and commands to improve performance.
1. Mapper script: word_length.sh
#!/bin/bash
# This mapper script reads one line at a time and breaks it into words.
# For each word, the starting letter and the length of the word are emitted.
while read -r line
do
  for word in $line
  do
    if [ -n "$word" ]
    then
      # wc -m also counts the newline added by echo, so subtract 1
      wcount=$(echo "$word" | wc -m)
      wlength=$(expr $wcount - 1)
      letter=$(echo "$word" | head -c1)
      echo -e "$letter\t$wlength"
    fi
  done
done
# The output of the mapper is the starting letter of each word and its length, separated by a tab.
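The mapper above starts several external processes (wc, expr, head) for every word. As a rough sketch of the "fewer pipes" idea mentioned earlier, the same key and value can be produced with shell parameter expansion alone. The function name map_words is hypothetical, and the sample input on the last line is only there to make the sketch runnable on its own.

```shell
#!/bin/bash
# Sketch only: map_words is a hypothetical name, not part of the original scripts.
# Disable filename globbing so a word such as "*" is not expanded to file names.
set -f

map_words() {
  # Read one line at a time; the "|| [ -n ... ]" part keeps a last line without a newline
  while read -r line || [ -n "$line" ]; do
    # Unquoted $line is split on whitespace into words
    for word in $line; do
      # ${word#?} is the word without its first character;
      # stripping that suffix from the word leaves only the first character
      first=${word%"${word#?}"}
      # ${#word} is the length of the word, no wc or expr needed
      printf '%s\t%s\n' "$first" "${#word}"
    done
  done
}

# Sample input for illustration; prints one "letter<TAB>length" line per word
printf 'hello big data\n' | map_words
```

To use this as a streaming mapper, the last line would become a bare map_words call so that it reads the standard input supplied by Hadoop.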
2. Reducer script: avg_word_length.sh
#!/bin/bash
# This reducer script takes the output of the mapper and emits the starting letter of each word and the average length.
# Remember that the framework sorts the output of the mappers by key.
# Note that the input to a streaming reducer is a series of (key, value) lines and not (key, list of values).
# This is unlike the input usually passed to a reducer written in Java.
lastkey=""
count=0
total=0
while read -r line
do
  newkey=$(echo "$line" | awk '{print $1}')
  value=$(echo "$line" | awk '{print $2}')
  # When the key changes, emit the average for the previous key and reset the counters
  if [ -n "$lastkey" ] && [ "$lastkey" != "$newkey" ]
  then
    average=$(echo "scale=5; $total / $count" | bc)
    echo -e "$lastkey\t$average"
    count=0
    total=0
  fi
  total=$(expr $total + $value)
  count=$(expr $count + 1)
  lastkey=$newkey
done
# Emit the average for the last key, which the loop itself never prints
if [ "$count" -gt 0 ]
then
  average=$(echo "scale=5; $total / $count" | bc)
  echo -e "$lastkey\t$average"
fi
# The output is key,value pairs (letter, average length of the words starting with this letter).
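The reducer above runs awk twice per input line and expr and bc on top of that. One possible alternative, sketched here under the assumption that awk is available on every node, keeps the same key-change logic inside a single awk process. The function name reduce_avg is hypothetical. Note that printf with %.5f rounds the fifth decimal, whereas bc with scale=5 truncates it, so the last digit can differ from the reducer above.

```shell
#!/bin/bash
# Sketch only: reduce_avg is a hypothetical name, not part of the original scripts.
# Expects "key<TAB>value" lines that are already sorted by key.
reduce_avg() {
  awk -F '\t' '
    # Key changed: print the average for the previous key and reset the counters
    NR > 1 && $1 != last {
      printf "%s\t%.5f\n", last, total / count
      total = 0
      count = 0
    }
    # Every line: remember the key and accumulate its value
    { last = $1; total += $2; count++ }
    # End of input: print the average for the final key
    END { if (count > 0) printf "%s\t%.5f\n", last, total / count }
  '
}

# Sample sorted input for illustration
printf 'a\t1\na\t3\nb\t4\n' | reduce_avg
```

As with the mapper, a real streaming reducer would end with a bare reduce_avg call instead of the sample input.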
3. Run command (this assumes the two scripts above were saved as mapper.sh and reducer.sh)
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar -input /input -output /avgwl -mapper mapper.sh -reducer reducer.sh -file /home/user/mr_streaming_bash/mapper.sh -file /home/user/mr_streaming_bash/reducer.sh
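Because streaming programs only talk to standard input and standard output, the whole job can be approximated on one machine with ordinary pipes before it is submitted to the cluster. The sketch below is an assumption about how one might do that, not part of Hadoop: run_local is a hypothetical helper, sort stands in for the framework's sort by key, and demo_mapper and demo_reducer are small awk stand-ins for the real scripts.

```shell
#!/bin/bash
# Sketch only: run_local, demo_mapper and demo_reducer are hypothetical names.

# run_local <mapper command> <reducer command>, with the input text on stdin.
run_local() {
  # sort -k1,1 groups equal keys together, imitating the sort Hadoop does
  # between the map and reduce phases
  "$1" | LC_ALL=C sort -k1,1 | "$2"
}

# Stand-in mapper: emits "first letter<TAB>length" for every word
demo_mapper() {
  awk '{ for (i = 1; i <= NF; i++) printf "%s\t%d\n", substr($i, 1, 1), length($i) }'
}

# Stand-in reducer: averages the values of each run of identical keys
demo_reducer() {
  awk -F '\t' '
    NR > 1 && $1 != last { printf "%s\t%.5f\n", last, total / count; total = 0; count = 0 }
    { last = $1; total += $2; count++ }
    END { if (count > 0) printf "%s\t%.5f\n", last, total / count }
  '
}

# Sample input for illustration
printf 'hello big data\nhadoop bash\n' | run_local demo_mapper demo_reducer
```

With the real scripts in the current directory, the equivalent check would be run_local ./mapper.sh ./reducer.sh with a sample file redirected to standard input. A local run like this exercises the logic only; it says nothing about how the job behaves when split across many mappers and reducers.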
You can also compare its performance with MapReduce jobs written in other languages.
Translated from: Hadoop MapReduce Streaming Using Bash Script
A Google engineer's GitHub project: MapReduce in Bash