Writing Hadoop MapReduce streaming with bash script

Source: Internet
Author: User

tags (space delimited): Hadoop mapreduce Bash

Hadoop lets you write MapReduce jobs in languages other than Java through Hadoop Streaming. You can write the mapper and reducer in whatever language you like and run them as a MapReduce job.

By design, any program works with Hadoop Streaming as long as it reads its input from standard input and writes its output to standard output. One thing to remember: if you want to use your favorite language, such as Python, the interpreter and the libraries it needs must be installed on the cluster beforehand. The example in this article uses shell scripts.
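As a minimal sketch of that stdin/stdout contract (the function name `toy_mapper` and the sample text are made up for illustration and are not part of the article's job), a mapper only has to read lines and print tab-separated key/value pairs:

```shell
#!/bin/bash
# Hypothetical toy mapper: prints "<word><TAB>1" for every word read from stdin.
toy_mapper() {
  while read -r line; do
    # Unquoted $line is split on whitespace into individual words.
    for word in $line; do
      printf '%s\t1\n' "$word"
    done
  done
}

# Hadoop Streaming would feed an input split on stdin; printf stands in for it here.
printf 'hello streaming world\n' | toy_mapper
```

Anything the function prints to standard output is what the framework would treat as mapper output.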

The input is a text file, and the job computes the average length of the words that start with each letter. You can add checks to the scripts to ignore certain characters, or use fewer pipes and commands to improve performance.
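The remark about using fewer pipes and commands could look something like the sketch below. It is an illustrative variant, not the article's script: the function name `emit_letter_and_length` is hypothetical, and it relies on the shell's own parameter expansion instead of starting extra processes for every word.

```shell
#!/bin/bash
# Hypothetical variant: emit "<first letter><TAB><length>" per word
# using only shell built-ins (no external commands per word).
emit_letter_and_length() {
  while read -r line; do
    for word in $line; do
      # ${word#?} drops the first character; removing that suffix from
      # $word leaves only the first character (${word:0:1} in bash).
      first=${word%"${word#?}"}
      # ${#word} expands to the length of the word.
      printf '%s\t%s\n' "$first" "${#word}"
    done
  done
}

printf 'hadoop streaming\n' | emit_letter_and_length
```

Whether this is noticeably faster on a real cluster depends on the data and is not measured here.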

  1. Mapper script: word_length.sh

    #!/bin/bash
    # This mapper script reads one line at a time and breaks it into words.
    # For each word, the starting letter and the length of the word are emitted.
    while read -r line
    do
      for word in $line
      do
        if [ -n "$word" ]
        then
          # wc -m also counts the newline added by echo, so subtract 1.
          wcount=$(echo "$word" | wc -m)
          wlength=$(expr $wcount - 1)
          letter=$(echo "$word" | head -c1)
          echo -e "$letter\t$wlength"
        fi
      done
    done
    # The output of the mapper is the starting letter of each word and its length, separated by a tab.

  2. Reducer script: avg_word_length.sh

    #!/bin/bash
    # This reducer script takes the output of the mapper and emits the starting letter of each word and the average length.
    # Remember that the framework sorts the output of the mappers by key.
    # Note that the input to the reducer is a sequence of (key, value) lines, not (key, <list of values>).
    # This is unlike the input usually passed to a reducer written in Java.
    lastkey=""
    count=0
    total=0
    iteration=1
    while read -r line
    do
      newkey=$(echo "$line" | awk '{print $1}')
      value=$(echo "$line" | awk '{print $2}')
      # On the first line, remember the key so that no average is emitted yet.
      if [ "$iteration" = "1" ]
      then
        lastkey=$newkey
        iteration=$(expr $iteration + 1)
      fi
      # When the key changes, emit the average for the previous key and reset the counters.
      if [ "$lastkey" != "$newkey" ]
      then
        average=$(echo "scale=5; $total / $count" | bc)
        echo -e "$lastkey\t$average"
        count=0
        total=0
      fi
      total=$(expr $total + $value)
      count=$(expr $count + 1)
      lastkey=$newkey
    done
    # The loop only emits a key when the next key arrives, so emit the last key here.
    if [ "$count" -gt 0 ]
    then
      average=$(echo "scale=5; $total / $count" | bc)
      echo -e "$lastkey\t$average"
    fi
    # The output is (key, value) pairs: (letter, average length of the words starting with that letter).

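  3. Run command

    The command below refers to the two scripts as mapper.sh and reducer.sh, stored under /home/user/mr_streaming_bash.

    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
      -input /input \
      -output /avgwl \
      -mapper mapper.sh \
      -reducer reducer.sh \
      -file /home/user/mr_streaming_bash/mapper.sh \
      -file /home/user/mr_streaming_bash/reducer.sh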
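Before submitting the job, the map, sort, and reduce flow can be approximated locally with an ordinary pipe. The sketch below is only a rough stand-in for the cluster: the `mapper` and `reducer` functions are simplified placeholders written for this example (the reducer uses awk rather than the article's loop), the sample text is made up, and `sort` plays the role of the framework's sort between the two phases.

```shell
#!/bin/bash
# Placeholder for mapper.sh: emits "<first letter><TAB><word length>".
mapper() {
  while read -r line; do
    for word in $line; do
      printf '%s\t%s\n' "${word%"${word#?}"}" "${#word}"
    done
  done
}

# Placeholder for reducer.sh: expects input sorted by key and prints
# "<letter><TAB><average length>" when the key changes and at end of input.
reducer() {
  awk -F '\t' '
    NR > 1 && $1 != last { printf "%s\t%.5f\n", last, total / count; total = 0; count = 0 }
    { last = $1; total += $2; count++ }
    END { if (count > 0) printf "%s\t%.5f\n", last, total / count }
  '
}

# sort stands in for the sort step that runs between map and reduce.
printf 'big data big deal\nbash\n' | mapper | LC_ALL=C sort | reducer
```

For this sample input the words starting with b have lengths 3, 3, and 4, so the pipeline prints b with 3.33333 and d with 4.00000. With the real scripts, the two function calls would be replaced by ./mapper.sh and ./reducer.sh.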

You can also compare its performance with the same MapReduce job written in other languages.

Translated from: Hadoop MapReduce Streaming Using Bash Script

Google Daniel's GitHub Project
MapReduce in Bash
