Introduction to Hadoop Streaming


Hadoop is implemented in Java, but we can also write MapReduce programs in other languages, such as Shell, Python, and Ruby. The following describes Hadoop Streaming and uses Python as an example.

1. Hadoop Streaming

The usage of Hadoop Streaming is as follows:

hadoop jar hadoop-streaming.jar -D property=value -mapper mapper.py -combiner combiner.py -reducer reducer.py -input Input -output Output -file mapper.py -file reducer.py

-mapper, -combiner, and -reducer specify the Map, Combine, and Reduce programs.

-input and -output specify the input and output directories.

-file uploads the local mapper.py and reducer.py to every compute node (TaskTracker node).

The -D option sets configuration properties; the most frequently used are listed below, and a concrete invocation is sketched after the table:

Property                          Value                                      Remarks
mapred.reduce.tasks               int                                        Number of reduce tasks; 0 means no reduce phase
mapred.job.queue.name             string (queue name)                        Queue to which the job is submitted
mapred.output.compress            boolean                                    Whether to compress the MapReduce output
mapred.output.compression.codec   org.apache.hadoop.io.compress.GzipCodec   Output compression codec; gzip is used here
stream.recordreader.compression   gzip                                       Read compressed input; gzip can read .gz and .bz2 files
mapred.job.reduce.memory.mb       int                                        Physical memory for a reduce task, in MB
mapred.child.java.opts            "-Xmx" + int + "m"                         Maximum Java heap size

Hadoop Streaming works as follows: the mapper and reducer read input line by line (as strings) from standard input, process each line with the mapper or reducer logic, and write the results to standard output as key-value pairs separated by a tab (\t).

2. Python Hadoop Streaming version of WordCount:

mapper.py

 1 import sys
 2
 3 # input comes from STDIN (standard input)
 4 for line in sys.stdin:
 5     # remove leading and trailing whitespace
 6     line = line.strip()
 7     # split the line into words
 8     words = line.split()
 9     # increase counters
10     for word in words:
11         # write the results to STDOUT (standard output);
12         # what we output here will be the input for the
13         # Reduce step, i.e. the input for reducer.py
14         #
15         # tab-delimited; the trivial word count is 1
16         print('%s\t%s' % (word, 1))

mapper.py works very simply: it reads a line from standard input (line 4), strips the trailing line break and surrounding whitespace (line 6), splits the line into words (line 8), and outputs a key-value pair (word, 1) for each word (lines 10-16).

reducer.py

 1 import sys
 2
 3 current_word = None
 4 current_count = 0
 5 word = None
 6
 7 # input comes from STDIN
 8 for line in sys.stdin:
 9     # remove leading and trailing whitespace
10     line = line.strip()
11
12     # parse the input we got from mapper.py
13     word, count = line.split('\t', 1)
14
15     # convert count (currently a string) to int
16     try:
17         count = int(count)
18     except ValueError:
19         # count was not a number, so silently
20         # ignore/discard this line
21         continue
22
23     # this IF-switch only works because Hadoop sorts map output
24     # by key (here: word) before it is passed to the reducer
25     if current_word == word:
26         current_count += count
27     else:
28         if current_word:
29             # write result to STDOUT
30             print('%s\t%s' % (current_word, current_count))
31         current_count = count
32         current_word = word
33
34 # do not forget to output the last word if needed!
35 if current_word == word:
36     print('%s\t%s' % (current_word, current_count))

reducer.py is slightly more complex. It first strips whitespace (line 10), splits each line into a word and a count (line 13), and checks the count's type (lines 16-21). It then uses current_word to track the word currently being processed, which is the actual reduce logic: if the new word equals current_word, the counts are accumulated (lines 25-26); otherwise the accumulation for current_word is finished and its result is written to standard output (lines 27-32). The check on line 28 is needed because current_word is initially None, and we do not want to output a key-value pair with None as the key. Finally, the last key-value pair is output (lines 35-36).
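
As an aside (not part of the original article), the same reduce logic can be written more idiomatically with itertools.groupby, which removes the manual current_word bookkeeping. A minimal sketch, assuming well-formed tab-separated input:

import sys
from itertools import groupby

def parse(stdin):
    # yield (word, count) pairs from tab-separated lines
    for line in stdin:
        word, _, count = line.strip().partition('\t')
        yield word, int(count)

# groupby relies on the same guarantee as the IF-switch above:
# Hadoop sorts the map output by key before the reducer sees it
for word, pairs in groupby(parse(sys.stdin), key=lambda pair: pair[0]):
    total = sum(count for _, count in pairs)
    print('%s\t%s' % (word, total))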

3. You can test the correctness of the Python Hadoop Streaming scripts locally:

cat file | python mapper.py | sort -k1,1 | python reducer.py

Here, sort -k1,1 simulates Hadoop's shuffle phase. Since this runs locally on a single machine, the test resembles Hadoop's standalone mode and still differs from a truly distributed Hadoop run.
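
For a quick end-to-end check without leaving Python, the same pipeline can be simulated with subprocess. A minimal sketch; the sample input and expected output are illustrative assumptions, and mapper.py and reducer.py are the scripts above:

import subprocess

sample = b"the quick brown fox jumps over the lazy dog the\n"

# run the mapper on the sample input
mapped = subprocess.run(["python", "mapper.py"], input=sample,
                        capture_output=True, check=True)
# sorting the mapper output simulates the shuffle, like sort -k1,1
shuffled = b"".join(sorted(mapped.stdout.splitlines(keepends=True)))
# run the reducer on the sorted key-value pairs
reduced = subprocess.run(["python", "reducer.py"], input=shuffled,
                         capture_output=True, check=True)
print(reduced.stdout.decode())   # expect a line like: the\t3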

