Hadoop is implemented in Java, but MapReduce programs can also be written in other languages, such as shell, Python, and Ruby. The following introduces Hadoop Streaming, using Python as the example.
1. Hadoop Streaming
The usage of Hadoop Streaming is as follows:
1 hadoop jar hadoop-streaming.jar -D property=value -mapper mapper.py -combiner combiner.py -reducer reducer.py -input Input -output Output -file mapper.py -file reducer.py
-mapper, -combiner, and -reducer specify the Map, Combine, and Reduce programs.
-input and -output specify the input and output directories.
-file uploads the local mapper.py and reducer.py to all compute nodes (the TaskTracker nodes).
The -D option can set many job properties; the most frequently used are listed below:
Property | Value | Remarks
---------|-------|--------
mapred.reduce.tasks | int | Number of reduce tasks; if 0, there is no reduce phase
mapred.job.queue.name | string (queue name) | The queue the job is submitted to
mapred.output.compress | boolean | Whether to compress the MapReduce output
mapred.output.compression.codec | org.apache.hadoop.io.compress.GzipCodec | The output compression codec; gzip is used here
stream.recordreader.compression | gzip | Whether to read compressed input; gzip can read .gz and .bz2 files
mapred.job.reduce.memory.mb | int | Physical memory for each reduce task (MB)
mapred.child.java.opts | "-Xmx" + int + "m" | Maximum Java heap size
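For example, a job that runs two reduce tasks and gzip-compresses its output could be submitted as follows (a sketch based on the command template above; the Input and Output folder names are placeholders):

hadoop jar hadoop-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -mapper mapper.py -reducer reducer.py \
    -input Input -output Output \
    -file mapper.py -file reducer.py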
The working principle of Hadoop Streaming is as follows: the Mapper and the Reducer read data line by line (as strings) from standard input, process it with the Mapper or Reducer logic, and write the results to standard output as key-value pairs separated by a tab (\t).
2. Python Hadoop Streaming version of WordCount:
mapper.py
 1 import sys
 2
 3 # input comes from STDIN (standard input)
 4 for line in sys.stdin:
 5     # remove leading and trailing whitespace
 6     line = line.strip()
 7     # split the line into words
 8     words = line.split()
 9     # increase counters
10     for word in words:
11         # write the results to STDOUT (standard output);
12         # what we output here will be the input for the
13         # Reduce step, i.e. the input for reducer.py
14         #
15         # tab-delimited; the trivial word count is 1
16         print '%s\t%s' % (word, 1)
The working principle of mapper.py is very simple: it reads a line from standard input (line 4), strips the surrounding whitespace (line 6), splits the line into words (line 8), and outputs a key-value pair (word, 1) for each word (lines 10-16).
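You can sanity-check mapper.py on its own before involving Hadoop at all. For example (the sample sentence is made up):

echo "foo foo quux labs foo bar quux" | python mapper.py

This should print one tab-separated (word, 1) pair per input word: "foo\t1" on each of the first two lines, then "quux\t1", and so on.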
reducer.py
 1 import sys
 2
 3 current_word = None
 4 current_count = 0
 5 word = None
 6
 7 # input comes from STDIN
 8 for line in sys.stdin:
 9     # remove leading and trailing whitespace
10     line = line.strip()
11
12     # parse the input we got from mapper.py
13     word, count = line.split('\t', 1)
14
15     # convert count (currently a string) to int
16     try:
17         count = int(count)
18     except ValueError:
19         # count was not a number, so silently
20         # ignore/discard this line
21         continue
22
23     # this IF-switch only works because Hadoop sorts map output
24     # by key (here: word) before it is passed to the reducer
25     if current_word == word:
26         current_count += count
27     else:
28         if current_word:
29             # write result to STDOUT
30             print '%s\t%s' % (current_word, current_count)
31         current_count = count
32         current_word = word
33
34 # do not forget to output the last word if needed!
35 if current_word == word:
36     print '%s\t%s' % (current_word, current_count)
reducer.py is slightly more complex. It first strips the surrounding whitespace (line 10), splits the line (line 13), and type-checks the count (lines 16-21). It then uses current_word to track the word currently being processed (this is the reduce logic): if the new word equals current_word, its count is accumulated (lines 25-26); otherwise the accumulation for current_word is finished and its result is written to standard output (lines 27-32). The check at line 28 is needed because current_word is initially None, and we do not want to output a key-value pair with None as the key. Finally, the last key-value pair is output (lines 35-36).
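Reference [3] also gives a more compact reducer built on itertools.groupby; the sketch below follows that idea (read_mapper_output is an illustrative helper, not part of the original script):

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(stream, separator='\t'):
    # yield (word, count) pairs from the tab-delimited mapper output
    for line in stream:
        yield line.rstrip().split(separator, 1)

# groupby relies on the same guarantee as the if-switch above:
# Hadoop sorts the map output by key before the reducer sees it
data = read_mapper_output(sys.stdin)
for current_word, group in groupby(data, itemgetter(0)):
    try:
        total_count = sum(int(count) for current_word, count in group)
        print '%s\t%d' % (current_word, total_count)
    except ValueError:
        # count was not a number, so silently discard this group
        pass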
3. You can test the correctness of a Python Hadoop Streaming script locally:
1 cat file | python mapper.py | sort -k1,1 | python reducer.py
Here, sort -k1,1 simulates Hadoop's shuffle phase. But because it runs locally on a single machine, this test is closer to a standalone Hadoop deployment and still differs from truly distributed Hadoop.
References:
[1]. Hadoop Streaming
[2]. Dong's blog: Hadoop Streaming Programming
[3]. Writing a Hadoop MapReduce Program in Python