Hadoop is implemented in Java, but MapReduce programs can also be written in other languages, such as shell, Python, and Ruby. The following introduces Hadoop Streaming, using Python as the example.
1. Hadoop Streaming
The usage of Hadoop Streaming is as follows:
1 hadoop jar hadoop-streaming.jar -D property=value -mapper mapper.py -combiner combiner.py -reducer reducer.py -input Input -output Output -file mapper.py -file reducer.py
-mapper, -combiner, and -reducer specify the Map, Combine, and Reduce programs.
-input and -output specify the input and output directories.
-file uploads the local mapper.py and reducer.py to all compute nodes (the TaskTracker nodes).
The -D option can set many job properties; the most frequently used are listed below:
Property | Value | Remarks
---------|-------|--------
mapred.reduce.tasks | int | Number of reduce tasks; if 0, there is no reduce phase
mapred.job.queue.name | string (queue name) | The queue the job is submitted to
mapred.output.compress | boolean | Whether to compress the MapReduce output
mapred.output.compression.codec | org.apache.hadoop.io.compress.GzipCodec | The output compression codec; gzip is used here
stream.recordreader.compression | gzip | Whether to read compressed input; gzip can read .gz and .bz2 files
mapred.job.reduce.memory.mb | int | Physical memory for each reduce task (MB)
mapred.child.java.opts | "-Xmx" + int + "m" | Maximum Java heap size
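For example, a job that runs two reduce tasks and gzip-compresses its output could be submitted as follows (a sketch based on the command template above; the Input and Output folder names are placeholders):

hadoop jar hadoop-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -mapper mapper.py -reducer reducer.py \
    -input Input -output Output \
    -file mapper.py -file reducer.py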
The working principle of Hadoop Streaming is as follows: the Mapper and the Reducer read data line by line (as strings) from standard input, process it with the Mapper or Reducer logic, and write the results to standard output as key-value pairs separated by a tab (\t).
2. Python Hadoop Streaming version of WordCount:
mapper.py
 1 import sys
 2
 3 # input comes from STDIN (standard input)
 4 for line in sys.stdin:
 5     # remove leading and trailing whitespace
 6     line = line.strip()
 7     # split the line into words
 8     words = line.split()
 9     # increase counters
10     for word in words:
11         # write the results to STDOUT (standard output);
12         # what we output here will be the input for the
13         # Reduce step, i.e. the input for reducer.py
14         #
15         # tab-delimited; the trivial word count is 1
16         print '%s\t%s' % (word, 1)
The working principle of mapper.py is very simple: it reads a line from standard input (line 4), strips the surrounding whitespace (line 6), splits the line into words (line 8), and outputs a key-value pair (word, 1) for each word (lines 10-16).
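You can sanity-check mapper.py on its own before involving Hadoop at all. For example (the sample sentence is made up):

echo "foo foo quux labs foo bar quux" | python mapper.py

This should print one tab-separated (word, 1) pair per input word: "foo\t1" on each of the first two lines, then "quux\t1", and so on.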
reducer.py
 1 import sys
 2
 3 current_word = None
 4 current_count = 0
 5 word = None
 6
 7 # input comes from STDIN
 8 for line in sys.stdin:
 9     # remove leading and trailing whitespace
10     line = line.strip()
11
12     # parse the input we got from mapper.py
13     word, count = line.split('\t', 1)
14
15     # convert count (currently a string) to int
16     try:
17         count = int(count)
18     except ValueError:
19         # count was not a number, so silently
20         # ignore/discard this line
21         continue
22
23     # this IF-switch only works because Hadoop sorts map output
24     # by key (here: word) before it is passed to the reducer
25     if current_word == word:
26         current_count += count
27     else:
28         if current_word:
29             # write result to STDOUT
30             print '%s\t%s' % (current_word, current_count)
31         current_count = count
32         current_word = word
33
34 # do not forget to output the last word if needed!
35 if current_word == word:
36     print '%s\t%s' % (current_word, current_count)
reducer.py is slightly more complex. It first strips the surrounding whitespace (line 10), splits the line (line 13), and type-checks the count (lines 16-21). It then uses current_word to track the word currently being processed (this is the reduce logic): if the new word equals current_word, its count is accumulated (lines 25-26); otherwise the accumulation for current_word is finished and its result is written to standard output (lines 27-32). The check at line 28 is needed because current_word is initially None, and we do not want to output a key-value pair with None as the key. Finally, the last key-value pair is output (lines 35-36).
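Reference [3] also gives a more compact reducer built on itertools.groupby; the sketch below follows that idea (read_mapper_output is an illustrative helper, not part of the original script):

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(stream, separator='\t'):
    # yield (word, count) pairs from the tab-delimited mapper output
    for line in stream:
        yield line.rstrip().split(separator, 1)

# groupby relies on the same guarantee as the if-switch above:
# Hadoop sorts the map output by key before the reducer sees it
data = read_mapper_output(sys.stdin)
for current_word, group in groupby(data, itemgetter(0)):
    try:
        total_count = sum(int(count) for current_word, count in group)
        print '%s\t%d' % (current_word, total_count)
    except ValueError:
        # count was not a number, so silently discard this group
        pass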
3. You can test the correctness of a Python Hadoop Streaming script locally:
1 cat file | python mapper.py | sort -k1,1 | python reducer.py
Here, sort -k1,1 simulates Hadoop's shuffle phase. But because it runs locally on a single machine, this test is closer to a standalone Hadoop deployment and still differs from truly distributed Hadoop.
References:
[1]. Hadoop Streaming
[2]. Dong's blog: Hadoop Streaming Programming
[3]. Writing a Hadoop MapReduce Program in Python