Hadoop provides a MapReduce API that allows you to write map and reduce functions in languages other than Java: Hadoop Streaming uses standard streams as the interface for data transmission between Hadoop and your application. You can therefore write the map and reduce functions in any language, as long as it can read data from the standard input stream (stdin) and write output data to the standard output stream (stdout).
Streaming is naturally suited to processing text data. Map input data is passed to the map function through stdin. The map function processes the input line by line and writes the resulting key-value pairs, separated by a tab character, to stdout. The reduce function receives the map output from stdin (the Hadoop framework guarantees that the map output is sorted by key) and writes the processed reduce output to stdout.
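To make this contract concrete, here is a small Python sketch (the names mapper and reducer are hypothetical helpers, not part of Hadoop) that simulates what Streaming does: the mapper emits tab-separated key-value lines, the framework sorts them by key, and the reducer folds each run of identical keys into one output line:

```python
def mapper(lines):
    """Emit one tab-separated "key\tvalue" line per non-empty input line."""
    for line in lines:
        word = line.strip()
        if word:
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the values for each run of identical keys in sorted input."""
    last_key, total = None, 0
    for line in sorted_lines:
        key, val = line.split("\t")
        if last_key is not None and key != last_key:
            yield f"{last_key}\t{total}"
            total = 0
        last_key, total = key, total + int(val)
    if last_key is not None:
        yield f"{last_key}\t{total}"

# Simulate the framework: map, sort by key, then reduce.
data = ["b", "a", "b"]
result = list(reducer(sorted(mapper(data))))  # ["a\t1", "b\t2"]
```

The sort step between the two functions is exactly what Hadoop's shuffle phase provides; the scripts themselves never need to buffer more than one key group.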
Using Ruby as an Example
Take the MaxTemperature program from Study Notes (1) as an example. The Ruby code for the map function is as follows (even if, like me, you have never touched Ruby, you should be able to read it):
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program executes this block for each line of data read from stdin. The block extracts the relevant fields (year, temperature, and quality code) from each line, and if the temperature value is valid, writes the year and temperature to stdout, separated by a tab character.
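The field extraction is just fixed-width slicing of each record. The following sketch builds a synthetic 93-character record (a hypothetical stand-in for a real NCDC line) and applies the same slices in Python syntax (Ruby's val[15,4] is start-plus-length, equivalent to Python's val[15:19]):

```python
import re

# A hypothetical stand-in for one NCDC record: 93 characters, with the
# year at columns 15-18, the signed temperature at 87-91, and the
# quality code at column 92.
record = [" "] * 93
record[15:19] = "1950"
record[87:92] = "+0022"
record[92] = "1"
line = "".join(record)

# The same slices the map function uses.
year, temp, q = line[15:19], line[87:92], line[92:93]
# The validity check: temperature is not the missing-value sentinel
# and the quality code is one of the accepted digits.
valid = temp != "+9999" and re.match("[01459]", q) is not None
```

With these field values, year is "1950", temp is "+0022", and the record passes the validity check.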
Because the script only uses standard input and output, the simplest way to test it is with a Unix pipe, without involving Hadoop at all:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950	+0000
1950	+0022
1950	-0011
1949	+0111
1949	+0078
The corresponding reduce function in Ruby is as follows:
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
Again, this code iterates over each line read from stdin, but now we must keep some state for the key group currently being processed. In this example, the key is the year: we store last_key, the most recently processed year, and max_val, the maximum temperature seen so far for that key group. The MapReduce framework guarantees that keys arrive sorted, so when we encounter a different key, the previous key group is complete and max_val is its maximum temperature. Comparing this with the Java implementation, the Java API does the grouping for you automatically, whereas in Ruby you have to detect the key group boundaries yourself.
For each line, we extract the year and temperature values. If we have just finished a key group (last_key && last_key != key), we write last_key and max_val (again separated by a tab) to stdout and then reset last_key and max_val to the new values. If the current key group is not yet finished, we simply update max_val for the current last_key. Now we can again use a Unix pipeline to simulate the whole MapReduce process:
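The manual boundary detection above can also be factored out. As a sketch (not the book's code), Python's itertools.groupby expresses the same "consecutive lines with equal keys form one group" idea directly:

```python
import itertools

def max_temperature(sorted_lines):
    """Reduce sorted "year\ttemp" lines to one "year\tmax" line per year."""
    pairs = (line.split("\t") for line in sorted_lines)
    # groupby only merges *adjacent* equal keys, which is exactly the
    # guarantee the MapReduce framework gives us for reducer input.
    for year, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{year}\t{max(int(temp) for _, temp in group)}"

lines = ["1949\t111", "1949\t78", "1950\t0", "1950\t22", "1950\t-11"]
result = list(max_temperature(lines))  # ["1949\t111", "1950\t22"]
```

Note that groupby only works here because the input is sorted; on unsorted input it would emit one group per run, which is the same pitfall the hand-written last_key version avoids.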
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949	111
1950	22
As you can see, the output is the same as that of the Java version. Now let's run it with Hadoop. Because the hadoop command does not have a streaming option, you must use the jar option and point it at the Streaming JAR file, as follows:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb
If you are running the program on a large cluster, you can use the -combiner option to declare a combiner (for more on combiners, see Study Notes (3)). Earlier versions required the combiner to be implemented in Java, but in versions after 1.x the combiner can be any Streaming command, as follows:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/src/main/ruby/max_temperature_map.rb | sort | ch02/src/main/ruby/max_temperature_reduce.rb" \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb \
  -file ch02/src/main/ruby/max_temperature_map.rb \
  -file ch02/src/main/ruby/max_temperature_reduce.rb
In this example, the input is specified as a directory (-input input/ncdc/all) containing the gzip-compressed data files. The -file option tells Hadoop which files to copy out to the cluster (this is not needed when running in standalone/local mode). Note that the -mapper option in this example specifies both the mapper and the combiner (max_temperature_reduce.rb) as a single pipeline. Do you remember from Study Notes (3) that the Java combiner and reducer were implemented by the same class?
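Reusing the reducer as a combiner is only safe because taking a maximum is commutative and associative: computing local maxima per mapper and then a global maximum gives the same answer as one global pass. A small sketch with made-up mapper outputs:

```python
# Hypothetical outputs of two mappers for the same key ("1950").
mapper_outputs = [
    [("1950", 0), ("1950", 20)],   # mapper 1
    [("1950", 25), ("1950", 15)],  # mapper 2
]

# With a combiner: each mapper first reduces its own output locally,
# and the reducer then takes the maximum of the local maxima.
local_maxima = [max(t for _, t in part) for part in mapper_outputs]
with_combiner = max(local_maxima)

# Without a combiner: the reducer sees every raw value.
without_combiner = max(t for part in mapper_outputs for _, t in part)

# Both paths agree, which is what makes the reducer a valid combiner here.
```

This would not hold for an operation like computing a mean, where a reducer cannot simply be reused as a combiner.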
Using Python as an Example
If you are more familiar with Python, here are the map and reduce functions written in Python:
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
#!/usr/bin/env python
import sys
(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
    print "%s\t%s" % (last_key, max_val)
We can test the scripts with a Unix pipeline in the same way as before:
% cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
  sort | ch02/src/main/python/max_temperature_reduce.py
1949	111
1950	22
When reprinting, please indicate the source: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2807759.html