Hadoop Learning Notes (4): Streaming in Hadoop


Hadoop provides an API for MapReduce that allows you to write your map and reduce functions in languages other than Java: Hadoop Streaming uses standard streams as the interface for passing data between Hadoop and your application. You can therefore write the map and reduce functions in any language, as long as it can read data from standard input (stdin) and write its output to standard output (stdout).

Streaming is naturally suited to processing text data. The map input data is passed to the map function over stdin; the map function processes the input line by line and writes tab-separated key-value pairs to stdout. The reduce function receives the map output from stdin (the Hadoop framework guarantees that this output is sorted by key) and writes its own processed output to stdout.
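The contract described above can be sketched in a few lines. The following is an illustrative toy (a word-count mapper and reducer, not the book's example) showing the two halves of the protocol: the mapper emits "key\tvalue" lines, and the reducer aggregates consecutive lines sharing a key, relying on the input being key-sorted:

```python
# Minimal sketch of the Streaming contract (illustrative toy, not from the book).
# mapper: raw input lines -> "key\tvalue" lines on stdout.
# reducer: key-sorted "key\tvalue" lines -> one aggregated line per key.

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)

def reducer(sorted_lines):
    """Sum values of consecutive lines that share a key (input must be sorted)."""
    last_key, total = None, 0
    for line in sorted_lines:
        key, val = line.rstrip("\n").split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%d" % (last_key, total)
            total = 0
        last_key, total = key, total + int(val)
    if last_key is not None:
        yield "%s\t%d" % (last_key, total)

if __name__ == "__main__":
    # Simulate: cat input | mapper | sort | reducer
    mapped = sorted(mapper(["b a", "a"]))
    print("\n".join(reducer(mapped)))
```

The `sorted()` call stands in for the framework's shuffle-and-sort phase; in a real Streaming job, Hadoop performs that step between the two scripts.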

Using Ruby as an Example

Take the max temperature program from Study Notes (1) as an example. The Ruby code for the map function is as follows (even if, like me, you have never touched Ruby, you can still read this code):

```ruby
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
```

This program executes the code block for each line of data read from stdin. The block extracts the relevant fields (year, temperature, and quality code) from each line and, if the temperature value is valid, writes the year and temperature to stdout, separated by a tab character.
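Note that Ruby's `val[start, length]` slice takes a length, not an end offset. A quick check with a synthetic fixed-width record (the offsets come from the code above; the record contents here are made up) shows the equivalent Python slices, which reappear later in this article:

```python
# Build a synthetic 93-character record; only the offsets used by the
# mapper matter (15-18 = year, 87-91 = signed temperature, 92 = quality).
record = list("x" * 93)
record[15:19] = "1950"
record[87:92] = "+0022"
record[92] = "1"
line = "".join(record)

year = line[15:19]   # Ruby: val[15,4] -> 4 chars starting at offset 15
temp = line[87:92]   # Ruby: val[87,5]
q = line[92:93]      # Ruby: val[92,1]
assert (year, temp, q) == ("1950", "+0022", "1")
print("%s\t%s" % (year, temp))  # what the mapper would emit for this record
```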

Because this script operates only on standard input and output, the simplest way to test it is with a Unix pipe, rather than with Hadoop:

```
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950	+0000
1950	+0022
1950	-0011
1949	+0111
1949	+0078
```

The corresponding reduce function in Ruby looks like this:

```ruby
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
```

As before, this code iterates over each line read from stdin, but now we must keep some state for the key group currently being processed. In this example the keys are the years: we store last_key, the key of the last line processed, and max_val, the maximum temperature seen so far for that key group. The MapReduce framework guarantees that the keys arrive sorted, so when we encounter a different key we know the previous key group is finished and that max_val holds its maximum temperature. Comparing this with the Java implementation, notice that the Java API does this grouping for you automatically, whereas in Ruby you have to detect the key-group boundaries yourself.
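The boundary-detection bookkeeping that the reducer does by hand is exactly what the Java API's grouping provides for free. In Python (used later in this article), the same per-key grouping over sorted input can be sketched with `itertools.groupby` (illustrative values; not part of Streaming itself):

```python
import itertools

# Key-sorted "key\tvalue" lines, as the framework would feed the reducer.
sorted_lines = ["1949\t78", "1949\t111", "1950\t0", "1950\t22"]

pairs = (line.split("\t") for line in sorted_lines)
for year, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
    # groupby hands us one complete key group at a time,
    # so no last_key / max_val bookkeeping is needed.
    print("%s\t%d" % (year, max(int(val) for _, val in group)))
```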

For each line, we extract the year and temperature values. If we have just finished processing a key group (last_key && last_key != key), we write last_key and max_val to stdout (again tab-separated) and then reset last_key and max_val to the new values. If the current key group is not yet finished, we simply update max_val for the current last_key. We can again use a Unix pipeline to simulate the whole MapReduce flow:

```
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949	111
1950	22
```

As you can see, this output is identical to that of the Java version. Now let's run it with Hadoop. Because the hadoop command does not have a streaming option, you have to use the jar option and point it at the Streaming JAR file, like this:

```
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb
```

If you are running this program on a large cluster, you can use the -combiner option to specify a combiner (for more information about combiners, see Study Notes (3)). In earlier versions of Streaming the combiner had to be implemented in Java, but from 1.x onward the combiner can be any Streaming command, like this:

```
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/src/main/ruby/max_temperature_map.rb | sort |
    ch02/src/main/ruby/max_temperature_reduce.rb" \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb \
  -file ch02/src/main/ruby/max_temperature_map.rb \
  -file ch02/src/main/ruby/max_temperature_reduce.rb
```

In this example we specify a directory as the input (-input input/ncdc/all), which contains gzip-compressed data files. The -file option tells Hadoop which files to copy out to the cluster (this is not needed when running in standalone mode). Note that in this example the -mapper option specifies both the mapper and, in effect, a combiner (max_temperature_reduce.rb, piped in after sort). Remember from Study Notes (3) that the Java combiner and reducer were implemented identically as well?

Using Python as an Example

If you are more familiar with Python, here are the map and reduce functions written in Python:

```python
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[15:19], val[87:92], val[92:93])
  if (temp != "+9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)
```
```python
#!/usr/bin/env python
import sys
(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
  print "%s\t%s" % (last_key, max_val)
```
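The two scripts above use Python 2 print statements and `sys.maxint`, neither of which exists in Python 3. A sketch of a Python 3 port, written here as functions so the logic can be exercised without stdin (in a real Streaming job each would remain its own script reading `sys.stdin`):

```python
#!/usr/bin/env python3
# Python 3 port of the map and reduce logic above (a sketch, same logic).
import re

def map_records(lines):
    """Yield "year\ttemp" for each valid fixed-width record."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and re.match("[01459]", q):
            yield "%s\t%s" % (year, temp)

def reduce_records(sorted_lines):
    """Yield the maximum temperature per key group over key-sorted input."""
    last_key, max_val = None, float("-inf")  # sys.maxint is gone in Python 3
    for line in sorted_lines:
        key, val = line.strip().split("\t")
        if last_key and last_key != key:
            yield "%s\s" % () if False else "%s\t%s" % (last_key, max_val)
            last_key, max_val = key, int(val)
        else:
            last_key, max_val = key, max(max_val, int(val))
    if last_key:
        yield "%s\t%s" % (last_key, max_val)
```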

Run the following command to test the scripts:

```
% cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
  sort | ch02/src/main/python/max_temperature_reduce.py
1949	111
1950	22
```

Please indicate the source when reprinting: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2807759.html
