Dive into Hadoop Pipes


Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

In this chapter we'll rewrite the example in C++ and then see how to run it with Pipes. Example 2-12 shows the source code for the map and reduce functions written in C++.

Example 2-12: Maximum temperature program written in C++

#include <algorithm>
#include <climits>
#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }
  void map(HadoopPipes::MapContext& context) {
    // Each input value is one line of an NCDC weather record
    std::string line = context.getInputValue();
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    // Skip missing readings and emit only values with a valid quality code
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    // Iterate over all values for the current key, keeping the maximum
    while (context.nextValue()) {
      maxValue = std::max(maxValue,
                          HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                                           MapTemperatureReducer>());
}

The application links against the Hadoop C++ library, which is a thin wrapper for communicating with the tasktracker child process. The map and reduce functions are defined by extending the Mapper and Reducer classes in the HadoopPipes namespace and providing implementations of the map() and reduce() methods. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as access to job configuration information through the JobConf class. The processing in this example is very similar to the Java equivalent.
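For example, a task can read custom job properties through the context's JobConf. The fragment below is a minimal sketch (not part of the original example) meant to sit inside map() or reduce(); it assumes the JobConf accessors declared in hadoop/Pipes.hh, and "my.custom.property" is a hypothetical property name.

const HadoopPipes::JobConf* conf = context.getJobConf();
// hasKey() guards against a missing property; get() returns the value as a string
if (conf != NULL && conf->hasKey("my.custom.property")) {
  std::string value = conf->get("my.custom.property");
  // ... use value to parameterize the map() or reduce() logic ...
}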

Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it places a somewhat heavier burden on the application developer, who has to convert between strings and richer types. This is evident in MapTemperatureReducer, where we have to convert the input value to an integer (using a convenience method in HadoopUtils) and then convert the maximum value back to a string before it is written out. In some cases we can omit the conversion, as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer because it is never treated as a number in the map() method.
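As a concrete illustration of these conversions (a sketch, not part of the original listing), the reducer's inner logic effectively does the following, using the helpers from hadoop/StringUtils.hh:

// string -> int: parse the incoming value
int temperature = HadoopUtils::toInt(context.getInputValue());
maxValue = std::max(maxValue, temperature);
// int -> string: render the result before emitting it
context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));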

The main() method is the application's entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a factory so that it can create instances of the Mapper or Reducer; which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
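For instance, because taking a maximum is associative and commutative, the reducer could also be registered as a combiner. The following is a minimal sketch, not from the original example; it assumes the TemplateFactory template parameters are ordered mapper, reducer, partitioner, combiner (check TemplateFactory.hh in your Hadoop distribution before relying on this):

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                   MapTemperatureReducer,
                                   void,                  // keep the default partitioner
                                   MapTemperatureReducer  // reuse the reducer as a combiner
                                   >());
}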

Compiling and running

Now we can compile and link the program using the Makefile in Example 2-13.

Example 2-13: Makefile for the C++ version of the MapReduce program

CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall \
	  -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	  -lhadooputils -lpthread -g -O2 -o $@

The Makefile expects a couple of environment variables to be set. In addition to HADOOP_INSTALL (which you should already have set if you followed the installation instructions in Appendix A), you need to define PLATFORM, which specifies the operating system, architecture, and data model (for example, 32- or 64-bit). I compiled it on a 32-bit Linux system as follows:

% export PLATFORM=Linux-i386-32
% make

On successful completion, you'll find the max_temperature executable in the current directory.

To run a Pipes job, we need to run Hadoop in pseudo-distributed mode, where all the daemons run on the local machine; the setup steps are described in Appendix A. Pipes does not work in standalone (local) mode, because it relies on Hadoop's distributed cache mechanism, which works only when HDFS is running.
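For reference, on a Hadoop 1.x installation laid out as described in Appendix A, the daemons are typically started with something like the following (the exact scripts vary between Hadoop versions):

% start-dfs.sh
% start-mapred.sh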

With the Hadoop daemons running, the first step is to copy the executable to HDFS so that the tasktrackers can pick it up when they launch map and reduce tasks:

% hadoop fs -put max_temperature bin/max_temperature

The sample data also needs to be copied from the local filesystem to HDFS:

% hadoop fs -put input/ncdc/sample.txt sample.txt

Now we can run the job. To do so, we use the hadoop pipes command, passing the URI of the executable in HDFS with the -program argument:

% hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input sample.txt \
    -output output \
    -program bin/max_temperature

We use the -D option to specify two properties, hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified a C++ record reader or record writer, but that we want to use the default Java ones (which are for text input and output). Pipes also allows you to set a Java mapper, reducer, combiner, or partitioner. In fact, in any one job you can have a mixture of Java and C++ classes.

The result is the same as that produced by the versions of this program written in other languages.
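For example, you can inspect the output in the usual way (assuming the default text output format; the part file name depends on the number of reducers):

% hadoop fs -cat output/part-00000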

