Dive into Hadoop Pipes


Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

In this chapter we'll rewrite the example in C++ and then see how to run it with Pipes. Example 2-12 shows the source code for the map and reduce functions written in C++.

Example 2-12: Maximum temperature program written in C++

#include <algorithm>
#include <climits>
#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }
  void map(HadoopPipes::MapContext& context) {
    // Each input value is one line of an NCDC weather record
    std::string line = context.getInputValue();
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    // Skip missing readings and emit only values with a valid quality code
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    // Iterate over all values for the current key, keeping the maximum
    while (context.nextValue()) {
      maxValue = std::max(maxValue,
                          HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                                           MapTemperatureReducer>());
}

The application links against the Hadoop C++ library, which is a thin wrapper for communicating with the tasktracker child process. The map and reduce functions are defined by extending the Mapper and Reducer classes in the HadoopPipes namespace and providing implementations of the map() and reduce() methods. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as access to job configuration information through the JobConf class. The processing in this example is very similar to the Java equivalent.
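For example, a task can read custom job properties through the context's JobConf. The fragment below is a minimal sketch (not part of the original example) meant to sit inside map() or reduce(); it assumes the JobConf accessors declared in hadoop/Pipes.hh, and "my.custom.property" is a hypothetical property name.

const HadoopPipes::JobConf* conf = context.getJobConf();
// hasKey() guards against a missing property; get() returns the value as a string
if (conf != NULL && conf->hasKey("my.custom.property")) {
  std::string value = conf->get("my.custom.property");
  // ... use value to parameterize the map() or reduce() logic ...
}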

Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it places a somewhat heavier burden on the application developer, who has to convert between strings and richer types. This is evident in MapTemperatureReducer, where we have to convert the input value to an integer (using a convenience method in HadoopUtils) and then convert the maximum value back to a string before it is written out. In some cases we can omit the conversion, as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer because it is never treated as a number in the map() method.
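As a concrete illustration of these conversions (a sketch, not part of the original listing), the reducer's inner logic effectively does the following, using the helpers from hadoop/StringUtils.hh:

// string -> int: parse the incoming value
int temperature = HadoopUtils::toInt(context.getInputValue());
maxValue = std::max(maxValue, temperature);
// int -> string: render the result before emitting it
context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));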

The main() method is the application's entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a factory so that it can create instances of the Mapper or Reducer; which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
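For instance, because taking a maximum is associative and commutative, the reducer could also be registered as a combiner. The following is a minimal sketch, not from the original example; it assumes the TemplateFactory template parameters are ordered mapper, reducer, partitioner, combiner (check TemplateFactory.hh in your Hadoop distribution before relying on this):

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                   MapTemperatureReducer,
                                   void,                  // keep the default partitioner
                                   MapTemperatureReducer  // reuse the reducer as a combiner
                                   >());
}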

Compiling and running

Now we can compile and link the program using the Makefile in Example 2-13.

Example 2-13: Makefile for the C++ version of the MapReduce program

CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall \
	  -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	  -lhadooputils -lpthread -g -O2 -o $@

The Makefile expects a couple of environment variables to be set. In addition to HADOOP_INSTALL (which you should already have set if you followed the installation instructions in Appendix A), you need to define PLATFORM, which specifies the operating system, architecture, and data model (for example, 32- or 64-bit). I compiled it on a 32-bit Linux system as follows:

% export PLATFORM=Linux-i386-32
% make

On successful completion, you'll find the max_temperature executable in the current directory.

To run a Pipes job, we need to run Hadoop in pseudo-distributed mode, where all the daemons run on the local machine; the setup steps are described in Appendix A. Pipes does not work in standalone (local) mode, because it relies on Hadoop's distributed cache mechanism, which works only when HDFS is running.
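For reference, on a Hadoop 1.x installation laid out as described in Appendix A, the daemons are typically started with something like the following (the exact scripts vary between Hadoop versions):

% start-dfs.sh
% start-mapred.sh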

With the Hadoop daemons running, the first step is to copy the executable to HDFS so that the tasktrackers can pick it up when they launch map and reduce tasks:

% hadoop fs -put max_temperature bin/max_temperature

The sample data also needs to be copied from the local filesystem to HDFS:

% hadoop fs -put input/ncdc/sample.txt sample.txt

Now we can run the job. To do so, we use the hadoop pipes command, passing the URI of the executable in HDFS with the -program argument:

% hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input sample.txt \
    -output output \
    -program bin/max_temperature

We use the -D option to specify two properties, hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified a C++ record reader or record writer, but that we want to use the default Java ones (which are for text input and output). Pipes also allows you to set a Java mapper, reducer, combiner, or partitioner. In fact, in any one job you can have a mixture of Java and C++ classes.

The result is the same as that produced by the versions of this program written in other languages.
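For example, you can inspect the output in the usual way (assuming the default text output format; the part file name depends on the number of reducers):

% hadoop fs -cat output/part-00000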

