MapReduce Advanced Programming


1. Chaining MapReduce jobs

2. Joining data from different data sources

<1>. Chaining MapReduce jobs

1.1 Chaining MapReduce jobs in a sequence

A MapReduce program can carry out complex data processing tasks. Generally, such a task needs to be divided into several smaller subtasks, each subtask is executed as a Hadoop job, and the subtask results are then collected to complete the overall task.

The simplest case is "sequential" execution, and the programming model is correspondingly simple. Recall that in MapReduce programming a job is launched with JobClient.runJob(); this call returns only after the job has completed. So, to execute tasks in sequence, you only need to start each new job after the previous one finishes.
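For instance, here is a minimal sketch using the classic JobClient API; the two JobConf objects are placeholders whose mapper, reducer, and I/O path settings are assumed to be configured where indicated:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialJobs {
    public static void main(String[] args) throws Exception {
        // First subtask: its mapper, reducer, and I/O paths are assumed
        // to be configured here (omitted for brevity).
        JobConf step1 = new JobConf(SequentialJobs.class);
        step1.setJobName("step-1");

        // Second subtask, typically reading the output path of step 1.
        JobConf step2 = new JobConf(SequentialJobs.class);
        step2.setJobName("step-2");

        // runJob() blocks until the job completes (and throws on failure),
        // so calling it twice runs the two jobs strictly in sequence.
        JobClient.runJob(step1);
        JobClient.runJob(step2);
    }
}
```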

1.2 Chaining MapReduce jobs with complex dependencies

However, in many cases simple sequential execution is not enough. For these cases Hadoop provides the JobControl class, which encapsulates a series of jobs and the dependencies between them.

JobControl provides an addJob method to add a job to its set of managed jobs;

at the same time, each Job class provides an addDependingJob method to declare the jobs it depends on.

For example, suppose we need to run five jobs: job1, job2, job3, job4, and job5, where job2 must run after job1 completes, job4 must run after job3 completes, and job5 can only run after both job2 and job4 have completed. In this case, the model can be built as follows:
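Below is a minimal sketch of that dependency graph using the classic org.apache.hadoop.mapred.jobcontrol API; the five JobConf objects are placeholders whose mapper, reducer, and I/O settings are assumed to be configured elsewhere:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobChain {
    public static void main(String[] args) throws Exception {
        // JobConfs for the five jobs; their mapper/reducer and I/O
        // settings are assumed to be filled in elsewhere.
        JobConf conf1 = new JobConf(JobChain.class);
        JobConf conf2 = new JobConf(JobChain.class);
        JobConf conf3 = new JobConf(JobChain.class);
        JobConf conf4 = new JobConf(JobChain.class);
        JobConf conf5 = new JobConf(JobChain.class);

        Job job1 = new Job(conf1);
        Job job2 = new Job(conf2);
        Job job3 = new Job(conf3);
        Job job4 = new Job(conf4);
        Job job5 = new Job(conf5);

        job2.addDependingJob(job1);  // job2 waits for job1
        job4.addDependingJob(job3);  // job4 waits for job3
        job5.addDependingJob(job2);  // job5 waits for job2 and job4
        job5.addDependingJob(job4);

        JobControl control = new JobControl("chained-jobs");
        control.addJob(job1);
        control.addJob(job2);
        control.addJob(job3);
        control.addJob(job4);
        control.addJob(job5);

        // JobControl implements Runnable; run it in a thread and poll
        // until every job in the group has finished.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}
```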

 

1.3 Chaining preprocessing and postprocessing steps

In Hadoop, each task can have multiple Mappers chained around a single Reducer (via the ChainMapper and ChainReducer classes). The program executes in the following sequence:

The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper is written to the task's output.

This is similar to the Linux pipe sequence: mapper1 | mapper2 | reducer1 | mapper3 | mapper4.
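As a sketch of how such a pipeline is assembled with the classic ChainMapper/ChainReducer API (Mapper1 through Mapper4 and Reducer1 are hypothetical user-defined old-API Mapper/Reducer implementations, not part of Hadoop):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJob {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainJob.class);
        job.setJobName("chain-example");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Preprocessing steps: mappers chained before the single reducer.
        // The boolean flag (byValue = true) passes records by value
        // between the chained stages.
        ChainMapper.addMapper(job, Mapper1.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainMapper.addMapper(job, Mapper2.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // The single reducer of the task.
        ChainReducer.setReducer(job, Reducer1.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // Postprocessing steps: mappers chained after the reducer.
        ChainReducer.addMapper(job, Mapper3.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainReducer.addMapper(job, Mapper4.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(job);
    }
}
```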

<2>. Joining data from different sources  

Suppose two data sources exist, such as customers and orders:

 

To analyze this data together, we need Hadoop's datajoin package. First, let's look at how a Hadoop data join proceeds.

First, the map function reads data from customers and orders and outputs tagged records such as <K1-2, V1-2> and <K2-1, V2-1>;

Then Hadoop runs the partition and shuffle. What differs from ordinary processing is that all records with the same group key are packaged together and sent to a single reduce invocation;

so the reduce function receives a group of records sharing the same group key.

 

Then, inside the reduce function, the data join is executed to form combinations (the cross product of records from the different sources), and each combination is passed to the combine() function, which emits one result record.

 

Hadoop provides the DataJoinMapperBase and DataJoinReducerBase base classes to implement data joins. The Mapper of a data join must inherit from DataJoinMapperBase and implement three methods: generateInputTag(String inputFile), generateTaggedMapOutput(Object value), and generateGroupKey(TaggedMapOutput aRecord).
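Here is a minimal sketch of such a Mapper, assuming both sources are comma-separated text whose first field is the join key, and that the data source name is encoded in the input file name (a hypothetical convention). TaggedWritable is a small concrete subclass of TaggedMapOutput defined here for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class JoinMapper extends DataJoinMapperBase {

    // Tag each record with its data source, derived from the file name,
    // e.g. "customers-part0" -> tag "customers" (hypothetical naming).
    protected Text generateInputTag(String inputFile) {
        return new Text(inputFile.split("-")[0]);
    }

    // The group key is the join key: here, the first comma-separated field.
    protected Text generateGroupKey(TaggedMapOutput aRecord) {
        String line = ((Text) aRecord.getData()).toString();
        return new Text(line.split(",")[0]);
    }

    // Wrap the raw record and attach the tag of the current input.
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedWritable retv = new TaggedWritable((Text) value);
        retv.setTag(this.inputTag);
        return retv;
    }

    // Concrete TaggedMapOutput: the tag plus the original Text record.
    public static class TaggedWritable extends TaggedMapOutput {
        private Writable data;

        // Empty constructor required by Hadoop's reflection-based
        // deserialization.
        public TaggedWritable() {
            this.tag = new Text();
            this.data = new Text();
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public Writable getData() {
            return data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            this.data.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            this.data.readFields(in);
        }
    }
}
```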

The Reducer that implements the data join must inherit from DataJoinReducerBase and override the combine method. The first thing to clarify is what combine operates on. Quoting the API documentation:

For each tuple in the cross product, it calls the following method, which is expected to be implemented in a subclass:

protected abstract TaggedMapOutput combine(Object[] tags, Object[] values);

This method is expected to produce one output value from an array of records from different sources.

combine therefore operates on one tuple of the cross product, that is, an array containing one record from each source:

 

In other words, combine is only responsible for assembling the final output record from the joined records.
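As a sketch, here is one possible combine() with inner-join semantics, reusing the hypothetical TaggedWritable class from the Mapper sketch above; it drops unmatched records and concatenates the remaining comma-separated payloads:

```java
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinReducer extends DataJoinReducerBase {

    // Called once per tuple of the cross product: values[i] is the record
    // tagged with tags[i]. Returning null drops the tuple (inner join).
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        if (tags.length < 2) {
            return null;  // key present in only one source: not joined
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                joined.append(",");
            }
            TaggedMapOutput record = (TaggedMapOutput) values[i];
            String line = ((Text) record.getData()).toString();
            // Keep the join key only once, from the first record; the
            // split assumes each record has at least two fields.
            joined.append(i == 0 ? line : line.split(",", 2)[1]);
        }
        JoinMapper.TaggedWritable retv =
                new JoinMapper.TaggedWritable(new Text(joined.toString()));
        retv.setTag((Text) tags[0]);
        return retv;
    }
}
```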
