MapReduce Advanced Programming


1. Chaining MapReduce jobs

2. Joining data from different data sources

<1>. Chaining MapReduce jobs

1.1 Chaining MapReduce jobs in a sequence

A MapReduce program can carry out complex data processing tasks. Generally, such a task needs to be divided into several smaller subtasks, each subtask is executed as a Hadoop job, and the subtask results are then collected to complete the overall task.

The simplest case is "sequential" execution, and the programming model is correspondingly simple. Recall that in MapReduce programming a job is launched with JobClient.runJob(); this call returns only after the job has completed. So, to execute tasks in sequence, you only need to start each new job after the previous one finishes.
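For instance, here is a minimal sketch using the classic JobClient API; the two JobConf objects are placeholders whose mapper, reducer, and I/O path settings are assumed to be configured where indicated:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialJobs {
    public static void main(String[] args) throws Exception {
        // First subtask: its mapper, reducer, and I/O paths are assumed
        // to be configured here (omitted for brevity).
        JobConf step1 = new JobConf(SequentialJobs.class);
        step1.setJobName("step-1");

        // Second subtask, typically reading the output path of step 1.
        JobConf step2 = new JobConf(SequentialJobs.class);
        step2.setJobName("step-2");

        // runJob() blocks until the job completes (and throws on failure),
        // so calling it twice runs the two jobs strictly in sequence.
        JobClient.runJob(step1);
        JobClient.runJob(step2);
    }
}
```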

1.2 Chaining MapReduce jobs with complex dependencies

However, in many cases simple sequential execution is not enough. For these cases Hadoop provides the JobControl class, which encapsulates a series of jobs and the dependencies between them.

JobControl provides an addJob method to add a job to its set of managed jobs;

at the same time, each Job class provides an addDependingJob method to declare the jobs it depends on.

For example, suppose we need to run five jobs: job1, job2, job3, job4, and job5, where job2 must run after job1 completes, job4 must run after job3 completes, and job5 can only run after both job2 and job4 have completed. In this case, the model can be built as follows:
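Below is a minimal sketch of that dependency graph using the classic org.apache.hadoop.mapred.jobcontrol API; the five JobConf objects are placeholders whose mapper, reducer, and I/O settings are assumed to be configured elsewhere:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobChain {
    public static void main(String[] args) throws Exception {
        // JobConfs for the five jobs; their mapper/reducer and I/O
        // settings are assumed to be filled in elsewhere.
        JobConf conf1 = new JobConf(JobChain.class);
        JobConf conf2 = new JobConf(JobChain.class);
        JobConf conf3 = new JobConf(JobChain.class);
        JobConf conf4 = new JobConf(JobChain.class);
        JobConf conf5 = new JobConf(JobChain.class);

        Job job1 = new Job(conf1);
        Job job2 = new Job(conf2);
        Job job3 = new Job(conf3);
        Job job4 = new Job(conf4);
        Job job5 = new Job(conf5);

        job2.addDependingJob(job1);  // job2 waits for job1
        job4.addDependingJob(job3);  // job4 waits for job3
        job5.addDependingJob(job2);  // job5 waits for job2 and job4
        job5.addDependingJob(job4);

        JobControl control = new JobControl("chained-jobs");
        control.addJob(job1);
        control.addJob(job2);
        control.addJob(job3);
        control.addJob(job4);
        control.addJob(job5);

        // JobControl implements Runnable; run it in a thread and poll
        // until every job in the group has finished.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}
```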

 

1.3 Chaining preprocessing and postprocessing steps

In Hadoop, each task can have multiple Mappers chained around a single Reducer (via the ChainMapper and ChainReducer classes). The program executes in the following sequence:

The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper is written to the task's output.

This is similar to the Linux pipe sequence: mapper1 | mapper2 | reducer1 | mapper3 | mapper4.
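As a sketch of how such a pipeline is assembled with the classic ChainMapper/ChainReducer API (Mapper1 through Mapper4 and Reducer1 are hypothetical user-defined old-API Mapper/Reducer implementations, not part of Hadoop):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJob {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainJob.class);
        job.setJobName("chain-example");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Preprocessing steps: mappers chained before the single reducer.
        // The boolean flag (byValue = true) passes records by value
        // between the chained stages.
        ChainMapper.addMapper(job, Mapper1.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainMapper.addMapper(job, Mapper2.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // The single reducer of the task.
        ChainReducer.setReducer(job, Reducer1.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // Postprocessing steps: mappers chained after the reducer.
        ChainReducer.addMapper(job, Mapper3.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainReducer.addMapper(job, Mapper4.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(job);
    }
}
```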

<2>. Joining data from different sources  

Suppose two data sources exist, such as customers and orders:

 

To analyze this data together, we need Hadoop's datajoin package. First, let's look at how a Hadoop data join proceeds.

First, the map function reads data from customers and orders and outputs tagged records such as <K1-2, V1-2> and <K2-1, V2-1>;

Then Hadoop runs the partition and shuffle. What differs from ordinary processing is that all records with the same group key are packaged together and sent to a single reduce invocation;

so the reduce function receives a group of records sharing the same group key.

 

Then, inside the reduce function, the data join is executed to form combinations (the cross product of records from the different sources), and each combination is passed to the combine() function, which emits one result record.

 

Hadoop provides the DataJoinMapperBase and DataJoinReducerBase base classes to implement data joins. The Mapper of a data join must inherit from DataJoinMapperBase and implement three methods: generateInputTag(String inputFile), generateTaggedMapOutput(Object value), and generateGroupKey(TaggedMapOutput aRecord).
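Here is a minimal sketch of such a Mapper, assuming both sources are comma-separated text whose first field is the join key, and that the data source name is encoded in the input file name (a hypothetical convention). TaggedWritable is a small concrete subclass of TaggedMapOutput defined here for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class JoinMapper extends DataJoinMapperBase {

    // Tag each record with its data source, derived from the file name,
    // e.g. "customers-part0" -> tag "customers" (hypothetical naming).
    protected Text generateInputTag(String inputFile) {
        return new Text(inputFile.split("-")[0]);
    }

    // The group key is the join key: here, the first comma-separated field.
    protected Text generateGroupKey(TaggedMapOutput aRecord) {
        String line = ((Text) aRecord.getData()).toString();
        return new Text(line.split(",")[0]);
    }

    // Wrap the raw record and attach the tag of the current input.
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedWritable retv = new TaggedWritable((Text) value);
        retv.setTag(this.inputTag);
        return retv;
    }

    // Concrete TaggedMapOutput: the tag plus the original Text record.
    public static class TaggedWritable extends TaggedMapOutput {
        private Writable data;

        // Empty constructor required by Hadoop's reflection-based
        // deserialization.
        public TaggedWritable() {
            this.tag = new Text();
            this.data = new Text();
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public Writable getData() {
            return data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            this.data.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            this.data.readFields(in);
        }
    }
}
```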

The Reducer that implements the data join must inherit from DataJoinReducerBase and override the combine method. The first thing to clarify is what combine operates on. Quoting the API documentation:

For each tuple in the cross product, it calls the following method, which is expected to be implemented in a subclass:

protected abstract TaggedMapOutput combine(Object[] tags, Object[] values);

This method is expected to produce one output value from an array of records from different sources.

combine therefore operates on one tuple of the cross product, that is, an array containing one record from each source:

 

In other words, combine is only responsible for assembling the final output record from the joined records.
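As a sketch, here is one possible combine() with inner-join semantics, reusing the hypothetical TaggedWritable class from the Mapper sketch above; it drops unmatched records and concatenates the remaining comma-separated payloads:

```java
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;

public class JoinReducer extends DataJoinReducerBase {

    // Called once per tuple of the cross product: values[i] is the record
    // tagged with tags[i]. Returning null drops the tuple (inner join).
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        if (tags.length < 2) {
            return null;  // key present in only one source: not joined
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                joined.append(",");
            }
            TaggedMapOutput record = (TaggedMapOutput) values[i];
            String line = ((Text) record.getData()).toString();
            // Keep the join key only once, from the first record; the
            // split assumes each record has at least two fields.
            joined.append(i == 0 ? line : line.split(",", 2)[1]);
        }
        JoinMapper.TaggedWritable retv =
                new JoinMapper.TaggedWritable(new Text(joined.toString()));
        retv.setTag((Text) tags[0]);
        return retv;
    }
}
```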
