Hadoop--07--MapReduce Advanced Programming


<1>. Chaining MapReduce jobs

1.1 Chaining MapReduce jobs in a sequence

A MapReduce program can carry out complex data processing, typically by splitting the task into smaller subtasks, running each subtask as a job in Hadoop, and then collecting the subtask results to complete the overall task.

The simplest approach is sequential execution, and the programming model is equally simple. In MapReduce programming, JobClient.runJob() starts a job and does not return until that job completes, so to run jobs in sequence we only need to start a new job after each previous one finishes.
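A minimal sketch of this pattern with the classic mapred API (the job names and configuration details here are placeholders):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialJobs {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(SequentialJobs.class);
        conf1.setJobName("job1");
        // ... set mapper, reducer, input and output paths for job1 ...

        // runJob() blocks until job1 finishes and throws on failure,
        // so the statement after it runs only once job1 has completed.
        JobClient.runJob(conf1);

        JobConf conf2 = new JobConf(SequentialJobs.class);
        conf2.setJobName("job2");
        // ... job2 would typically read the output directory of job1 ...
        JobClient.runJob(conf2);
    }
}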


1.2 Chaining MapReduce jobs with complex dependency

In many cases, however, simple sequential execution is not enough. Hadoop provides the JobControl class to encapsulate a series of jobs and the dependencies between them.

JobControl provides the addJob method for adding a job to its collection of jobs;

the Job class, in turn, provides the addDependingJob method for declaring that one job depends on another.

For example, suppose we need to run five jobs: job1, job2, job3, job4, and job5, where job2 must run after job1 completes, job4 must run after job3 completes, and job5 must run after both job2 and job4 complete. The dependency model can then be built like this:
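A sketch of that model using JobControl, assuming conf1 through conf5 are fully configured JobConf objects for the five jobs:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public static void runDependentJobs(JobConf conf1, JobConf conf2, JobConf conf3,
                                    JobConf conf4, JobConf conf5) throws Exception {
    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    Job job3 = new Job(conf3);
    Job job4 = new Job(conf4);
    Job job5 = new Job(conf5);

    job2.addDependingJob(job1);  // job2 starts only after job1 completes
    job4.addDependingJob(job3);  // job4 starts only after job3 completes
    job5.addDependingJob(job2);  // job5 waits for job2 ...
    job5.addDependingJob(job4);  // ... and for job4

    JobControl jc = new JobControl("dependent-jobs");
    jc.addJob(job1);
    jc.addJob(job2);
    jc.addJob(job3);
    jc.addJob(job4);
    jc.addJob(job5);

    // JobControl implements Runnable: run it in its own thread and
    // poll until every job in the group has finished.
    Thread runner = new Thread(jc);
    runner.start();
    while (!jc.allFinished()) {
        Thread.sleep(5000);
    }
    jc.stop();
}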

1.3 Chaining preprocessing and postprocessing steps

In Hadoop, a single job can chain together multiple Mapper classes and a Reducer class, which execute in the following order:

The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper will be written to the task's output.

This is like a pipeline of commands under Linux: Mapper1 | Mapper2 | Reducer1 | Mapper3 | Mapper4.
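Here is a sketch of how such a pipeline could be wired up with ChainMapper and ChainReducer from the old mapred API. Mapper1 through Mapper4, Reducer1, and the driver class ChainExample stand for hypothetical user classes, and the key/value types are illustrative:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainExample {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainExample.class);
        job.setJobName("chained job");

        // Mapper1 | Mapper2 run inside the map task, piped one into the next.
        ChainMapper.addMapper(job, Mapper1.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainMapper.addMapper(job, Mapper2.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // Reducer1 is the single reducer of the chain.
        ChainReducer.setReducer(job, Reducer1.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // Mapper3 | Mapper4 post-process the reducer's output in the reduce task.
        ChainReducer.addMapper(job, Mapper3.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));
        ChainReducer.addMapper(job, Mapper4.class,
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(job);
    }
}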

<2>. Joining data from different sources

Suppose there are two data sources, such as customers and orders. If we want to analyze them together, we need Hadoop's DataJoin feature. First, let's look at how a Hadoop data join proceeds.

First, the map function reads records from both customers and orders and, for each record, outputs a key/value pair in which the value is tagged with its data source;

then Hadoop performs the partition and shuffle which, no differently from an ordinary job, sends all records with the same group key to the same reduce function;

the reduce function then receives the full set of records sharing that group key.

Inside the reduce function, the data join is performed by forming combinations (a cross product) of the records from the different sources; each combination is then passed to the combine() function, which outputs one result record.

Hadoop provides the DataJoinMapperBase and DataJoinReducerBase base classes for implementing data joins. A data-join mapper must inherit from DataJoinMapperBase and implement three methods:
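The three abstract methods declared by DataJoinMapperBase (in the contrib datajoin package) are:

// Determine the tag that names this record's data source (for example
// "customers" or "orders"), based on the input file name.
protected abstract Text generateInputTag(String inputFile);

// Wrap one raw input record into a TaggedMapOutput that carries its tag.
protected abstract TaggedMapOutput generateTaggedMapOutput(Object value);

// Extract the group (join) key, such as the customer ID, from a tagged record.
protected abstract Text generateGroupKey(TaggedMapOutput aRecord);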

A reducer implementing the data join must inherit from DataJoinReducerBase and override the combine method. The first thing to be clear about is what combine operates on:

For each tuple in the cross product, it calls the following method, which is expected to be implemented in a subclass: protected abstract TaggedMapOutput combine(Object[] tags, Object[] values); The above method is expected to produce one output value from an array of records of different sources.

combine is thus applied to an array of records from different sources that share the same group key; in other words, combine is only responsible for producing the final output format.
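For illustration, here is a minimal combine() sketch for an inner join. It assumes comma-separated records with the join key in the first field, and a hypothetical TaggedWritable subclass of TaggedMapOutput that wraps a Text record; neither assumption comes from the text above:

// Inner-join combine(): expects at most one record per source in values.
protected TaggedMapOutput combine(Object[] tags, Object[] values) {
    // Skip group keys that are missing from one of the sources.
    if (tags.length < 2) {
        return null;
    }
    StringBuilder joined = new StringBuilder();
    for (int i = 0; i < values.length; i++) {
        if (i > 0) {
            joined.append(",");
        }
        String record = ((Text) ((TaggedMapOutput) values[i]).getData()).toString();
        // Keep the join key once: strip it from every record but the first.
        joined.append(i == 0 ? record : record.split(",", 2)[1]);
    }
    TaggedWritable output = new TaggedWritable(new Text(joined.toString()));
    output.setTag((Text) tags[0]);
    return output;
}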
