MR Summary (II): MapReduce Program Design

Last Update:2018-06-12 Source: Internet

Author: User

Although many books describe how to use the MapReduce APIs, they seldom describe how to design a MapReduce application.

The power of MapReduce comes mainly from its simplicity. Apart from preparing the input data, programmers only need to implement the mapper and the reducer. In practice, many problems can be solved this way.

In most cases, MapReduce can be used as a general parallel execution framework that takes full advantage of data locality. However, this simplicity comes at a price. Designers must decide how to express their business problem in terms of a small number of components that are combined in a specific way.

To recast the original problem in MapReduce terms, it is usually necessary to answer the following questions:

1. How to break down a big problem into multiple small tasks? More specifically, how do you break down the problem so that these small tasks can be executed in parallel?

2. Which key/value pairs will you choose as the input/output of each task?

3. How do you bring together all the data required for a calculation? More specifically, how do you arrange the processing so that all the data needed for a calculation is in memory at the same time?
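The three questions above can be made concrete with a small in-memory simulation. This is only a sketch of the programming model, not Hadoop itself: the function run_mapreduce and the sample problem (the longest word length per first letter) are assumptions made for illustration. The mapper answers question 1 (each line is an independent task), the emitted pairs answer question 2, and the grouping step answers question 3.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate a single MapReduce job in memory (illustrative only)."""
    # Map phase: every record is processed independently, so on a real
    # cluster these calls could run in parallel on different machines.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle and sort: collect all values that share a key, so that
    # everything needed for one calculation reaches one reducer call.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical problem: longest word length for each first letter.
def length_mapper(line):
    for word in line.split():
        yield word[0].lower(), len(word)

def max_reducer(letter, lengths):
    yield letter, max(lengths)

result = run_mapreduce(
    ["apple avocado banana", "blueberry apricot"],
    length_mapper,
    max_reducer,
)
print(result)
```

Choosing the first letter as the key is what guarantees that all lengths for that letter end up in the same reducer call.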

Keep in mind that many algorithms cannot be easily expressed as a single MapReduce job. It is often necessary to break a complex algorithm down into a series of jobs, with the output of one job becoming the input of the next.

This section walks through several examples (from simple to complex) of designing different practical MapReduce applications. All examples are described as follows:
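As a rough illustration of chaining, the sketch below runs two jobs back to back in memory: the first counts words, and the second takes those counts as its input and groups words by how often they occur. The helper run_job and the sample problem are hypothetical stand-ins, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    # Map every record, then sort the pairs so equal keys are adjacent.
    pairs = sorted(kv for record in records for kv in mapper(record))
    # Group by key and hand each group to the reducer.
    return [
        out
        for key, group in groupby(pairs, key=itemgetter(0))
        for out in reducer(key, [value for _, value in group])
    ]

# Job 1: count how often each word occurs.
def count_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(word, ones):
    yield word, sum(ones)

# Job 2: consume the (word, count) pairs of job 1 and group words by count.
def invert_mapper(word_count):
    word, count = word_count
    yield count, word

def collect_reducer(count, words):
    yield count, sorted(words)

word_counts = run_job(["a b a", "b c d"], count_mapper, count_reducer)
words_by_count = run_job(word_counts, invert_mapper, collect_reducer)
print(words_by_count)
```

The second job cannot start until the first has finished, because its key (the count) does not exist until the first reduce phase has produced it.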

Brief description of the problem
MapReduce job description, including:

1. Mapper description

2. Reducer description

In the simplest case, the MapReduce implementation is very simple: the only thing required is a mapper, which processes each record separately and then outputs the result. Here, MapReduce controls the distribution of the mappers and provides all the support for scheduling and error handling. The following example shows how to design an application of this type.

Example of Face Recognition

Although it is not often discussed as a Hadoop-related problem, image processing fits the MapReduce paradigm well. Assume that a face recognition algorithm takes an image, recognizes a set of desired features, and produces a set of recognition results. Now assume that face recognition must be run on millions of images. If all the images are stored in Hadoop as sequence files, a simple map-only job can process them in parallel. In this example, the input key/value pair is ImageID/Image, and the output key/value pair is ImageID/list of recognized features. In addition, the set of recognizable features must be distributed to all the mappers (for example, through the distributed cache).

Face recognition job

Mapper	In this job, the mapper is first initialized with the set of recognizable features. For each image, the map function invokes the face recognition algorithm, passing it the image itself along with the list of recognizable features. The recognition result is output from the map together with the original ImageID.
Result	The result of this job is the set of recognition results for all the original images.
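The job described in the table can be sketched as follows. This is a minimal in-memory sketch, not a Hadoop program: the class FaceRecognitionMapper, the function run_map_only_job, and above all recognize_faces are hypothetical. In particular, recognize_faces is a trivial placeholder that looks for byte patterns, standing in for a real recognition algorithm, and the feature dictionary passed to setup stands in for data delivered through the distributed cache.

```python
def recognize_faces(image, known_features):
    """Placeholder for a real face recognition algorithm (assumption):
    reports every known feature whose byte pattern occurs in the image."""
    return [name for name, pattern in known_features.items() if pattern in image]

class FaceRecognitionMapper:
    def setup(self, known_features):
        # Called once per mapper, before any record is processed.
        self.known_features = known_features

    def map(self, image_id, image):
        # Each image is handled on its own; no state is shared between calls.
        yield image_id, recognize_faces(image, self.known_features)

def run_map_only_job(records, known_features):
    """Run the mapper over every (image_id, image) record; no reduce step."""
    mapper = FaceRecognitionMapper()
    mapper.setup(known_features)
    return [pair for image_id, image in records
            for pair in mapper.map(image_id, image)]

# Made-up sample data: feature names mapped to byte patterns.
features = {"alice": b"\x01\x02", "bob": b"\x03\x04"}
images = [
    ("img-1", b"\x00\x01\x02\x09"),
    ("img-2", b"\x03\x04\x01\x02"),
    ("img-3", b"\x07"),
]
results = run_map_only_job(images, features)
print(results)
```

Because map never looks at any other record, the framework is free to split the images across as many mappers as it likes.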

Note: Mappers and reducers must be completely independent of each other. Each mapper or reducer in a MapReduce application creates its own output file. This means that the result of the face recognition job is a set of files (in the same directory), each containing the output of an individual mapper. If you need them in a single file, a single reducer must be added to the face recognition job. This reducer is very simple: because each key passed to reduce has exactly one value (assuming that image IDs are unique), the reducer just writes the input key/value pair to the output file. Even though the reducer is extremely simple, adding it noticeably increases the overall running time of the job, because a reducer brings in the shuffle and sort phase, which a map-only job does not have. When the number of images is very large, this phase takes a lot of time.

An example of a problem that needs more than a mapper is building an inverted index. This type of problem requires all the MapReduce steps to be executed: shuffle and sort are required to bring all the results together. The following example shows how to design an application of this type.
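The effect of adding a single reducer can be sketched in memory as follows. The names identity_reducer and merge_with_single_reducer are hypothetical, and the lists of pairs stand in for the separate output files written by individual mappers. The sorting step is what represents the extra shuffle and sort cost.

```python
def identity_reducer(key, values):
    # Writes each input key/value pair to the output unchanged.
    for value in values:
        yield key, value

def merge_with_single_reducer(mapper_outputs):
    """Combine per-mapper outputs into one list via a single reducer."""
    # Shuffle and sort: gather the pairs from every mapper and order them
    # by key. This is the step a map-only job never has to pay for.
    pairs = sorted(
        (pair for part in mapper_outputs for pair in part),
        key=lambda kv: kv[0],
    )
    merged = []
    for key, value in pairs:
        # Image IDs are assumed unique, so each key has exactly one value.
        merged.extend(identity_reducer(key, [value]))
    return merged

# Made-up outputs of two mappers, each normally a separate file.
part_0 = [("img-3", ["bob"]), ("img-1", ["alice"])]
part_1 = [("img-2", [])]
single_output = merge_with_single_reducer([part_0, part_1])
print(single_output)
```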

Example of inverted index

In computer science, an inverted index is a data structure that stores a mapping from content (such as words or numbers) to its locations in a document or a set of documents (see Tables 2-1 and 2-2). The purpose of an inverted index is to allow fast full-text searches, at the cost of additional processing when a document is added. The inverted index data structure is a central component of a typical search engine, because it optimizes the speed of finding the documents in which a certain word occurs.

Document
ID	Title	Content
1	Popular	Football is Popular in US
2	Common Sport	Soccer is commonly played in Europe
3	National Sport	Cricket is played all over India
...	...	...

Table 2-1: document structure

Inverted index
Term	Value	Document	Document	Document
Title	Popular	1
Title	Sport	1	2	3
Title	Common	2
Title	National	3
Content	Football	1
Content	Is	1	2	3
Content	Popular	1
...	...	...	...	...

Table 2-2: inverted index

To create an inverted index, you can feed each document (or each line of a document) to the mapper. The mapper parses the words in the document and outputs [word, word frequency] key/value pairs. The reducer can simply be an identity function that writes out the list, or it can perform some statistical aggregation for each word.
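A minimal in-memory sketch of this job is shown below, using the Content column of Table 2-1 as input. It makes one assumption that departs from the description above: the mapper emits the document ID as the value for each word, since document IDs are what Table 2-2 lists. The function names are hypothetical, and the dictionary built in build_inverted_index plays the role of shuffle and sort.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    # Emit one (word, document ID) pair per word occurrence.
    for word in text.lower().split():
        yield word, doc_id

def index_reducer(word, doc_ids):
    # Aggregate per word: keep each document once, in ascending order.
    yield word, sorted(set(doc_ids))

def build_inverted_index(documents):
    """Simulate map, shuffle/sort, and reduce for the indexing job."""
    postings = defaultdict(list)
    for doc_id, text in documents:
        for word, value in index_mapper(doc_id, text):
            postings[word].append(value)

    index = {}
    for word in sorted(postings):
        for key, ids in index_reducer(word, postings[word]):
            index[key] = ids
    return index

documents = [
    (1, "Football is Popular in US"),
    (2, "Soccer is commonly played in Europe"),
    (3, "Cricket is played all over India"),
]
index = build_inverted_index(documents)
print(index["is"], index["played"], index["football"])
```

Replacing index_reducer with a function that yields the pairs unchanged would give the identity variant mentioned above, at the price of duplicate entries when a word occurs more than once in a document.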

Note: In chapter 9, you will learn

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
