MR Summary (II): MapReduce Program Design

Source: Internet
Author: User

Although many books describe how to use the MapReduce APIs, they seldom describe how to design a MapReduce application.

The power of MapReduce comes mainly from its simplicity. Apart from preparing the input data, programmers only need to implement the mapper and the reducer. In practice, many problems can be solved this way.

In most cases, MapReduce can be used as a general parallel execution framework that takes full advantage of data locality. However, this simplicity comes at a cost: designers must decide how to express their business problem in terms of a small number of components that are combined in a specific way.

To reformulate the original problem in MapReduce terms, it is usually necessary to answer the following questions:

1. How to break down a big problem into multiple small tasks? More specifically, how do you break down the problem so that these small tasks can be executed in parallel?

2. Which key/value pairs will you choose as the input/output of each task?

3. How do you bring together all the data required for a calculation? More specifically, how do you organize the processing so that all the data needed for a computation is in memory at the same time?

It is important to realize that many algorithms cannot be easily expressed as a single MapReduce job. It is often necessary to break a complex algorithm down into a series of jobs, with the output of one job becoming the input of the next.
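The idea of chaining jobs can be illustrated with a small in-memory sketch. Nothing here uses the Hadoop API: `run_job` is a hypothetical helper that imitates the map, shuffle/sort, and reduce phases in plain Python, and the word-frequency problem is an assumed example that does not come from the text above. The first job counts words, and its output is fed unchanged into a second job that groups words by their count.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Imitate one MapReduce job in memory (hypothetical helper, not Hadoop)."""
    grouped = defaultdict(list)
    # Map phase: every input record may emit any number of key/value pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)  # shuffle: group values by key
    output = []
    # Sort phase, then reduce phase: one reducer call per distinct key.
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


# Job 1: count the occurrences of each word.
def count_mapper(_line_number, line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Job 2: group the words by how often they occur.
def by_count_mapper(word, count):
    yield count, word


def by_count_reducer(count, words):
    yield count, sorted(words)


def words_by_frequency(lines):
    # The output of job 1 becomes the input of job 2.
    job1_output = run_job(enumerate(lines), count_mapper, count_reducer)
    return run_job(job1_output, by_count_mapper, by_count_reducer)
```

Neither job alone can produce the final grouping, because the counts that serve as keys in the second job only exist once the first job has finished.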

This section describes several examples (from simple to complex) of designing different practical MapReduce applications. All examples are described as follows:

  • Brief description of the problem
  • MapReduce job description, including:

1. Mapper description

2. Reducer description

When each record can be processed independently, the MapReduce implementation is very simple: the only requirement is a mapper, which processes each record separately and then outputs the result. In this case, MapReduce controls the distribution of the mappers and provides all the support for scheduling and error handling. The following example shows how to design an application of this type.

Example of Face Recognition

Although it is not often discussed as a Hadoop-related problem, image processing fits the MapReduce paradigm well. Assume that a face recognition algorithm takes an image, identifies a set of desired features, and produces a set of recognition results. Now assume that faces must be recognized in millions of images. If all the images are stored in Hadoop as sequence files, a simple map-only job can be used to process them in parallel. In this example, the input key/value pair is ImageID/Image, and the output key/value pair is ImageID/list of recognized features. In addition, the set of recognizable features must be distributed to all mappers (for example, using the distributed cache).

Face recognition job

Mapper: In this job, the mapper is first initialized with the set of recognizable features. For each image, the map function invokes the face recognition algorithm, passing it the image itself along with the list of recognizable features. The recognition result is output from the map function together with the original image ID.
Result: The result of this job is the recognition of all the images contained in the original image set.
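The shape of such a map-only job can be sketched in plain Python. This is not Hadoop code: `recognize` is a hypothetical stand-in for a real face recognition algorithm (it only checks whether a feature name occurs in the image bytes), and `run_map_only` is an assumed helper that imitates how each input split is handled by its own mapper and produces its own output.

```python
def recognize(image, features):
    """Hypothetical stand-in for a real face recognition algorithm."""
    return [name for name in features if name.encode() in image]


class FaceRecognitionMapper:
    def __init__(self, features):
        # One-time initialization with the recognizable feature set
        # (in Hadoop this could be read from the distributed cache).
        self.features = features

    def map(self, image_id, image):
        # Input: ImageID/Image. Output: ImageID/list of recognized features.
        yield image_id, recognize(image, self.features)


def run_map_only(splits, mapper):
    """Process each split independently; return one output list per split."""
    outputs = []
    for split in splits:
        part = []
        for image_id, image in split:
            part.extend(mapper.map(image_id, image))
        outputs.append(part)
    return outputs
```

Because no record depends on any other, the splits could be processed in any order or in parallel without changing the result.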

Note: mappers and reducers must be completely independent, so each mapper or reducer in a MapReduce application creates its own output file. This means that the result of the face recognition job is a set of files (in the same directory), each containing the output of an individual mapper. If you need them in a single file, a single reducer must be added to the face recognition job. This reducer is very simple: because each key in the reduce input has exactly one value (assuming that image IDs are unique), the reducer only writes the input key/value pair to the output file. Although the reducer is extremely simple, adding it noticeably increases the overall running time of the job. This is because adding a reducer triggers shuffle and sort (which do not occur in map-only jobs), and these can take a significant amount of time when the number of images is very large.
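A rough in-memory illustration of what the extra reducer implies is given below. It is a sketch under assumptions, not Hadoop code: `merge_with_single_reducer` is a hypothetical helper, and the per-mapper outputs are modeled as plain lists. The point is that every record has to be gathered and sorted by key before the identity reducer can write it, which is work a map-only job never does.

```python
from itertools import groupby
from operator import itemgetter


def identity_reducer(key, values):
    # Image IDs are assumed unique, so each key carries exactly one value.
    for value in values:
        yield key, value


def merge_with_single_reducer(mapper_outputs):
    """Combine per-mapper outputs into one list via a single reducer."""
    # Shuffle: gather the records from every mapper output.
    records = [record for part in mapper_outputs for record in part]
    # Sort: order by key so that equal keys become adjacent.
    records.sort(key=itemgetter(0))
    merged = []
    for key, group in groupby(records, key=itemgetter(0)):
        values = (value for _, value in group)
        merged.extend(identity_reducer(key, values))
    return merged
```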

An example of a problem that requires the full MapReduce pipeline is building an inverted index. This type of problem requires all the MapReduce steps to be executed: shuffle and sort are needed to bring all the results together. The following example shows how to design an application of this type.

Example of inverted index

In computer science, an inverted index is a data structure that stores a mapping from content (such as words or numbers) to its locations in a document or a set of documents (see Tables 2-1 and 2-2). The purpose of an inverted index is to allow fast full-text search, at the cost of additional processing when a document is added. The inverted index data structure is a central component of a typical search engine, optimized for quickly finding the documents that contain a given word.

Document

ID    Title            Content
1     Popular          Football is Popular in US
2     Common Sport     Soccer is commonly played in Europe
3     National Sport   Cricket is played all over India
...   ...              ...

Table 2-1: document structure

Inverted index

Term      Value      Document   Document   Document
Title     Popular    1
Title     Sport      2          3
Title     Common     2
Title     National   3
Content   Football   1
Content   Is         1          2          3
Content   Popular    1
...       ...        ...        ...        ...

Table 2-2: inverted index

To create an inverted index, you can feed each document (or each line of a document) to the mapper. The mapper parses the words in the document and emits [word, word frequency] key/value pairs. The reducer can simply be an identity function that writes out the list, or it can perform a statistical aggregation for each word.
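The job described above can be sketched in plain Python using the Content column of Table 2-1 as input. This is an in-memory imitation, not Hadoop code, and it makes two assumptions that are not in the text: the mapper's value is a (document ID, frequency) tuple, so that the reducer knows which document each frequency belongs to, and words are lowercased before indexing. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def index_mapper(doc_id, text):
    # Emit one (word, (doc_id, frequency)) pair per distinct word.
    for word, frequency in Counter(text.lower().split()).items():
        yield word, (doc_id, frequency)


def index_reducer(word, postings):
    # Near-identity reducer: write out the list of postings for the word.
    yield word, sorted(postings)


def build_inverted_index(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents:
        for word, posting in index_mapper(doc_id, text):
            grouped[word].append(posting)  # shuffle: group postings by word
    index = {}
    for word in sorted(grouped):  # sort: reducers see keys in order
        for key, value in index_reducer(word, grouped[word]):
            index[key] = value
    return index


documents = [
    (1, "Football is Popular in US"),
    (2, "Soccer is commonly played in Europe"),
    (3, "Cricket is played all over India"),
]
index = build_inverted_index(documents)
```

Here `index["is"]` lists all three documents, matching the Content/Is row of Table 2-2. A reducer that performs statistical aggregation could, for example, sum the frequencies instead of listing them.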

Note: In chapter 9, you will learn
