My understanding of MapReduce


MapReduce: Simplified Data Processing on Large Clusters

Abstract: This paper can be regarded as the opening work on MapReduce. Overall, the content of the paper is fairly simple; it really just introduces the idea of MapReduce. The idea is simple, but it is still hard to come up with directly, and a simple idea is often difficult to implement well. The purpose of MapReduce is to give users (including users with no parallel-programming experience) a simple interface, which means many hard problems fall to the system instead: how to balance load, how to schedule the work, and how to handle machine failures. The authors are quite cool about this and do not describe those details at length in the paper; they only give a rough picture. Below I explain my understanding of the paper.

The execution of MapReduce is split into two phases, map and reduce. This is actually a bit like merge sort, in that it follows the divide-and-conquer principle: map is responsible for spreading a large amount of data across different hosts for computation (generally by hashing), and reduce sorts the intermediate results from the different hosts and outputs them.

For the user, MapReduce adopts the idea of functional programming: as long as the user provides the map and reduce functions, the rest is handed over to the system. The paper gives a simple example, counting the number of times each word appears in a collection of documents:
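A minimal Python sketch of that example (the paper's pseudocode is C++-style and emits the string "1"; the names wc_map and wc_reduce here are mine):

```python
def wc_map(doc_name, contents):
    # Map: for every word in the document, emit the pair (word, 1).
    for word in contents.split():
        yield (word, 1)

def wc_reduce(word, counts):
    # Reduce: sum all the partial counts emitted for one word.
    return (word, sum(counts))
```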


Here, the map function emits a count of 1 for each occurrence of each word in the input document and stores these pairs in intermediate files; the reduce function sums, for each word, the counts read from all the intermediate files and outputs the total. Used this way, it does look very simple.

Well, after this brief introduction, let's look at the overall execution flow of the system as the authors describe it in the paper (a toy code walk-through follows the list):


1. MapReduce automatically splits the user's input files into M pieces, and then starts many copies of the program on different machines in the cluster;

2. Among these machines, one runs as the master and the rest are called workers. The master is responsible for assigning tasks to the workers; in total there are M map tasks and R reduce tasks to assign. (PS: this actually reminds me of MPI programming, which is similar: MPI generally has a receiving host that collects the results computed by the other machines, although MPI can also be run without one.)

3. A worker assigned a map task performs the map computation and buffers the intermediate results in memory;

4. Periodically, the buffered results are written to local disk, partitioned into R regions (R is specified by the user). The locations of these intermediate results are passed back to the master, which is responsible for forwarding them to the reduce workers (indeed similar to message passing in MPI);

5. When a reduce worker is notified, it uses remote procedure calls to read the buffered data from the map workers' local disks. After reading all the data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the data is too large to fit in memory, an external sort is used;

6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of intermediate values to the user-defined reduce function;

7. When all the map and reduce tasks are completed, the master wakes up the user program.
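To make steps 1-7 concrete, here is a toy single-process walk-through using the word-count functions above. This is my own sketch, not the paper's implementation: real MapReduce runs map and reduce tasks on different machines, while this simulation keeps the R regions as in-memory lists and uses hash(key) % R as a default-style partitioning function.

```python
from itertools import groupby

def run_mapreduce(docs, map_fn, reduce_fn, R=2):
    # Steps 3-4: run one map task per document and partition the
    # intermediate pairs into R regions with hash(key) % R.
    regions = [[] for _ in range(R)]
    for name, contents in docs.items():
        for key, value in map_fn(name, contents):
            regions[hash(key) % R].append((key, value))
    # Steps 5-6: each reduce task sorts its region by key, groups equal
    # keys together, and passes (key, [values]) to the user's reduce.
    outputs = []
    for region in regions:
        region.sort(key=lambda kv: kv[0])
        outputs.append([reduce_fn(key, [v for _, v in group])
                        for key, group in groupby(region, key=lambda kv: kv[0])])
    return outputs  # R partial results, like the R output files

# Example: two tiny "documents" as input.
print(run_mapreduce({"d1": "a b a", "d2": "b c"}, wc_map, wc_reduce))
```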

After all the operations are completed, the output is available in R files. Generally there is no need to merge these R files, because they are often used as the input of another MapReduce operation, or passed to some other distributed application that can handle partitioned input.

In the above framework, several problems need to be addressed:

1. Master Data Structure:

The master stores the state of each map and reduce task and the identity of its worker, along with the locations of the intermediate files produced by completed map tasks.
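As a sketch, the per-task state might look like the following; this shape is my guess, since the paper only says the master tracks task state, worker identity, and the locations and sizes of the intermediate file regions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # idle -> in-progress -> completed
    worker: Optional[str] = None  # machine currently running the task
    # For completed map tasks: location of each of the R intermediate
    # regions on the map worker's local disk, to forward to reducers.
    region_files: list = field(default_factory=list)
```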

2. Fault Tolerance

Worker failure:

The master pings every worker periodically. If no reply is received within a certain amount of time, the worker is marked as failed. Map tasks completed by a failed worker must be re-executed, because their intermediate results live on that machine's local disk and become unreachable. Completed reduce tasks, by contrast, do not need to be re-executed, because their output is stored in the global file system.
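A hedged sketch of this rule, reusing the Task shape above; last_ping and tasks are hypothetical fields of a master object:

```python
import time

def handle_failures(master, timeout=10.0):
    now = time.time()
    for worker, last_seen in master.last_ping.items():
        if now - last_seen <= timeout:
            continue  # worker still answering pings
        for task in master.tasks:
            if task.worker != worker:
                continue
            # Completed reduce output lives in the global file system
            # and survives; everything else on this worker must rerun.
            if task.kind == "reduce" and task.state == "completed":
                continue
            task.state, task.worker = "idle", None  # reschedule
```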

Master failure:

The master writes a checkpoint at regular intervals. If the master fails, a new copy can be started from the last checkpoint.
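For illustration only (the paper does not spell out the mechanism), checkpointing could be as simple as periodically serializing the task table:

```python
import pickle

def checkpoint(master, path="master.ckpt"):
    # Snapshot the master's task table; a restarted master can load
    # the latest snapshot instead of beginning from scratch.
    with open(path, "wb") as f:
        pickle.dump(master.tasks, f)

def restore(path="master.ckpt"):
    with open(path, "rb") as f:
        return pickle.load(f)
```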

3. Locality

MapReduce conserves network bandwidth by exploiting the GFS file system: the master tries to schedule each map task on (or near) a machine that already holds a replica of the corresponding input data, so most input is read locally rather than transferred over the network.

4. Task Granularity

M and R should be much larger than the number of worker machines; having every worker run many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails.

5. Backup Tasks

The system does not simply wait until every map or reduce task has finished, because a single slow host (one with a slow disk, for example) can hold up the entire job. Instead, when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks, and a task is marked completed as soon as either the primary or the backup execution finishes. This can greatly improve efficiency.
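A sketch of the backup-task idea under the same hypothetical master object; idle_workers is an assumed pool of free machines:

```python
def schedule_backups(master):
    # Near the end of a phase only a few tasks remain in progress;
    # launch a second copy of each on an idle worker. A task counts
    # as completed as soon as either copy finishes.
    remaining = [t for t in master.tasks if t.state == "in-progress"]
    if remaining and len(remaining) <= len(master.idle_workers):
        for task, worker in zip(remaining, master.idle_workers):
            worker.run(task)  # hypothetical duplicate execution
```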

Basically, that is MapReduce. In the paper the authors also introduce the partitioning function, the ordering guarantee, and combiner functions. I think the idea is really very good. Even if we never use MapReduce itself, we may well have to process huge numbers of records, and then we can apply the MapReduce idea ourselves.
