For large-scale data processing, MapReduce embodies the following three basic design ideas.
1. Parallel processing of big data: divide and conquer
If a large dataset can be divided into data blocks that share the same computation and have no data dependencies among them, the most effective way to increase processing speed is to parallelize the computation with a "divide and conquer" strategy. MapReduce adopts exactly this design idea: data whose parts have no (or negligible) dependencies on one another is split according to some partitioning scheme, each data slice is processed by one node, and the partial results are finally merged into the overall result.
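The divide-and-conquer flow described above can be sketched in Python: split the input into independent chunks, process each chunk concurrently, then merge the partial results. The chunk size, the thread pool, and the summation workload are illustrative assumptions, not details from the text; a real MapReduce system would distribute the chunks across cluster nodes.

```python
# A minimal sketch of "divide and conquer": split, process in
# parallel, then merge. Assumes a simple summation workload.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for any per-chunk computation that has no
    # dependency on other chunks; here we just sum the numbers.
    return sum(chunk)

def parallel_sum(data, chunk_size=4):
    # Partition the data into independent slices.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each slice is processed by a separate worker (in a real
    # system, by a separate node).
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Final summary step: merge the per-chunk results.
    return sum(partial_results)

print(parallel_sum(list(range(1, 101))))  # 5050
```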
2. Rise to an abstract model: Map and Reduce
(1) Map and Reduce in the Lisp language
MapReduce draws on the design ideas of the functional programming language Lisp. Lisp is a list-processing language, a symbolic language used in artificial intelligence, invented in 1958 by MIT artificial intelligence pioneer and Turing Award winner John McCarthy.
Lisp defines various operations that can be applied to entire list or vector elements, for example:
(add #(1 2 3 4) #(4 3 2 1)) produces the result: #(5 5 5 5)
Lisp also provides Map and Reduce operations, such as:
(map 'vector #'+ #(1 2 3 4) #(4 3 2 1))
This adds the two vectors element-wise by mapping the addition operation over them, producing the same result #(5 5 5 5) as the add operation above.
Further, Lisp can define a reduce operation for some kind of merging, such as:
(reduce #'+ #(1 2 3 4)), which merges by addition and produces the cumulative result 10.
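The same two operations have direct analogues in Python's standard library, which may make the Lisp examples above more concrete: `map` applies an operation element-wise across the vectors, and `reduce` folds a sequence into a single accumulated value.

```python
# Python analogues of the Lisp map/reduce examples.
from functools import reduce
from operator import add

v1 = (1, 2, 3, 4)
v2 = (4, 3, 2, 1)

# (map 'vector #'+ #(1 2 3 4) #(4 3 2 1)) => #(5 5 5 5)
mapped = tuple(map(add, v1, v2))
print(mapped)  # (5, 5, 5, 5)

# (reduce #'+ #(1 2 3 4)) => 10
total = reduce(add, v1)
print(total)   # 10
```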
(2) Map and Reduce in MapReduce
To overcome the difficulty of writing parallel programs by hand, MapReduce borrows the ideas of the Lisp functional language and provides a high-level parallel programming abstraction model with two interfaces: Map and Reduce. A programmer only needs to implement these two basic interfaces to quickly complete the design of a parallel program.
Just as Lisp processes list data, MapReduce is designed to process data organized as sequences of elements/records. In practice, big data is often composed of a large set of repetitive data elements/records; for example, a web access log file consists of a large number of access-log records of the same form. Processing such sequential data elements/records is usually itself a sequential scan. Figure 1-13 depicts the process and characteristics of typical sequential big data processing:
MapReduce abstracts this process into two basic operations: the first two steps of the process become the Map operation, and the last two steps become the Reduce operation. The Map operation is responsible for applying the same processing repeatedly to a set of data records, while the Reduce operation is responsible for further collating the Map intermediate results and producing the final output. In this way, MapReduce provides an abstraction mechanism for the main processing operations in big data processing.
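The two-operation abstraction can be illustrated with the classic word-count example over log-like records. The function names and the in-memory grouping step are illustrative assumptions; in a real framework the grouping ("shuffle") and the distribution of work across nodes are handled by the system, not by user code.

```python
# A minimal word-count sketch of the Map/Reduce abstraction.
from collections import defaultdict

def map_fn(record):
    # Map: apply the same processing to each input record,
    # emitting intermediate (key, value) pairs.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Reduce: collate all intermediate values for one key
    # into a final result.
    return key, sum(values)

def run_mapreduce(records):
    # Group intermediate pairs by key; in a real system this
    # "shuffle" step is performed by the framework.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

logs = ["hello world", "hello mapreduce", "world of data"]
print(run_mapreduce(logs))  # {'hello': 2, 'world': 2, 'mapreduce': 1, 'of': 1, 'data': 1}
```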
3. Rise to a framework: a unified framework that hides system-layer details from programmers
MPI and other parallel computing approaches lack the support of a unified computing framework; programmers must handle many details themselves, such as data storage, partitioning, distribution, result collection, and error recovery. MapReduce, by contrast, provides a unified computing framework that hides most system-level processing details, allowing programmers to concentrate on the application and the algorithm itself rather than on system-layer details, which greatly reduces the burden of program development.
The main goal of MapReduce's unified computing framework is to automate parallel computation and hide system-level details from programmers. The unified framework automatically handles the following system-level processing:
1) Automatic partitioning and scheduling of computing tasks.
2) Automatic storage and partitioning of data.
3) Synchronization of data processing and computing tasks.
4) Collection of result data (sorting, combining, partitioning, etc.).
5) System communication, load balancing, and computing-performance optimization.
6) Detection of system node errors and failure recovery.