For large-scale data processing, MapReduce embodies the following three basic design ideas.
1. Parallel processing of big data: divide and conquer
If a large dataset can be divided into data blocks that share the same computation and have no data dependencies among them, the most effective way to increase processing speed is to parallelize the computation with a "divide and conquer" strategy. MapReduce adopts exactly this design idea: data whose parts have no (or negligible) dependencies on one another is split according to some partitioning scheme, each data slice is processed by one node, and the partial results are finally merged into the overall result.
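The divide-and-conquer flow described above can be sketched in Python: split the input into independent chunks, process each chunk concurrently, then merge the partial results. The chunk size, the thread pool, and the summation workload are illustrative assumptions, not details from the text; a real MapReduce system would distribute the chunks across cluster nodes.

```python
# A minimal sketch of "divide and conquer": split, process in
# parallel, then merge. Assumes a simple summation workload.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for any per-chunk computation that has no
    # dependency on other chunks; here we just sum the numbers.
    return sum(chunk)

def parallel_sum(data, chunk_size=4):
    # Partition the data into independent slices.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each slice is processed by a separate worker (in a real
    # system, by a separate node).
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Final summary step: merge the per-chunk results.
    return sum(partial_results)

print(parallel_sum(list(range(1, 101))))  # 5050
```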
2. Rise to an abstract model: Map and Reduce
(1) Map and Reduce in the Lisp language
MapReduce draws on the design ideas of the functional programming language Lisp. Lisp is a list-processing language, a symbolic language used in artificial intelligence, invented in 1958 by MIT artificial intelligence pioneer and Turing Award winner John McCarthy.
Lisp defines various operations that can be applied to entire list or vector elements, for example:
(add #(1 2 3 4) #(4 3 2 1)) produces the result: #(5 5 5 5)
Lisp also provides Map and Reduce operations, such as:
(map 'vector #'+ #(1 2 3 4) #(4 3 2 1))
This adds the two vectors element-wise by mapping the addition operation over them, producing the same result #(5 5 5 5) as the add operation above.
Further, Lisp can define a reduce operation for some kind of merging, such as:
(reduce #'+ #(1 2 3 4)), which merges by addition and produces the cumulative result 10.
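The same two operations have direct analogues in Python's standard library, which may make the Lisp examples above more concrete: `map` applies an operation element-wise across the vectors, and `reduce` folds a sequence into a single accumulated value.

```python
# Python analogues of the Lisp map/reduce examples.
from functools import reduce
from operator import add

v1 = (1, 2, 3, 4)
v2 = (4, 3, 2, 1)

# (map 'vector #'+ #(1 2 3 4) #(4 3 2 1)) => #(5 5 5 5)
mapped = tuple(map(add, v1, v2))
print(mapped)  # (5, 5, 5, 5)

# (reduce #'+ #(1 2 3 4)) => 10
total = reduce(add, v1)
print(total)   # 10
```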
(2) Map and Reduce in MapReduce
To overcome the difficulty of writing parallel programs by hand, MapReduce borrows the ideas of the Lisp functional language and provides a high-level parallel programming abstraction model with two interfaces: Map and Reduce. A programmer only needs to implement these two basic interfaces to quickly complete the design of a parallel program.
Just as Lisp processes list data, MapReduce is designed to process data organized as sequences of elements/records. In practice, big data is often composed of a large set of repetitive data elements/records; for example, a web access log file consists of a large number of access-log records of the same form. Processing such sequential data elements/records is usually itself a sequential scan. Figure 1-13 depicts the process and characteristics of typical sequential big data processing:
MapReduce abstracts this process into two basic operations: the first two steps of the process become the Map operation, and the last two steps become the Reduce operation. The Map operation is responsible for applying the same processing repeatedly to a set of data records, while the Reduce operation is responsible for further collating the Map intermediate results and producing the final output. In this way, MapReduce provides an abstraction mechanism for the main processing operations in big data processing.
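The two-operation abstraction can be illustrated with the classic word-count example over log-like records. The function names and the in-memory grouping step are illustrative assumptions; in a real framework the grouping ("shuffle") and the distribution of work across nodes are handled by the system, not by user code.

```python
# A minimal word-count sketch of the Map/Reduce abstraction.
from collections import defaultdict

def map_fn(record):
    # Map: apply the same processing to each input record,
    # emitting intermediate (key, value) pairs.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Reduce: collate all intermediate values for one key
    # into a final result.
    return key, sum(values)

def run_mapreduce(records):
    # Group intermediate pairs by key; in a real system this
    # "shuffle" step is performed by the framework.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

logs = ["hello world", "hello mapreduce", "world of data"]
print(run_mapreduce(logs))  # {'hello': 2, 'world': 2, 'mapreduce': 1, 'of': 1, 'data': 1}
```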
3. Rise to a framework: a unified framework that hides system-layer details from programmers
MPI and other parallel computing approaches lack the support of a unified computing framework; programmers must handle many details themselves, such as data storage, partitioning, distribution, result collection, and error recovery. MapReduce, by contrast, provides a unified computing framework that hides most system-level processing details, allowing programmers to concentrate on the application and the algorithm itself rather than on system-layer details, which greatly reduces the burden of program development.
The main goal of MapReduce's unified computing framework is to automate parallel computation and hide system-level details from programmers. The unified framework automatically handles the following system-level processing:
1) Automatic partitioning and scheduling of computing tasks.
2) Automatic storage and partitioning of data.
3) Synchronization of data processing and computing tasks.
4) Collection of result data (sorting, combining, partitioning, etc.).
5) System communication, load balancing, and computing-performance optimization.
6) Detection of system node errors and failure recovery.