The principle and design idea of MapReduce


An interesting example that gives a simple explanation of the MapReduce algorithm

Suppose you want to count the number of spades in a deck of cards. The intuitive way is to go through the cards one by one and count how many of them are spades.

The MapReduce method is:

    1. Distribute the stack of cards to all the players present

    2. Have each player count how many spades are in his or her hand, and then report that number to you

    3. Add up the numbers that all the players report to get the final result.

Split

MapReduce combines two classic functions:

    • Mapping: apply the same operation to every item in a collection. For example, if you want to double every cell in a table, applying that function to each cell individually is a map.

    • Reducing: traverse the items in a collection and combine them into a single result. For example, producing the total of a column of numbers in that table is a reduce (see the sketch after this list).
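
To make the two roles concrete, here is a minimal sketch in plain Java (ordinary streams, not the Hadoop API; the numbers are made up for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<Integer> cells = Arrays.asList(3, 1, 4, 1, 5);

// "Mapping": apply the same operation (here, doubling) to every element.
List<Integer> doubled = cells.stream().map(x -> x * 2).collect(Collectors.toList());  // [6, 2, 8, 2, 10]

// "Reducing": traverse the elements and combine them into a single result (here, a sum).
int sum = doubled.stream().reduce(0, Integer::sum);  // 28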

Re-examine the example above

Going back to our original card-dealing example, we already have the basic approach of MapReduce data analysis. A friendly reminder: this is not a rigorous example. In this example, the people stand in for computers, and because they work at the same time, they form a cluster. In most practical applications we assume the data is already sitting on each computer; that is, handing out the cards is not a MapReduce step. (In fact, how files are stored across a computer cluster is the real core of Hadoop.)

By giving the cards to multiple players and letting them count simultaneously, you perform the operation in parallel, because every player counts at the same time. It also makes the job distributed, because many different people can work on the same problem without needing to know what their neighbours are doing.

By telling each player to count, you map a task over every card. You do not ask them to hand the spades back to you; instead, you have them reduce the thing you want to a single number.

Another interesting point is how evenly the cards are distributed. MapReduce assumes the data has been shuffled: if all the spades end up in one player's hand, counting that hand may be much slower than the others.

If there are enough people, it becomes fairly easy to ask more interesting questions, such as "what is the average value of the cards (as in blackjack/21)?". You can get the answer by combining the results of "what is the total value of all the cards" and "how many cards do we have", then dividing the total by the count to get the average.
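
A minimal sketch of that combination, assuming each player reports a partial sum of card values and a partial card count for his or her hand (the numbers are invented):

// Each element is one player's report: { sum of card values in hand, number of cards in hand }.
int[][] reports = { {95, 13}, {87, 13}, {101, 13}, {81, 13} };

long totalValue = 0, totalCards = 0;
for (int[] r : reports) {          // the final "reduce": merge the partial results
    totalValue += r[0];
    totalCards += r[1];
}
double average = (double) totalValue / totalCards;   // (95+87+101+81) / 52 = 364 / 52 = 7.0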

The mechanics of real MapReduce are much more complicated than this, but the main idea is the same: analyze a large amount of data by distributing the computation. Whether at Facebook, NASA, or a small startup, MapReduce is today's mainstream way of analyzing Internet-scale data.

Original article: http://www.cnblogs.com/archimedes/p/mapreduce-principle.html; please credit the source when reposting.

MapReduce in Hadoop

The basic concepts of MapReduce for large-scale data processing, at three levels

How to approach big data processing: divide and conquer

For big data whose parts have no computational dependencies on one another, the most natural way to achieve parallelism is divide and conquer.

Rising to an abstract model: Mapper and Reducer

Parallel computing approaches such as MPI lack a high-level parallel programming model. To overcome this shortcoming, MapReduce borrows ideas from the Lisp functional language and provides a high-level parallel programming abstraction built on two functions: map and reduce.

Rising to a framework: unify the architecture and hide system-level details from the programmer

Parallel computing approaches such as MPI also lack the support of a unified computing framework, leaving programmers to handle data storage, partitioning, distribution, result collection, error recovery and many other details. MapReduce therefore designs and provides a unified computing framework that hides most of these system-level processing details from the programmer.

1. Approaching big data processing: divide and conquer

Which kinds of computing tasks can be computed in parallel?

The first important question in parallel computing is how to divide the computational task, or the data, so that the sub-tasks or data blocks can be computed simultaneously. But some computational problems simply cannot be divided!

Nine women cannot have a baby in one month!

For example, the Fibonacci sequence: F(k+2) = F(k) + F(k+1)

There is a strong dependency between successive data items, so it can only be computed serially!

Conclusion: computing tasks that cannot be divided, or data with interdependencies, cannot be computed in parallel!

Parallel computing of Big data

If a big data set can be divided into data blocks that go through the same computation and have no data dependencies between them, parallel computing is the best way to improve processing speed.

For example, suppose there is a huge 2-D array to process (say, taking the cube root of each element), where every element is treated the same way and there are no dependencies between elements. You can then consider different partitioning schemes to divide it into sub-arrays that are processed in parallel by a set of processors, as in the sketch below.
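
A rough sketch of one such partitioning in Java, splitting the array by rows and letting a parallel stream process the row blocks independently (the array size and the per-element operation are placeholders):

import java.util.Arrays;
import java.util.stream.IntStream;

double[][] data = new double[1000][1000];        // placeholder 2-D data, filled elsewhere
double[][] result = new double[data.length][];

// Each row has no dependency on any other row, so the rows can be processed in parallel.
IntStream.range(0, data.length).parallel().forEach(i ->
    result[i] = Arrays.stream(data[i]).map(Math::cbrt).toArray()
);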

2. Building the abstract model: map and reduce

Design ideas borrowed from the functional language Lisp

• The functional programming language Lisp is a list-processing language, a symbolic language applied in artificial intelligence, designed and invented in 1958 by MIT AI expert and Turing Award winner John McCarthy.

• Lisp defines operations that process a list as a whole, for example:

(add #(1 2 3 4) #(4 3 2 1)) produces the result #(5 5 5 5)

• Lisp also provides map and reduce operations, for example:

(map 'vector #'+ #(1 2 3 4 5) #(10 11 12 13 14))

adds the two vectors element by element by mapping the addition operation, producing the result #(11 13 15 17 19)

(reduce #'+ #(11 13 15 17 19)) merges the values by addition, producing the cumulative result 75

Map: perform the same repetitive processing on a set of data elements

Reduce: perform some further combination on the intermediate results produced by map

Key idea: provide an abstraction mechanism for these two main processing operations in large-scale data processing

Abstract description of map and reduce operations in MapReduce

Drawing on the ideas of the functional programming language Lisp, MapReduce defines the following two abstract programming interfaces, map and reduce, which the user programs against and implements:

• map: (k1; v1) → [(k2; v2)]

Input: data represented as key-value pairs (k1; v1)

Processing: a document data record (such as a line in a text file, or a row in a data table) is passed into the map function as a key-value pair; the map function processes these pairs and outputs a set of intermediate key-value pairs [(k2; v2)] of a different type

Output: a set of intermediate data represented as key-value pairs [(k2; v2)]

• reduce: (k2; [v2]) → [(k3; v3)]

Input: the key-value pairs [(k2; v2)] output by map are merged so that the different values under the same key are combined into a list [v2]; the input of reduce is therefore (k2; [v2])

Processing: sorts or further processes the incoming list of intermediate results and produces the final results in some form [(k3; v3)]

Output: the final output [(k3; v3)]

Map and reduce give the programmer a clear abstract description of the operational interface

• Each map function processes its data in parallel, producing different intermediate results from different input data.

• Each reduce also computes in parallel, each one handling a different group of intermediate results. Before reduce processing can begin, all map functions must have finished, so a synchronization barrier is needed before entering the reduce phase; this stage is also responsible for collecting and collating (aggregation & shuffle) the intermediate results of the maps, so that reduce can compute the final result more efficiently.

• The final result is obtained by aggregating the outputs of all the reduces.
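
A rough sketch, in plain Java, of what this aggregation & shuffle step amounts to: intermediate (k2, v2) pairs from all maps are grouped by key into (k2, [v2]) lists before any reduce runs (the sample pairs are invented):

import java.util.*;

// Intermediate (k2, v2) pairs emitted by the map tasks (example values).
List<Map.Entry<String, Integer>> intermediate = Arrays.asList(
    new AbstractMap.SimpleEntry<>("good", 1),
    new AbstractMap.SimpleEntry<>("weather", 1),
    new AbstractMap.SimpleEntry<>("good", 1)
);

// The shuffle/aggregation barrier: group values by key into (k2, [v2]).
Map<String, List<Integer>> grouped = new HashMap<>();
for (Map.Entry<String, Integer> kv : intermediate) {
    grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
}
// grouped = {good=[1, 1], weather=[1]}  -- each entry becomes the input of one reduce call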

A MapReduce-based processing example - document word frequency statistics: WordCount

There are 4 sets of original text data:

Text 1: the weather is good
Text 2: today is good
Text 3: good weather is good
Text 4: today has good weather

Traditional serial processing mode (Java):

import java.util.Hashtable;
import java.util.Iterator;
import java.util.StringTokenizer;

String[] text = new String[] {
    "the weather is good",
    "today is good",
    "good weather is good",
    "today has good weather"
};
Hashtable<String, Integer> ht = new Hashtable<String, Integer>();
for (int i = 0; i < text.length; ++i) {
    StringTokenizer st = new StringTokenizer(text[i]);
    while (st.hasMoreTokens()) {
        String word = st.nextToken();
        if (!ht.containsKey(word)) {
            ht.put(word, Integer.valueOf(1));
        } else {
            int wc = ht.get(word).intValue() + 1;   // count plus 1
            ht.put(word, Integer.valueOf(wc));
        }
    }
}
for (Iterator<String> itr = ht.keySet().iterator(); itr.hasNext(); ) {
    String word = itr.next();
    System.out.print(word + ":" + ht.get(word) + "; ");
}

Output: good:5; has:1; is:3; the:1; today:2; weather:3


MapReduce processing methods

Use 4 map nodes:

Map Node 1:

Input: (Text1, "The weather is good")

Output: (the, 1), (weather, 1), (is, 1), (good, 1)

Map Node 2:

Input: (Text2, "Today is good")

Output: (Today, 1), (is, 1), (good, 1)

Map Node 3:

Input: (Text3, "Good weather is good")

Output: (good, 1), (weather, 1), (is, 1), (good, 1)

Map Node 4:

Input: (Text3, "Today has good weather")

Output: (Today, 1), (has, 1), (good, 1), (weather, 1)

Use 3 reduce nodes:
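
One possible assignment of the six intermediate keys to the 3 reduce nodes (a sketch only; the actual assignment depends on the partition function):

Reduce Node 1:

Input: (good, [1, 1, 1, 1, 1]), (has, [1])

Output: (good, 5), (has, 1)

Reduce Node 2:

Input: (is, [1, 1, 1]), (the, [1])

Output: (is, 3), (the, 1)

Reduce Node 3:

Input: (today, [1, 1]), (weather, [1, 1, 1])

Output: (today, 2), (weather, 3)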


MapReduce pseudo-code (implementing the two functions map and reduce):

class Mapper
    method map(String input_key, String input_value):
        // input_key:   text document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

class Reducer
    method reduce(String output_key, Iterator intermediate_values):
        // output_key:          a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(output_key, result);
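
For comparison, a minimal sketch of the same two functions written against the Hadoop Java API (org.apache.hadoop.mapreduce); the class names are made up and the job setup / driver code is omitted:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, ONE);        // emit the intermediate pair (word, 1)
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                  // add up all counts for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
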
3. Rising to a framework - automatic parallelization and hiding of low-level details

How to provide a unified computing framework

MapReduce provides a unified computing framework that takes care of:

• partitioning and scheduling of computing tasks

• distributed storage and partitioning of data

• synchronization between data and computing tasks

• collection and collation of result data (sorting, combining, partitioning, ...)

• system communication, load balancing, and optimization of computational performance

• detection of node errors and failure recovery

The biggest highlight of MapReduce

• The abstract model and the computing framework separate what needs to be done from how it is actually done, providing the programmer with an abstract, high-level programming interface and framework.

• Programmers only need to care about the computational problems of their own application layer, and only need to write a small amount of code to handle the application's own computation.

• How the parallel computation is actually carried out, together with the many related system-level details, is hidden and handed over to the computing framework: from distributing and executing the code, to automatically scheduling clusters ranging from a single node up to thousands of nodes.

The main functions provided by MapReduce

• Task scheduling: a submitted job is divided into a number of computing tasks. The task scheduler is responsible for allocating these tasks to compute nodes (map nodes or reduce nodes) and scheduling their execution; it also monitors the execution status of these nodes and controls the synchronization barrier of the map nodes. In addition it handles some performance optimizations, such as running multiple backup copies of the slowest tasks and taking the fastest one to finish as the result.

• Data/code co-location: to reduce data communication, a basic principle is localized data processing (locality), i.e. a compute node processes the data stored on its own local disk whenever possible, which amounts to migrating code to the data. Only when such localized processing is not possible does the framework look for other available nodes and move the data over the network to them (migrating data to the code), preferring available nodes in the same rack as the data in order to reduce communication latency.

• Error handling: in a large-scale MapReduce cluster built from low-end commodity servers, hardware failures (host, disk, memory, etc.) and software bugs are the norm, so MapReduce must be able to detect and isolate faulty nodes and schedule new nodes to take over their computing tasks.

• Distributed data storage and file management: massive data processing needs the support of a good distributed data storage and file management system, one that can store huge amounts of data on the local disks of the nodes while logically keeping it as a single complete data file. To provide fault tolerance for data storage, the file system also manages multiple backup copies of each data block.

• Combiner and Partitioner: to reduce data communication overhead, intermediate results are merged (combined) before they reach the reduce nodes, so that data with the same key is not transmitted repeatedly; and because the data handled by one reduce node may come from several map nodes, the intermediate results emitted by the map nodes are partitioned using some strategy (the partitioner) to guarantee that related data is sent to the same reduce node. A sketch follows below.
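
As a sketch of how these two hooks are typically wired up with the Hadoop Java API (reusing the hypothetical WordCount classes sketched earlier; a combiner with the same logic as the reducer merges pairs such as (good, 1), (good, 1) into (good, 2) locally before the data leaves the map node):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A partitioner decides which reduce node receives each intermediate key.
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the word so that all pairs with the same key go to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// During job setup (sketch):
// job.setCombinerClass(WordCountReducer.class);   // local merge on each map node
// job.setPartitionerClass(WordPartitioner.class); // route keys to reduce nodes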

A parallel computing model based on map and reduce

4. Main design ideas and features of MapReduce

1. Scale out horizontally, rather than scale up vertically (scale "out", not "up")

That is, MapReduce clusters are built from inexpensive, easy-to-expand low-end commodity servers rather than from expensive, hard-to-expand high-end servers (SMP).

• The low-end server market overlaps with the high-volume desktop PC market, so price competition, interchangeable parts and economies of scale keep low-end server prices low.

• Based on TPC-C benchmark results from the end of 2007, a low-end server platform offers roughly 4 times the cost-performance of a high-end shared-memory server platform, and in terms of external storage price the low-end servers are roughly 12 times more cost-effective.

• For large-scale data processing, with its enormous storage requirements, a cluster of low-end servers is clearly far superior to a cluster of high-end servers, which is why MapReduce parallel computing clusters are built on low-end servers.

2. Failure is considered normal (assume failures are common)

MapReduce clusters use large numbers of low-end servers (Google reportedly runs over a million server nodes worldwide), so node hardware failures and software errors are the norm. Therefore:

• A well-designed, fault-tolerant parallel computing system must not let node failures affect the quality of the computing service. The failure of any node must not lead to inconsistent or indeterminate results; when a node fails, the other nodes should seamlessly take over its computing tasks; and when the failed node recovers, it should automatically and seamlessly rejoin the cluster, without the administrator having to reconfigure the system manually.

• The MapReduce parallel computing framework uses a variety of effective mechanisms, such as automatic node restart, to make the cluster and the computing framework robust against node failures and to handle the detection and recovery of failed nodes effectively.

3. Move the processing to the data (moving processing to the data)

• Traditional high-performance computing systems usually connect many processor nodes to shared external storage nodes, such as disk arrays attached through a storage area network (SAN), so at large data scales the I/O access to these external storage files becomes the bottleneck that limits system performance. To reduce data communication overhead in a large-scale data-parallel system, instead of moving the data to the processing nodes (data-to-processor or data-to-code migration), one should consider moving the processing toward the data. MapReduce uses data/code co-location: a compute node first processes the data stored on its own disk, exploiting data locality, and only when a node cannot process its local data does the framework, following a nearest-first principle, find another available compute node and move the data to it.

4. Process data sequentially and avoid random access (process data sequentially and avoid random access)

• The nature of large-scale data processing means that the huge number of records cannot all be held in memory and can usually only be processed on external storage. The performance difference between sequential access and random access to disk is enormous.

Example: a database of 10 billion (10^10) data records (100 B per record, 1 TB in total)

Updating 1% of the records (which requires random access) takes about 1 month, whereas sequentially reading and rewriting all the data records takes only about 1 day!
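
A rough back-of-the-envelope check of those figures, under assumed hardware numbers (about 10 ms per random disk access and about 100 MB/s sequential throughput; both values are assumptions, not from the original):

• Random update of 1%: 10^10 × 1% = 10^8 records; 10^8 × 10 ms = 10^6 s, roughly 11.6 days of pure seeking, and closer to a month once each record must also be read back and rewritten.

• Sequential rewrite of everything: reading and then writing 1 TB at 100 MB/s is about 2 × 10^4 s, roughly 5.6 hours, well within a day.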

• MapReduce is designed as a parallel computing system for batch processing of large data sets; all computations are organized into long, streaming operations so as to exploit the high aggregate transfer bandwidth of the disks distributed over the many nodes of the cluster.

5. Hide system-level details from the application developer

• In the software engineering literature, professional programmers consider writing programs difficult because the programmer has to remember too many programming details (from variable names to the boundary cases of complex algorithms), which places a huge cognitive burden on memory and demands a high degree of concentration. Parallel programming is harder still: complex details such as synchronization between threads must be considered, and because of the unpredictability of concurrent execution, debugging is very difficult. In large-scale data processing, programmers would additionally have to handle details such as distributed data storage and management, data distribution, data communication and synchronization, and the collection of computation results. MapReduce provides an abstraction that isolates the programmer from these system-level details: the programmer only has to describe what to compute, while how to compute it is handled by the system's execution framework. This frees programmers from system-level details and lets them concentrate on the algorithmic design of their own application's computational problem.

6. Smooth and seamless scalability (seamless scalability)

This mainly covers two kinds of extensibility: scaling of the data and scaling of the system.

• An ideal software algorithm should remain effective as the data scale grows, with any decrease in performance proportional to the growth of the data; on the cluster scale, the computational performance of the algorithm should grow close to linearly with the number of nodes.

• Most existing single-machine algorithms do not meet these ideal requirements; single-machine algorithms that keep intermediate results in memory quickly break down in large-scale data processing, and moving from a single machine to large-scale cluster-based parallel computing fundamentally requires completely different algorithm designs.

• Remarkably, MapReduce achieves the ideal extensibility described above: several studies have found that the computational performance of MapReduce-based processing can grow approximately linearly with the number of nodes.

