MapReduce's Main Functions and Technical Characteristics


1. MapReduce's Main Functions

Using an abstract model and computational framework, MapReduce separates what needs to be computed from how to compute it, and provides programmers with an abstract, high-level programming interface and framework. Programmers need only care about the application-level computational problem and write a small amount of program code to solve it; the many system-level details of completing the parallel computation are hidden and handled by the framework, from distributing the code to automatically scheduling clusters ranging from a handful of nodes to thousands.
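This separation of "what" from "how" can be illustrated with a minimal, in-process sketch (not the Hadoop API): the programmer supplies only `map_fn` and `reduce_fn`, while a hypothetical framework function `run_mapreduce` hides the grouping and scheduling details.

```python
from collections import defaultdict

def map_fn(line):
    # Programmer-supplied map: emit (word, 1) for every word in one input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Programmer-supplied reduce: sum all counts for a single word.
    return (word, sum(counts))

def run_mapreduce(lines):
    # "Framework" side, hidden from the programmer.
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["to be or not to be"])
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the shuffle and the per-key reduce calls run distributed across many nodes, but the programmer's two functions stay exactly this simple.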

MapReduce provides the following key features:

1) Data partitioning and task scheduling: The system automatically divides a job's input data into many data blocks, each corresponding to one compute task, and automatically schedules compute nodes to process the corresponding blocks. The job and task scheduler is responsible for allocating and scheduling compute nodes (Map nodes or Reduce nodes), monitoring the execution status of those nodes, and controlling the synchronization performed by the Map nodes.
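A toy sketch of this partitioning and scheduling step (the block size, node names, and round-robin policy are illustrative assumptions, not how any particular scheduler works):

```python
def split_into_blocks(records, block_size):
    # Divide the job's input into fixed-size data blocks;
    # each block becomes one map task.
    return [records[i:i + block_size]
            for i in range(0, len(records), block_size)]

def schedule(blocks, nodes):
    # Toy round-robin scheduler: assign each block (task) to a compute node.
    return {task_id: nodes[task_id % len(nodes)]
            for task_id in range(len(blocks))}

blocks = split_into_blocks(list(range(10)), 3)
plan = schedule(blocks, ["node-A", "node-B"])
# 4 blocks; tasks alternate between node-A and node-B
```

A production scheduler would also weigh data locality and node load, as the next point describes.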

2) Data/code co-location: To reduce data communication, a basic principle is to process data locally; that is, a compute node processes, as far as possible, the data stored on its own local disk, which amounts to migrating code to the data. Only when such localized processing is impossible does the system look for other available nodes and send the data over the network to them (migrating data to the code); even then, it tries to find an available node in the same rack as the data, to reduce communication delay.

3) System optimization: To reduce data communication overhead, intermediate results are merged before being sent to the Reduce nodes. Since the data processed by one Reduce node may come from multiple Map nodes, the intermediate results output by the Map nodes must be partitioned by an appropriate strategy so that all related data (records sharing the same key) are sent to the same Reduce node, avoiding dependency problems during the Reduce phase. In addition, the system performs performance optimizations such as launching backup executions of the slowest tasks and taking the result of whichever copy finishes first.
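The "same key to the same Reduce node" guarantee is typically achieved by hash partitioning. A minimal sketch (using `zlib.crc32` rather than Python's built-in `hash()`, which is salted per process and so not stable across nodes):

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash partitioning: every intermediate record with the
    # same key maps to the same reducer index, on every node, every run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# All occurrences of "apple", wherever they were mapped, land on one reducer.
r = partition("apple", 4)
```

Hadoop's default partitioner works the same way in spirit: reducer index = hash of key modulo number of reducers.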

4) Error detection and recovery: In a large-scale MapReduce cluster built from low-end commodity servers, hardware errors (host, disk, memory, etc.) and software errors are the norm. MapReduce therefore needs to detect and isolate faulty nodes and schedule new nodes to take over the computing tasks of the failed ones. The system also maintains the reliability of data storage, using a redundant-replica mechanism, and detects and recovers erroneous data in a timely fashion.
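The take-over behavior can be sketched as a retry loop that isolates a failing node and reschedules the task elsewhere (node names, the `RuntimeError` convention, and `max_attempts` are all illustrative assumptions):

```python
def run_with_recovery(task, nodes, max_attempts=3):
    # Try the task on successive nodes; if a node "fails" (raises),
    # record it as faulty and reschedule the task on a fresh node.
    failed_nodes = []
    for node in nodes[:max_attempts]:
        try:
            return task(node), failed_nodes
        except RuntimeError:
            failed_nodes.append(node)  # isolate the faulty node, move on
    raise RuntimeError("task failed on all attempted nodes")

def flaky_task(node):
    if node == "node-A":               # simulate a hardware fault on node-A
        raise RuntimeError("disk error")
    return "done on " + node

result, failed = run_with_recovery(flaky_task, ["node-A", "node-B", "node-C"])
# result == "done on node-B", failed == ["node-A"]
```

Real frameworks detect failures via missed heartbeats rather than exceptions, but the recovery logic is the same: mark the node bad, rerun its tasks elsewhere.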

2. MapReduce's Main Technical Characteristics

The design of MapReduce has the following main technical characteristics:

1) Scale "out" horizontally rather than "up" vertically

That is, a MapReduce cluster is built entirely from low-cost, easily extensible, low-end commodity servers rather than from expensive, hard-to-scale high-end servers. For large-scale data processing, which requires vast amounts of data storage, clusters of low-end servers are clearly far more cost-effective than clusters of high-end servers, which is why MapReduce parallel computing clusters are built on low-end server implementations.

2) Failures are treated as normal

Because a MapReduce cluster uses a large number of low-end servers, hardware failures and software errors on nodes are the norm. A well-designed, highly fault-tolerant parallel computing system therefore must not let node failures compromise the quality of service or introduce inconsistency or non-determinism into results. When any node fails, other nodes should seamlessly take over its computing tasks; when a failed node recovers, it should rejoin the cluster automatically and seamlessly, without the administrator having to reconfigure the system manually. The MapReduce parallel computing framework uses a variety of effective error detection and recovery mechanisms, such as automatic node restart, making the cluster and computing framework robust to node failures and able to detect and recover from them effectively.

3) Move processing to the data

Traditional high-performance computing systems usually connect many processor nodes to shared external storage nodes, such as a disk array attached over a storage area network (SAN). As a result, during large-scale data processing, file I/O against that external storage becomes a bottleneck that limits system performance. To reduce data communication overhead in a large-scale data-parallel computing system, rather than sending data to the processing nodes (migrating data to the processor or code), processing should be moved as close to the data as possible. MapReduce adopts a data/code co-location approach: a compute node processes the data stored locally whenever it can, exploiting data locality. Only when a node cannot process its local data does the system, following a nearest-first principle, find another available compute node and transfer the data to it.
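The nearest-first placement described above can be sketched as a three-tier preference: node-local, then rack-local, then any free node. The node and rack names here are invented for illustration:

```python
def pick_node(block_location, free_nodes, rack_of):
    # Locality-aware placement sketch: prefer the node that already holds
    # the data block, then any free node on the same rack, then any free
    # node at all (remote transfer, the most expensive case).
    if block_location in free_nodes:
        return block_location                       # node-local: no transfer
    same_rack = [n for n in free_nodes
                 if rack_of[n] == rack_of[block_location]]
    if same_rack:
        return same_rack[0]                         # rack-local transfer
    return free_nodes[0]                            # cross-rack transfer

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# n1 holds the block: use it if free, else its rack-mate n2, else remote n3.
best = pick_node("n1", ["n2", "n3"], rack_of)       # "n2"
```

This mirrors the rack-awareness idea in HDFS/Hadoop scheduling, though the real implementations also account for node load and replica placement.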

4) Process data sequentially, avoiding random access

The nature of large-scale data processing means that the huge volume of records cannot fit in memory and usually must be processed from external storage. Because sequential disk access is much faster than random access, MapReduce is designed mainly for sequential, disk-oriented processing of large-scale data. To achieve high-throughput batch processing of large data sets, MapReduce uses the disks on a large number of storage nodes in a distributed cluster in parallel, providing high aggregate bandwidth for data access and transfer.

5) Hide system-level details from application developers

Software engineering practice guides note that professional programming is difficult because programmers must keep track of too many details (from variable names to the boundary conditions of complex algorithms), a huge cognitive burden that demands a high degree of concentration. Parallel programs are even harder to write, requiring programmers to handle complex and tedious details such as synchronization among threads; and because concurrent execution is unpredictable, such programs are also very difficult to debug. Moreover, in large-scale data processing the programmer would otherwise have to manage many further details: data partitioning and storage management, data distribution, data communication and synchronization, and collection of results. MapReduce provides an abstraction mechanism that separates programmers from these system-level details: programmers describe only what needs to be computed, while how to compute it is handled by the execution framework, freeing them to concentrate on the algorithmic design of their own computational problems.

6) Smooth and seamless scalability

The scalability here covers two levels: data scalability and system scalability. Ideally, a software algorithm should remain effective as the data size grows, with performance degrading at most in proportion to the growth of the data; at the cluster level, algorithm performance should grow nearly linearly with the number of compute nodes. The vast majority of existing single-machine algorithms do not meet these ideals: a single-machine algorithm that keeps intermediate results in memory fails quickly on large-scale data, and moving from a single machine to large-scale cluster parallelism fundamentally requires a completely different algorithm design. Remarkably, MapReduce achieves this ideal scalability in many situations: several studies have found that, for many computational problems, the performance of MapReduce-based computation does scale nearly linearly with the number of nodes.
