Big Data Processing Models and MapReduce

Source: Internet
Author: User
Keywords: big data, scaling, programmers

MapReduce takes an approach to big data problems that is almost entirely different from the traditional data processing model: it runs the work to be done in parallel across many commodity computer nodes in a cluster. Its implementation rests on a handful of basic ideas. Neither these ideas nor the techniques that implement them originated with MapReduce, but MapReduce combines them in a distinctive way and has given them new prominence.

[Figure: Traditional data processing model]

[Figure: MapReduce data processing model]

(1) Scale out rather than scale up: Big data processing is better served by a large number of low-end commodity servers (scale out) than by a small number of high-end servers (scale up). Scaling up means making a single system more powerful, usually a host with an SMP architecture. A high-end server with a large number of CPU sockets (hundreds or more) and a large amount of shared memory (up to hundreds of GB) is very expensive, yet its performance does not grow linearly with its price, so it is not cost-effective. Large numbers of low-end commodity servers, by contrast, are cheap, easy to replace, and easy to scale, which effectively avoids the drawbacks of scaling up.

(2) Assume failures are common: At the scale of a data warehouse architecture, failures are unavoidable and pervasive. If each server fails on average once every 1000 days, a cluster of 10000 such servers will see about 10 failures per day. A well-designed, fault-tolerant service must therefore cope with these routine hardware failures; that is, a failure must not cause inconsistency or nondeterminism visible to the user at the application level. The MapReduce programming model stays robust against system and hardware failures through a series of mechanisms such as automatically restarting failed tasks.
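The arithmetic behind this estimate can be checked with a short sketch. It assumes that failures are independent and that each server fails on any given day with probability 1/1000; the function names are illustrative and not part of any MapReduce API.

```python
def expected_failures_per_day(num_servers: int, mean_days_between_failures: float) -> float:
    """Expected number of server failures per day across the whole cluster."""
    return num_servers / mean_days_between_failures


def prob_at_least_one_failure(num_servers: int, mean_days_between_failures: float) -> float:
    """Probability that at least one server fails on a given day.

    Assumes independent failures, which is a simplification made for this sketch.
    """
    p_single = 1.0 / mean_days_between_failures
    return 1.0 - (1.0 - p_single) ** num_servers


if __name__ == "__main__":
    # The article's numbers: 10000 servers, one failure per server every 1000 days.
    print(expected_failures_per_day(10_000, 1_000))
    print(prob_at_least_one_failure(10_000, 1_000))
```

Under these assumptions a day without any failure is extremely unlikely, which is why the framework has to treat task restarts as routine rather than exceptional.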

(3) Move processing to the data: In traditional high-performance computing applications, a supercomputer typically has two kinds of nodes, processing nodes and storage nodes, linked by a high-capacity interconnect. Most data-intensive workloads, however, do not need much processing power, so separating computation from storage turns the network into the performance bottleneck. To overcome this problem, MapReduce co-locates computation and storage in its architecture and runs the processing directly on the nodes where the data is stored. This requires the support of a distributed file system.
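The idea of running work where the data already lives can be illustrated with a toy scheduler. This is a minimal sketch, not how any particular MapReduce implementation schedules tasks: the function `assign_tasks`, the block ids, and the node names are all hypothetical, and each node is assumed to take exactly one task per round.

```python
from typing import Dict, List, Set


def assign_tasks(block_replicas: Dict[str, Set[str]], free_nodes: List[str]) -> Dict[str, str]:
    """Assign each data block to a free node, preferring a node that stores a replica.

    block_replicas maps a block id to the set of nodes holding a copy of it.
    Returns a mapping from block id to the chosen node.
    """
    available = list(free_nodes)
    assignment: Dict[str, str] = {}
    for block, replicas in sorted(block_replicas.items()):
        if not available:
            break  # no free node left; remaining blocks wait for a later round
        # Prefer a node that already holds the data, so nothing crosses the network.
        local = [node for node in available if node in replicas]
        chosen = local[0] if local else available[0]
        assignment[block] = chosen
        available.remove(chosen)
    return assignment


if __name__ == "__main__":
    replicas = {"b1": {"n1", "n2"}, "b2": {"n2", "n3"}, "b3": {"n1"}}
    print(assign_tasks(replicas, ["n1", "n2", "n3"]))
```

In this example the blocks `b1` and `b2` are processed on nodes that hold them, while `b3` falls back to a remote node because its only replica holder is already busy. The greedy choice is deliberately simple and is not claimed to be optimal.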

(4) Process data sequentially and avoid random access: Big data processing usually means the data set is too large to fit entirely in memory and must therefore be stored on disk. The seek operation is an inherent weakness of mechanical disks and makes random data access very expensive, so random access should be avoided wherever possible and the data organized for sequential processing. Solid-state disks avoid some of the weaknesses of mechanical disks, but their high price and the remaining cost of random access mean they have not yet delivered a leap in performance. MapReduce is designed mainly for batch operations on massive data sets: all computation is organized into long streaming operations, trading latency for higher throughput.
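A back-of-the-envelope comparison shows why streaming wins on mechanical disks. Every number below is an assumption chosen for illustration and does not come from the article: a 1 TB data set of 100-byte records, 1% of the records updated, 10 ms per seek, two seeks per random update (one to read, one to write), and 100 MB/s of sequential throughput.

```python
DATASET_BYTES = 10**12             # assumed: 1 TB data set
RECORD_BYTES = 100                 # assumed: 100-byte records
SEEK_SECONDS = 0.010               # assumed: 10 ms per disk seek
THROUGHPUT_BYTES_PER_S = 100 * 10**6   # assumed: 100 MB/s sequential transfer


def random_update_seconds(num_updates: int, seek_seconds: float, seeks_per_update: int = 2) -> float:
    """Time to update records in place, paying one seek to read and one to write."""
    return num_updates * seeks_per_update * seek_seconds


def sequential_rewrite_seconds(dataset_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to stream the whole data set once for reading and once for writing."""
    return 2 * dataset_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    num_records = DATASET_BYTES // RECORD_BYTES
    num_updates = num_records // 100  # update 1% of the records
    random_s = random_update_seconds(num_updates, SEEK_SECONDS)
    stream_s = sequential_rewrite_seconds(DATASET_BYTES, THROUGHPUT_BYTES_PER_S)
    print(f"random access: {random_s / 86400:.1f} days")
    print(f"sequential:    {stream_s / 3600:.1f} hours")
```

With these assumed figures, updating 1% of the records through random access takes on the order of weeks, while rewriting the entire data set sequentially takes a few hours.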

(5) Hide system-level details: Professional programmers agree that one of the hard parts of software development is keeping track of many details in short-term memory at once, from simple things such as variable names to complex things such as algorithms. This imposes a heavy cognitive load because it demands intense concentration throughout development, and integrated development environments (IDEs) were later created to ease the problem to some extent. Developing distributed programs is more complex still: programmers must coordinate details across multiple threads, processes, and even hosts. The most troublesome aspect is that distributed programs run in an unpredictable order and access data in unpredictable patterns, which greatly increases the likelihood of race conditions, deadlocks, and other notorious problems. The traditional remedy is to use low-level primitives such as mutexes and to apply higher-level design patterns such as producer-consumer queues, but distributed programs built this way are extremely hard to understand and debug. The MapReduce programming model isolates programmers from low-level system details by exposing a simple, well-defined interface to a small number of components. MapReduce separates what computation to perform from how to run it in parallel across many nodes: the former is controlled by the programmer, the latter entirely by the MapReduce framework or runtime environment.
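The split between "what" and "how" can be sketched with the classic word-count example. The two user functions below express only the computation; `run_job` is a hypothetical single-process stand-in for the framework, written for this illustration, and a real runtime would additionally distribute the map and reduce calls across nodes and handle failures.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(line: str) -> Iterator[Tuple[str, int]]:
    """User code (the "what"): emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User code (the "what"): sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy framework (the "how"): apply the mapper, group values by key, apply the reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], map_word_count, reduce_word_count))
```

Note that neither user function mentions threads, locks, or hosts. Swapping the toy `run_job` for a distributed runtime would not require changing them, which is the isolation the paragraph describes.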

(6) Seamless scalability: Scalable algorithms are the core of data-intensive processing applications. An ideal scalable algorithm should satisfy two properties: first, when the data doubles, the processing time should at most double; second, when the cluster doubles in size, the processing time should be at least halved. Furthermore, the ideal algorithm should handle data at any scale, up to the PB level, and run well on clusters of any size, up to thousands of nodes; and whatever the cluster size or data size, the program should need no changes, not even to its configuration parameters. Reality is harsh, however, and no such ideal algorithm exists. In his classic "The Mythical Man-Month", Fred Brooks asserts that adding programmers to a project that is behind schedule only delays it further. The reason is that splitting a complex task into smaller tasks and assigning them in parallel does not yield linear scaling: "one woman can bear a child in ten months, but ten women cannot bear a child in one month". MapReduce has nonetheless overturned this assertion, at least in certain areas. One of its most exciting features is that it scales linearly as nodes are added: when the cluster grows N-fold, the time needed to process the same amount of data shrinks by a factor of N.
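The two properties of an ideal scalable algorithm amount to a simple proportionality, which the sketch below writes out. It models only the ideal case described above, not measured MapReduce behavior; the function name and the one-hour baseline are assumptions made for illustration.

```python
def ideal_runtime(base_seconds: float, data_factor: float, node_factor: float) -> float:
    """Runtime under ideal linear scaling.

    base_seconds: runtime for the original data on the original cluster.
    data_factor:  how many times larger the data becomes (2 means doubled).
    node_factor:  how many times larger the cluster becomes (2 means doubled).
    """
    if data_factor <= 0 or node_factor <= 0:
        raise ValueError("scaling factors must be positive")
    return base_seconds * data_factor / node_factor


if __name__ == "__main__":
    base = 3600.0  # assumed baseline: one hour
    print(ideal_runtime(base, data_factor=2, node_factor=1))  # data doubled
    print(ideal_runtime(base, data_factor=1, node_factor=2))  # cluster doubled
    print(ideal_runtime(base, data_factor=2, node_factor=2))  # both doubled
```

Doubling the data doubles the time, doubling the cluster halves it, and doubling both leaves the runtime unchanged. Real jobs fall short of this line to the extent that coordination and data movement add overhead.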


Original link: http://mageedu.blog.51cto.com/4265610/1105727
