Introduction to MapReduce

MapReduce is a programming model introduced in Dean, Jeffrey & Ghemawat, Sanjay (2004), "MapReduce: Simplified Data Processing on Large Clusters". It is mainly used for parallel operations on large-scale datasets. It reduces parallel computing to two steps, map and reduce, which makes it much easier for programmers to run their programs on a distributed system without writing distributed parallel code themselves. The programmer only needs to specify a map function that transforms a set of key-value pairs into a new set of intermediate key-value pairs, and a concurrent reduce function that merges all intermediate values associated with the same key.

MapReduce is rooted in the map and reduce functions of functional programming. A job consists of these two operations, each of which may run as many parallel instances (many map tasks and many reduce tasks). The map function accepts a set of data and converts it into a list of key/value pairs, one pair for each element of the input. The reduce function accepts the list generated by the map function and condenses it by key, producing one key/value pair per distinct key.
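
To make the two functions concrete, here is a minimal single-machine sketch in plain Java (a hypothetical WordCount, not a distributed implementation): map turns each input line into (word, 1) pairs, the pairs are grouped by key, and reduce sums the values for each key.

import java.util.*;
import java.util.stream.*;

// Minimal single-machine sketch of the map/reduce idea (hypothetical WordCount).
public class MapReduceSketch {

    // map: one input line -> a list of (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // reduce: one key plus all of its values -> one (key, value) result
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        return Map.entry(key, values.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be", "to map and to reduce");

        // map phase: every line produces its own list of intermediate pairs
        List<Map.Entry<String, Integer>> pairs = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());

        // grouping by key: what a real framework does between map and reduce
        Map<String, List<Integer>> grouped = pairs.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // reduce phase: one aggregated (key, value) pair per distinct key
        grouped.forEach((k, vs) -> System.out.println(reduce(k, vs)));
    }
}

Running it prints one word=count pair per distinct word; the grouping step in the middle is exactly what a distributed framework performs in its partition/shuffle step.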

A typical MapReduce process is as follows:

Input -> map -> partition -> reduce -> output

Input phase

Input data must be delivered to the mappers in a specific format. Many input formats are available, and the data is generally distributed across multiple machines.
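
As a sketch, assuming the Hadoop MapReduce API (the framework whose default partition formula is quoted in the partition phase below), the input format and input path are declared on the job; the path here is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch of the input side only: an InputFormat decides how the distributed data
// is split and turned into records for the mappers. TextInputFormat, for example,
// delivers each line as an (offset, line) record.
public class InputSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input sketch");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder path
    }
}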

Map phase

Each map task processes its share of the input data and outputs a set of intermediate key/value pairs.
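
A minimal sketch of a map task, again assuming the Hadoop API and the hypothetical WordCount example: the input key is the byte offset of a line, the input value is the line itself, and one (word, 1) pair is emitted per token.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical WordCount mapper: emits one (word, 1) pair per token of the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit an intermediate key/value pair
            }
        }
    }
}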

Partition phase

Divides the intermediate results output by the map tasks into R partitions by key (R is the number of predefined reduce tasks). The default partitioning algorithm is "(key.hashCode() & Integer.MAX_VALUE) % numPartitions", which guarantees that all pairs with the same key are processed by the same reducer.
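
Written out as a partitioner class (a sketch of the default behavior in the Hadoop-style API, not a required customization): masking the hash code with Integer.MAX_VALUE clears the sign bit so the modulo result is never negative, and every occurrence of a key therefore lands on the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default hash-based partitioning rule described above.
public class HashPartitionerSketch extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // same key -> same hash -> same partition -> same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}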

Reduce phase

Each reducer obtains the intermediate results produced by the mappers and processes its assigned set of keys as input.
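
Continuing the hypothetical WordCount sketch in the Hadoop API: by the time reduce is called, the framework has already grouped the intermediate pairs, so the reducer sees each key once together with all of its values and emits one aggregated pair per key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical WordCount reducer: sums all the 1s emitted for a word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // one output pair per key
    }
}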

Output phase

The reducer writes its output as key/value pairs, in a format analogous to the mapper input, so the output of one job's reducers can of course be used as the input of another job's mappers.
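
A sketch of that chaining with the Hadoop API: the first job's output directory becomes the second job's input directory, so the reducer output of stage one is read by the mappers of stage two. The paths and job names are placeholders, and the mapper/reducer classes are the WordCount sketches shown above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two-stage driver sketch: stage 1 writes to an intermediate directory that stage 2 reads.
public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/data/stage1-output");        // placeholder path

        Job stage1 = Job.getInstance(conf, "stage 1");
        stage1.setJarByClass(TwoStageDriver.class);
        stage1.setMapperClass(WordCountMapper.class);               // mapper sketch above
        stage1.setReducerClass(WordCountReducer.class);             // reducer sketch above
        stage1.setOutputKeyClass(Text.class);
        stage1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(stage1, new Path("/data/input"));   // placeholder path
        FileOutputFormat.setOutputPath(stage1, intermediate);
        stage1.waitForCompletion(true);

        Job stage2 = Job.getInstance(conf, "stage 2");
        FileInputFormat.addInputPath(stage2, intermediate);         // stage 1 output feeds stage 2
        FileOutputFormat.setOutputPath(stage2, new Path("/data/final-output"));
        // ... configure stage 2's mapper/reducer here and submit it the same way.
    }
}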

Advantages and disadvantages of MapReduce:

There are two main advantages:
1. The MapReduce distributed processing framework can not only handle large-scale data, it also hides many complicated details such as automatic parallelization, load balancing, and failure recovery, which greatly simplifies the programmer's work.
2. MapReduce is highly scalable: each server added to the cluster contributes roughly the same amount of additional computing capacity, whereas most earlier distributed processing frameworks fall far short of MapReduce in this respect.

The biggest disadvantage of MapReduce is that it does not meet the needs of real-time applications.
