MapReduce is a programming model introduced by Dean, Jeffrey & Ghemawat, Sanjay (2004), "MapReduce: Simplified Data Processing on Large Clusters". It is mainly used for parallel operations on large-scale datasets. It reduces parallel computing to two operations, map and reduce, which lets programmers run their programs on a distributed system without writing distributed parallel code themselves. The programmer only needs to specify a map function that maps a set of key/value pairs to a new set of intermediate key/value pairs, and a reduce function that merges all the intermediate values that share the same key.
MapReduce is rooted in the map and reduce functions of functional programming. A job consists of these two operations, each of which may run as many instances (many map tasks and many reduce tasks). The map function accepts a set of input data and converts it to a list of key/value pairs; each element in the input corresponds to one key/value pair. The reduce function accepts the lists generated by the map function and narrows them down by key, producing one key/value pair per key.
A typical map-reduce process is as follows:
Input -> map -> partition -> reduce -> output
Input phase
Input data must be delivered to the mapper in a specified format. Many input formats exist, and the data is generally distributed across multiple machines.
Map phase
Processes the input data and outputs a set of intermediate key/value pairs.
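As a sketch of the map phase (a hypothetical word-count example, not from the original text), a mapper might emit a (word, 1) pair for every word it sees in a line of input:

```python
def word_count_map(line):
    """Map phase: turn one line of input into a list of (key, value) pairs.

    Hypothetical word-count example: each word becomes a key,
    with a count of 1 as its value.
    """
    return [(word, 1) for word in line.split()]

pairs = word_count_map("the quick the")
# pairs == [("the", 1), ("quick", 1), ("the", 1)]
```

Note that the mapper does not aggregate anything itself; duplicate keys are expected and are merged later in the reduce phase.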
Partition phase
Divides the intermediate results output by the map tasks into r partitions by key (r is the number of pre-defined reduce tasks). The default partitioning algorithm is "(key.hashCode() & Integer.MAX_VALUE) % numPartitions", which ensures that all keys in a given range are processed by the same reducer.
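The default algorithm quoted above is Java (Hadoop's hash partitioner). A rough Python analogue, substituting Python's built-in hash() for Java's hashCode(), might look like this:

```python
def partition(key, num_partitions):
    """Assign a key to one of num_partitions reduce tasks.

    Mirrors the shape of Hadoop's default
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions:
    the bit mask keeps the hash non-negative, so the modulo result
    is always a valid partition index. Python's hash() stands in for
    Java's hashCode() here, so the actual partition numbers will
    differ from Hadoop's.
    """
    INT_MAX = 0x7FFFFFFF  # Integer.MAX_VALUE
    return (hash(key) & INT_MAX) % num_partitions

p = partition("apple", 4)
assert 0 <= p < 4  # the same key always lands in the same partition
```

Because the partition depends only on the key, every occurrence of a given key, regardless of which mapper produced it, is routed to the same reducer.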
Reduce phase
The reducer obtains the intermediate results output by the mappers and processes one key range as its input.
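Continuing the hypothetical word-count sketch, a reducer receives one key together with all the values collected for it and combines them, here by summing:

```python
def word_count_reduce(key, values):
    """Reduce phase: combine all values observed for a single key.

    For word count, the reducer simply sums the 1s emitted by the
    mappers, yielding (word, total_count).
    """
    return (key, sum(values))

result = word_count_reduce("the", [1, 1, 1])
# result == ("the", 3)
```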
Output phase
The reducer's output format corresponds to the mapper's input format, so the output of one reducer can in turn be processed as the input of another mapper.
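The phases above can be tied together in a minimal single-process sketch (my own illustration, with hypothetical helper names; real frameworks distribute each phase across many machines):

```python
from collections import defaultdict

def run_mapreduce(lines, map_fn, reduce_fn):
    """Minimal single-process model of the map -> group -> reduce flow.

    map_fn:    line -> list of (key, value) pairs
    reduce_fn: (key, list_of_values) -> (key, result)
    This only models the data flow described above; it performs no
    distribution, partitioning across machines, or fault tolerance.
    """
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):   # map phase
            grouped[key].append(value)    # group values by key
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())  # reduce phase

counts = run_mapreduce(
    ["the quick fox", "the fox"],
    lambda line: [(w, 1) for w in line.split()],
    lambda k, vs: (k, sum(vs)),
)
# counts == {"the": 2, "quick": 1, "fox": 2}
```

Because the output is again a set of key/value pairs, `counts` could itself serve as the input of a second MapReduce job, as noted above.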
Advantages and disadvantages of MapReduce:
The main advantages are twofold:
1. The MapReduce distributed processing framework can not only process large-scale data, but also hides many complicated details such as automatic parallelization, load balancing, and fault tolerance, which greatly simplifies the programmer's work;
2. MapReduce is highly scalable: each server added to the cluster contributes roughly the same additional computing capacity, whereas most earlier distributed processing frameworks fall far short of MapReduce in this respect. The biggest disadvantage of MapReduce is that it does not meet the needs of real-time applications.