Big Data Computation Model: The MapReduce Principle
2016-01-24 Du Yishu
MapReduce is a parallel computation model for large data sets. It was proposed by Google, and it is the computation model used by today's popular Hadoop.
A plain-language explanation of MapReduce
A library needs to count its books, which sit on 10 bookshelves. To speed up the count, the administrator finds 10 students and makes each one responsible for counting the books on a single bookshelf.
Student Zhang counts bookshelf 1
Student Wang counts bookshelf 2
Student Liu counts bookshelf 3
......
After a while, the 10 students report their counts to the administrator one after another. The administrator adds the numbers together and obtains the total number of books.
This process can be understood as the working process of MapReduce.
There are two core operations in MapReduce:
(1) Map
The administrator assigns each bookshelf to a student, and every student performs the same "count" operation on their own shelf. This process is the map.
(2) Reduce
The administrator sums up the results reported by the students. This process is the reduce.
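The library analogy can be sketched in a few lines of Python. This is only an illustration: the shelf contents and the names count_shelf and add are made up here, and everything runs in one process rather than on 10 separate "students".

```python
from functools import reduce

# Hypothetical shelf contents; the book names are invented for illustration.
shelves = [
    ["book A", "book B", "book C"],            # bookshelf 1 (Zhang)
    ["book D", "book E"],                      # bookshelf 2 (Wang)
    ["book F", "book G", "book H", "book I"],  # bookshelf 3 (Liu)
]

def count_shelf(shelf):
    """Map: the same counting operation, applied to one shelf."""
    return len(shelf)

def add(total, count):
    """Reduce: the administrator adds one reported count to the running total."""
    return total + count

per_shelf_counts = list(map(count_shelf, shelves))  # one count per student
total_books = reduce(add, per_shelf_counts, 0)      # combined by the administrator

print(per_shelf_counts)  # [3, 2, 4]
print(total_books)       # 9
```

Because each shelf is counted independently of the others, the map calls could run at the same time on different machines; only the final summation needs all the results.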
Breaking down the MapReduce workflow
Here is a classic case (word count) that shows how MapReduce works.
A text file has been divided into 4 parts, and each part is stored on one of 4 servers.
Text 1: the weather is good
Text 2: today is good
Text 3: good weather is good
Text 4: today has good weather
Requirement: count the number of occurrences of each word
Processing steps
01
Splitting into words (map)
Map Node 1
Input: (Text1, "the weather is good")
Output: (the, 1), (weather, 1), (is, 1), (good, 1)
Map Node 2
Input: (Text2, "today is good")
Output: (today, 1), (is, 1), (good, 1)
Map Node 3
Input: (Text3, "good weather is good")
Output: (good, 1), (weather, 1), (is, 1), (good, 1)
Map Node 4
Input: (Text4, "today has good weather")
Output: (today, 1), (has, 1), (good, 1), (weather, 1)
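The map step above can be sketched as a small Python function. The name map_words and the dictionary texts are illustrative choices, not part of any MapReduce library, and the sketch assumes that words are separated by whitespace.

```python
def map_words(doc_id, text):
    """Emit one (word, 1) pair per word, in order of appearance.

    doc_id is the input key; it is not needed to produce the pairs.
    """
    return [(word, 1) for word in text.split()]

# The four text fragments from the example, keyed by their identifiers.
texts = {
    "Text1": "the weather is good",
    "Text2": "today is good",
    "Text3": "good weather is good",
    "Text4": "today has good weather",
}

# Each entry stands for the output of one map node.
map_outputs = {doc_id: map_words(doc_id, text) for doc_id, text in texts.items()}

for doc_id, pairs in map_outputs.items():
    print(doc_id, pairs)
```

Note that a word appearing twice in the same text, such as "good" in Text3, produces two separate (good, 1) pairs at this stage; no counting happens yet.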
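02
Sort
Each map node sorts its own output pairs by word (shown here in alphabetical order).
Map Node 1
Output: (good, 1), (is, 1), (the, 1), (weather, 1)
Map Node 2
Output: (good, 1), (is, 1), (today, 1)
Map Node 3
Output: (good, 1), (good, 1), (is, 1), (weather, 1)
Map Node 4
Output: (good, 1), (has, 1), (today, 1), (weather, 1)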
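As a rough sketch of the sort step, the pairs produced by one map node can be ordered by their key. The example uses the output of Map Node 3 and assumes plain alphabetical ordering of the words; the variable names are illustrative.

```python
# Output of Map Node 3 for "good weather is good", before sorting.
node3_output = [("good", 1), ("weather", 1), ("is", 1), ("good", 1)]

# Sort by the word only. Python's sort is stable, so the two
# (good, 1) pairs keep their relative order and end up adjacent.
node3_sorted = sorted(node3_output, key=lambda pair: pair[0])

print(node3_sorted)
# [('good', 1), ('good', 1), ('is', 1), ('weather', 1)]
```

The point of sorting is that pairs with the same word become neighbors, which makes the following merge step a simple pass over the list.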
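03
Merge
Each map node merges pairs that have the same word by adding their counts. Only Map Node 3 changes, because "good" appears twice in Text 3.
Map Node 1
Output: (good, 1), (is, 1), (the, 1), (weather, 1)
Map Node 2
Output: (good, 1), (is, 1), (today, 1)
Map Node 3
Output: (good, 2), (is, 1), (weather, 1)
Map Node 4
Output: (good, 1), (has, 1), (today, 1), (weather, 1)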
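The merge step on a single map node can be sketched as follows, assuming the pairs have already been sorted by word. The function name combine is a hypothetical helper chosen for this illustration.

```python
from itertools import groupby
from operator import itemgetter

def combine(sorted_pairs):
    """Merge adjacent pairs that share a word by summing their counts.

    groupby only groups neighboring items, so the input must be sorted by word.
    """
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    ]

# Sorted output of Map Node 3 for "good weather is good".
node3_sorted = [("good", 1), ("good", 1), ("is", 1), ("weather", 1)]
node3_merged = combine(node3_sorted)

print(node3_merged)
# [('good', 2), ('is', 1), ('weather', 1)]
```

Merging locally means each map node sends fewer pairs onward; the totals are unchanged, only the amount of data to transfer shrinks.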
04
Summary statistics
MapReduce introduces the concept of a barrier, which some translate as "synchronization barrier". I understand it as a "dividing line": the boundary that is crossed on the way into the reduce phase.
The barrier's role is to regroup the merged results.
For example, with 3 reduce nodes, the results of the 4 map nodes above are regrouped so that identical words are put together, and the groups are then assigned to the 3 reduce nodes.
Each reduce node then adds up the counts it receives to produce the final result.
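The regrouping and the final reduce can be sketched in Python as well. This is a single-process illustration under stated assumptions: the per-node lists are the sorted and merged results derived from the four texts, and assigning a word to a reduce node by a CRC32 checksum modulo 3 is just one possible rule chosen here, not something the example prescribes.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3

# Per-node results after sorting and merging (derived from the four texts).
merged_outputs = [
    [("good", 1), ("is", 1), ("the", 1), ("weather", 1)],    # Map Node 1
    [("good", 1), ("is", 1), ("today", 1)],                  # Map Node 2
    [("good", 2), ("is", 1), ("weather", 1)],                # Map Node 3
    [("good", 1), ("has", 1), ("today", 1), ("weather", 1)], # Map Node 4
]

def partition(word):
    """Pick a reduce node for a word (assumed rule: checksum modulo node count)."""
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS

# Regroup: every occurrence of a word goes to the same reduce node.
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for node_pairs in merged_outputs:
    for word, count in node_pairs:
        reducer_inputs[partition(word)][word].append(count)

# Reduce: each node sums the counts of the words it was assigned.
final_counts = {}
for grouped in reducer_inputs:
    for word, counts in grouped.items():
        final_counts[word] = sum(counts)

print(sorted(final_counts.items()))
# [('good', 5), ('has', 1), ('is', 3), ('the', 1), ('today', 2), ('weather', 3)]
```

Which reduce node receives which word depends on the partition rule, but the final counts do not, because all pairs for a given word always arrive at the same node.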