We already know the three core modules of Hadoop: HDFS, MapReduce, and YARN.
What is MapReduce?
MapReduce is a programming model for parallel computation over large-scale datasets. Its two central ideas are map (mapping) and reduce (reduction).
The idea and inspiration for MapReduce come from functional programming, where map applies an operation or function to each element of a list. For example, applying a multiply-by-two function to the list [1,2,3,4] produces another list [2,4,6,8], and the original list is not changed in the process. Functional programming holds that data should be kept immutable, so that processes or threads never share mutable state. This means that, simple as the function is, two or more threads can apply it to the same list concurrently without affecting each other, because the list itself is never changed.
MapReduce is used for parallel computation over massive amounts of data, which requires distributing the work across a large number of machines. If the components shared data, keeping it synchronized between nodes would make the system inefficient and unreliable. In MapReduce, therefore, data elements are immutable: a change made to a key-value pair is not written back to the input file. Nodes communicate only when new key-value pairs are output, and Hadoop passes those output pairs on to the next stage.
Conceptually, a MapReduce program turns a list of input data into a list of output data. It does so with two data transformations, one map and one reduce. Let's look at an example:
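The multiply-by-two example can be sketched in a few lines of Python. This is only an illustration of the functional idea, not Hadoop code; the names times_two, original, and doubled are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def times_two(x):
    """Return a new value; the argument itself is never modified."""
    return x * 2


original = [1, 2, 3, 4]

# Sequential map: builds a new list and leaves the input untouched.
doubled = list(map(times_two, original))

# Two threads apply the same function to the same list concurrently.
# This is safe because nothing writes to `original`.
with ThreadPoolExecutor(max_workers=2) as pool:
    doubled_concurrently = list(pool.map(times_two, original))

print(original)              # [1, 2, 3, 4]
print(doubled)               # [2, 4, 6, 8]
print(doubled_concurrently)  # [2, 4, 6, 8]
```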
(Input file: 41 e-mail addresses, one per line, numbered 1 to 41. The addresses themselves are obscured in the source.)
Step one: Map. Read the contents of the input file and parse each line into a key-value pair. Each address is split at '@'; the mail domain becomes the key and the value is 1.
<21cn.com 1><126.com 1><qq.com 1><163.com 1><163.com 1><126.com 1><qq.com 1><gmail.com 1>, ...
Step two: Combine. Group together the values that share the same mail domain.
<163.com 1 1 1 1 1 1 1 1 1 1 1 1 1 1><126.com 1 1 1 1 1 1 1 1 1><qq.com 1 1 1 1 1 1 1 1 1><sina.com 1 1><sohu.com 1 1><yahoo.com.cn 1 1><21cn.com 1><gmail.com 1><yahoo.cn 1>
Step three: Reduce. Sum the values for each mail domain.
<163.com 14><126.com 9><qq.com 9><sina.com 2><sohu.com 2><yahoo.com.cn 2><21cn.com 1><gmail.com 1><yahoo.cn 1>
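The three steps can be sketched in plain Python as follows. The six addresses are made up for illustration (they only reuse domains that appear in the results above), and the function names map_address, group_by_key, and reduce_counts are assumptions of this sketch rather than Hadoop APIs.

```python
from collections import defaultdict

# Hypothetical sample input: made-up addresses, not the original 41.
addresses = [
    "user1@163.com",
    "user2@126.com",
    "user3@qq.com",
    "user4@163.com",
    "user5@gmail.com",
    "user6@163.com",
]


def map_address(address):
    """Step one (map): split at '@' and emit (domain, 1)."""
    domain = address.split("@", 1)[1]
    return (domain, 1)


def group_by_key(pairs):
    """Step two: collect the values that share the same domain."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(domain, ones):
    """Step three (reduce): sum the values for one domain."""
    return (domain, sum(ones))


mapped = [map_address(a) for a in addresses]
grouped = group_by_key(mapped)
counts = dict(reduce_counts(d, ones) for d, ones in grouped.items())

print(grouped)  # {'163.com': [1, 1, 1], '126.com': [1], 'qq.com': [1], 'gmail.com': [1]}
print(counts)   # {'163.com': 3, '126.com': 1, 'qq.com': 1, 'gmail.com': 1}
```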
And there are the results we wanted. At its root, that is the whole idea of MapReduce. Quite simple, isn't it?
The basic process of MapReduce
The MapReduce computing model consists of two phases, map and reduce. The user can implement a distributed computation simply by implementing two functions, map() and reduce().
Figure 1: The map and reduce processes for counting the number of shapes of each kind
The parameters of both functions, map() and reduce(), are key-value pairs, which represent the input to the function.
1. Map Task Processing
1.1 Read the contents of the input file and parse each line into a key-value pair. The map function is called once for each key-value pair.
1.2 Apply your own logic to the input key and value, and output new key-value pairs.
1.3 Partition the output key-value pairs.
1.4 Within each partition, sort and group the data by key, so that the values of the same key are placed in one collection.
1.5 (Optional) Reduce the grouped data locally.
2. Reduce Task Processing
2.1 The outputs of the map tasks are copied over the network, partition by partition, to the different reduce nodes.
2.2 Merge and sort the outputs of the map tasks. Apply the reduce function's own logic to the input key and values, and output new key-value pairs.
2.3 Save the output of reduce to a file.
Figure 2: The map and reduce process for word count
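The numbered steps of the map task and reduce task can be simulated in a single Python process for word count. This is a rough sketch under stated assumptions, not how Hadoop is implemented: the two input splits are invented, NUM_REDUCERS is an arbitrary choice, all function names are hypothetical, and zlib.crc32 merely stands in for a partitioning function so that the result is stable between runs.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed number of reduce tasks


def map_line(line):
    """Steps 1.1-1.2: turn one input line into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def partition(key):
    """Step 1.3: choose a reduce task for a key (stand-in hash function)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def sort_and_group(pairs):
    """Steps 1.4 and 2.2: sort by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_word(word, counts):
    """Reduce logic: sum the counts of one word."""
    return (word, sum(counts))


def run_map_task(lines):
    """One map task; returns {partition number: [(word, partial count), ...]}."""
    by_partition = {p: [] for p in range(NUM_REDUCERS)}
    for line in lines:
        for key, value in map_line(line):
            by_partition[partition(key)].append((key, value))
    # Step 1.4 (sort and group) followed by step 1.5 (optional local reduce).
    return {p: [reduce_word(k, vs) for k, vs in sort_and_group(pairs)]
            for p, pairs in by_partition.items()}


def run_reduce_task(p, map_outputs):
    """One reduce task for partition p."""
    # Step 2.1: copy partition p from the output of every map task.
    copied = [pair for output in map_outputs for pair in output[p]]
    # Step 2.2: merge, sort, and apply the reduce logic.
    return [reduce_word(k, vs) for k, vs in sort_and_group(copied)]


# Two hypothetical input splits, one per map task.
splits = [
    ["hello world", "hello hadoop"],
    ["hello mapreduce", "world of data"],
]

map_outputs = [run_map_task(split) for split in splits]

results = {}
for p in range(NUM_REDUCERS):
    # Step 2.3 would save this to a file; here it is kept in a dict.
    results.update(run_reduce_task(p, map_outputs))

print(sorted(results.items()))
```

Using the same summing function for the local reduce in step 1.5 and for the final reduce works here because addition gives the same total no matter how the partial sums are grouped.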
The basic design idea of MapReduce
After all of the above, the design idea of MapReduce comes down to the following three points:
(1) An approach to processing big data in parallel: divide and conquer
(2) Raising it to an abstract model: map and reduce
(3) Raising it to a framework: a unified framework that hides system-level details from programmers
Extended *********************
Classmate: Blogger, how do I explain MapReduce to a girl?
Small Talk: First, look at how the Indian Java programmer Shekhar Gulati explained MapReduce to his wife (source: CSDN).
Me: How do you prepare onion chili sauce?
Wife: I take an onion, chop it up, mix in salt and water, and then grind it in a mixer grinder. That gives you onion chili sauce. But what does this have to do with MapReduce?
Me: Wait a minute. Let me set up the whole scenario so that you can understand MapReduce within 15 minutes.
Wife: All right.
Me: Now, suppose you want a bottle of mixed chili sauce made with mint, onion, tomato, chili, and garlic. What would you do?
Wife: I'll take a pinch of mint leaves, one onion, one tomato, one chili, and one garlic, chop them, add salt and water, and then grind them in a mixer grinder. That gives you a bottle of mixed chili sauce.
Me: Right. Now let's apply the concepts of MapReduce to the recipe. Map and reduce are really two kinds of operations; let me explain them in detail.
Map (mapping): Chopping onions, tomatoes, chilies, and garlic is a map operation that acts on each of these objects. So if you give map an onion, map chops the onion. Likewise, you hand the chilies, garlic, and tomatoes to map one by one, and you get all kinds of pieces. So when you are chopping a vegetable such as an onion, you are performing a map operation. The map operation applies to each vegetable and produces one or more pieces from it; in our case, it produces vegetable pieces. During a map operation an onion may turn out to be bad, in which case you simply throw it away. So if a bad onion turns up, the map operation filters it out and produces no pieces from it.
Reduce (reduction): At this stage, you put all the vegetable pieces into the grinder and grind them, and you get a bottle of chili sauce. This means that to make a bottle of chili sauce, you have to grind all the ingredients together. The grinder, in other words, aggregates the vegetable pieces produced by the map operation.
Wife: So, this is MapReduce?
Me: Yes and no. This is only one part of MapReduce; the real power of MapReduce lies in distributed computing.
Wife: Distributed computing? What is that? Please explain it to me.
Me: No problem.
Suppose you entered a chili sauce contest and your recipe won the best chili sauce award. After the award, your recipe became popular, so you want to start selling homemade chili sauce. Suppose you need to produce 10,000 bottles of chili sauce every day. What would you do?
Wife: I will find a supplier who can provide me with a lot of raw materials.
Me: Yes, that's right. But can you handle the production on your own? That is, can you chop all the raw materials by yourself? Can a single grinder meet the demand? On top of that, we now need to supply different kinds of chili sauce, such as onion chili sauce, green pepper chili sauce, tomato chili sauce, and so on.
Wife: Of course not, I will hire more workers to cut vegetables. I also need more grinders so that I can produce chili sauce more quickly.
Me: Yes, so now you have to divide up the work. You will need several people chopping vegetables together. Each person has to deal with a bag full of vegetables, and each person is in effect performing a simple map operation. Each person keeps taking vegetables out of the bag and processes only one vegetable at a time, chopping it up, until the bag is empty.
In this way, when all the workers are finished, the workbench (where everyone works) holds onion pieces, tomato pieces, garlic pieces, and so on.
Wife: But how do I make the different kinds of chili sauce?
Me: Now you will see the phase of MapReduce we have left out so far: the shuffle phase. MapReduce groups all the chopped vegetables output by the map operations according to their key. The shuffle is done automatically, and you can think of the key as simply the name of an ingredient, such as onion. So everything with the onion key is grouped together and transferred to the grinder that grinds onions. That gives you onion chili sauce. In the same way, all the tomatoes are transferred to the grinder labeled tomato, and tomato chili sauce is made.
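For readers who prefer code to cooking, the kitchen story can be sketched in Python. Everything here is invented for illustration: the bags, the freshness flag, and the names chop, worker, shuffle, and grind. Threads stand in for the hired workers; a real cluster would spread the bags across machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def chop(vegetable):
    """Map: one vegetable in, zero or more (key, piece) pairs out."""
    name, is_fresh = vegetable
    if not is_fresh:
        return []  # a bad onion is filtered out and produces no pieces
    return [(name, f"chopped {name}")]


def worker(bag):
    """One worker empties one bag, handling one vegetable at a time."""
    pieces = []
    for vegetable in bag:
        pieces.extend(chop(vegetable))
    return pieces


def shuffle(all_pieces):
    """Shuffle: send pieces with the same key to the same grinder."""
    grinders = defaultdict(list)
    for name, piece in all_pieces:
        grinders[name].append(piece)
    return dict(grinders)


def grind(name, pieces):
    """Reduce: aggregate all pieces of one ingredient into one sauce."""
    return f"{name} chili sauce from {len(pieces)} piece(s)"


# Two hypothetical bags of (name, is_fresh) vegetables, one per worker.
bags = [
    [("onion", True), ("tomato", True), ("onion", False)],
    [("tomato", True), ("onion", True), ("garlic", True)],
]

# The workers chop their bags in parallel.
with ThreadPoolExecutor(max_workers=len(bags)) as pool:
    chopped = [piece for pieces in pool.map(worker, bags) for piece in pieces]

sauces = {name: grind(name, pieces) for name, pieces in shuffle(chopped).items()}

for sauce in sauces.values():
    print(sauce)
```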
10 minutes to see through MapReduce