Hadoop is getting more and more popular, and the sub-projects around it are growing fast, with more than ten of them listed on the Apache website; but going back to the origin, most of these projects are built on Hadoop Common.
MapReduce is the core of that core. So what exactly is MapReduce, and how does it actually work?
In principle it looks simple: sketch a diagram with a map stage and a reduce stage and you seem to be done. But it contains many sub-stages, especially shuffle, which is often called "the heart" of MapReduce and "the place where the miracle happens". Not many people can really explain how these stages relate to each other, yet understanding them is very useful for understanding and mastering MapReduce, and for tuning it.
First let's look at a diagram that covers the whole process from beginning to end; the explanation of every step below can be checked against it (the figure is my own).
In short, between the familiar map and reduce there is a series of steps, including partition, sort, combine, copy, and merge, which are usually referred to collectively as "shuffle". The purpose of shuffle is to organize, sort, and distribute the data to the reducers in a sensible way so that it can be processed efficiently (no wonder people say this is where the miracle happens; there is indeed a lot going on inside).
If you are a Hadoop expert, you may jump up after one glance at the diagram: no! There is also a spill process ...
Hold on. Spill, in my view, is just an implementation detail: MapReduce uses an in-memory buffer to improve efficiency, but the overall flow and principle are unaffected, so we will ignore spill here to keep the picture clear.
Still a little confused by the schematic? That's fine. I have always felt that an article without an example is unfair to the reader, so let's use the familiar WordCount as an example to start our discussion.
First, create two text files as the input for our example:
The content of file1 is:
- My Name is Tony
- My Company is Pivotal
The content of file2 is:
- My Name is Lisa
- My Company is EMC
Step One: map() processes the input splits
First of all, our input is two files, which by default become two splits, corresponding to split0 and split1 in the figure above. The two splits are by default assigned to two mappers. The WordCount example is fairly brute-force at this step: it simply breaks the content of the files into words and 1s, where the word is our key and the 1 that follows is the corresponding value (I assume everyone already has the WordCount program in mind).
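For reference, a minimal mapper for this step might look like the following. This is only a sketch using the org.apache.hadoop.mapreduce API; the class and field names are my own, not necessarily those of the stock example shipped with Hadoop:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in every input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break the line into words and write one (word, 1) record per word.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```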
The output of the two corresponding mappers is then:
The data processed by split0 is:
- My 1
- Name 1
- is 1
- Tony 1
- My 1
- Company 1
- is 1
- Pivotal 1
The data processed by split1 is:
- My 1
- Name 1
- is 1
- Lisa 1
- My 1
- Company 1
- is 1
- EMC 1
Step Two: Partition the output of map()
What is partition? Partitioning simply means splitting the data into regions.
Why partition? Because there may be more than one reducer. Partitioning processes the map output ahead of time, dividing it according to the reducer that will eventually handle it, so that at reduce time each reducer only needs to deal with its own share of the data.
How do we partition? The basic approach is to split the data by key. The most important requirement is that each key ends up in exactly one partition: the reduce work may later run on different nodes, and if the same key appeared on two nodes at once, the reduce result would be wrong. For this reason the most common partitioning method is to hash the key.
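Hadoop's default hash partitioner behaves essentially like the sketch below; writing it out explicitly is just to show the idea, you normally do not need to supply your own:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a partition by hashing it modulo the number of reducers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is never negative, then take the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```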
Using our example, assume there are two reducers. After the two splits have been through partition, the result looks like this:
The data of split0 after map() and partition:
Partition 1:
- Company 1
- is 1
- is 1
Partition 2:
- My 1
- My 1
- Name 1
- Pivotal 1
- Tony 1
Note: The hash of the key is taken modulo the number of reducers (set to 2 here), so there can only be two possible results.
The data of split1 after map() and partition (same procedure as split0):
Partition 1:
- Company 1
- is 1
- is 1
- EMC 1
Partition 2:
- My 1
- My 1
- Name 1
- Lisa 1
Note: Partition 1 is destined for reducer 1 and partition 2 for reducer 2. As you can see, partition simply gathers the records of each region together without any further processing, and a key that appears in one region never appears in the other.
Step Three: Sort
Sort means sorting. In my opinion this step is not strictly necessary and could be left to the user's own program, so why does the framework do it anyway? Probably the people who designed MapReduce thought: "Most reduce programs would like their input already sorted by key, and if so, we might as well do it for you!"
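If you do want control over this ordering, the comparator used for the sort can be swapped on the Job. Below is a sketch; setSortComparatorClass and WritableComparator are standard Hadoop APIs, but the descending comparator itself is only an illustration, and the default ascending (lexicographic) order is what the tables below assume:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts keys in descending order instead of the default ascending (lexicographic) order.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);   // true: deserialize keys so compare() gets real objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);   // flip the natural order of Text keys
    }
}

// In the driver: job.setSortComparatorClass(DescendingTextComparator.class);
```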
So assume the data above has been sorted; the results are as follows:
The data in split0's partition 1 is sorted as follows:
- Company 1
- is 1
- is 1
The data in split0's partition 2 is sorted as follows:
- My 1
- My 1
- Name 1
- Pivotal 1
- Tony 1
The data in split1's partition 1 is sorted as follows:
- Company 1
- EMC 1
- is 1
- is 1
The data in split1's partition 2 is sorted as follows:
- Lisa 1
- My 1
- My 1
- Name 1
Note: Here you can see that the entries in each partition are sorted in the order of key.
Step Four: Combine
What is combine? A combiner can be understood as a mini reduce that runs on the output of each map. Its purpose is to do a first round of aggregation before the results are sent to the reducers, in order to shrink the files and make the subsequent transfer cheaper. It is an optional step.
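For WordCount the combiner can do exactly the same summation as the final reducer, so the same class is typically registered for both roles. A minimal sketch (class and field names are my own; it is wired into the job in the driver shown near the end of the article):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s for each word. Registered as the combiner it pre-aggregates on the
// map side; registered as the reducer it produces the final counts in Step Seven.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();       // e.g. ("is", [1, 1]) becomes ("is", 2)
        }
        sum.set(total);
        context.write(word, sum);
    }
}
```

Run as a combiner, this turns the two ("is", 1) records of split0's partition 1 into a single ("is", 2), which is exactly what the tables below show.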
Running the combiner on the previous output gives:
The data of split0's partition 1 is:
- Company 1
- is 2
The data of split0's partition 2 is:
- My 2
- Name 1
- Pivotal 1
- Tony 1
The data of split1's partition 1 is:
- Company 1
- EMC 1
- is 2
The data of split1's partition 2 is:
- Lisa 1
- My 2
- Name 1
Note: Compared with the previous output, the occurrences of "is" and "My" have already been partially summed, reducing the size of the output files.
Step Five: Copy
Now we are ready to send the output to the reducers. This phase is called copy, but personally I think "download" describes it better: in the implementation it is done over HTTP, with each reducer node going to each mapper node and downloading the data that belongs to its own partition.
According to the earlier partitioning, the downloaded results are as follows:
Reducer node 1 gets two files (partition 1 of split0 and partition 1 of split1):
- Partition 1:
- Company 1
- is 2
- Partition 1:
- Company 1
- EMC 1
- is 2
Reducer node 2 also gets two files (partition 2 of split0 and partition 2 of split1):
- Partition 2:
- My 2
- Name 1
- Pivotal 1
- Tony 1
- Partition 2:
- Lisa 1
- My 2
- Name 1
Note: Through copy, data belonging to the same partition ends up on the same node.
Step Six: Merge
As the previous step shows, the files a reducer receives are downloaded from different mappers and need to be merged into a single file, so next comes the merge. The result is as follows:
The data for Reducer Node 1 is as follows:
- Company 1
- Company 1
- EMC 1
- is 2
- is 2
The data for Reducer Node 2 is as follows:
- Lisa 1
- My 2
- My 2
- Name 1
- Name 1
- Pivotal 1
- Tony 1
Note: The map side also has a merge process, which happens when the spill files written from the in-memory ring buffer are combined.
Step Seven: Reduce
Finally we can run the actual reduce. This step is quite simple: for each merged file it does a final count per key (for WordCount, the same summation the combiner performed, now over all the data of the partition). The results are as follows:
Data for reducer node 1:
- Company 2
- EMC 1
- is 4
Data for reducer node 2:
- Lisa 1
- My 4
- Name 2
- Pivotal 1
- Tony 1
That's it! We have successfully counted the occurrences of each word in the two files and written them to two output files, the legendary part-r-00000 and part-r-00001. Look at the contents of the two files, then look back at the partitioning at the beginning, and the mystery should be clear.
If running WordCount in your own environment produces only one file, part-r-00000, it is probably because you are using the default settings: by default a job has only one reducer.
If you want two, you can either:
1. Add job.setNumReduceTasks(2) to the driver code, setting the number of reducers for this job to two;
or
2. Set the following parameter in mapred-site.xml and restart the service:
- <property>
- <name>mapred.reduce.tasks</name>
- <value>2</value>
- </property>
If it is set in the configuration file, the whole cluster will use two reducers by default.
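To tie the pieces together, a driver along the following lines would wire up the mapper, combiner/reducer, optional partitioner, and the two reduce tasks used in this walkthrough. This is a sketch assuming the class names from the earlier snippets and the newer mapreduce API; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);      // optional mini reduce (Step Four)
        job.setReducerClass(WordCountReducer.class);
        // job.setPartitionerClass(WordPartitioner.class); // hash partitioning is already the default

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                           // produces part-r-00000 and part-r-00001

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the directory holding file1 and file2
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```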
Conclusion:
This article outlines the whole MapReduce flow and what happens at each stage. It does not cover job and resource management or scheduling, because that is where the first-generation MapReduce framework and the YARN framework differ most; the MapReduce principles described above are essentially the same under both.
Note: This article does not go into the details of the map-side ring buffer and spill process.