Learning Hadoop: the shuffle process


Reposted from http://langyu.iteye.com/blog/992916 (thanks to the original author for sharing). Anyone interested in Hadoop performance tuning should pay close attention to the shuffle stage.

     The shuffle process is the core of MapReduce, sometimes described as the place where the magic happens. The everyday meaning of shuffle is to shuffle or mix up; you may be familiar with the Collections.shuffle(List) method in the Java API, which randomly rearranges the elements of the list passed to it. If you are not sure what shuffle means in MapReduce, see this picture:



    In a clustered environment such as Hadoop, most map tasks and reduce tasks run on different nodes, so in many cases a reduce task has to pull map task results across the network from other nodes. If the cluster is running many jobs, the normal execution of tasks puts serious pressure on the network resources inside the cluster. This network cost is normal and cannot be eliminated; what we can do is minimize unnecessary transfers. Within a node, disk I/O (compared with memory) also has a significant effect on job completion time. Starting from these basic requirements, our expectations of the shuffle process are:

    • Pull the map task's output data completely over to the reduce side.
    • Minimize unnecessary bandwidth consumption when pulling data across nodes.
    • Reduce the impact of disk I/O on task execution.

The best places to optimize are reducing the amount of data that has to be pulled and using memory instead of disk wherever possible. Take WordCount as an example, and assume it has 8 map tasks and 3 reduce tasks. The shuffle process spans both the map side and the reduce side, so I will describe it in two parts.
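To make the running example concrete, a minimal driver sketch for such a WordCount job might look like the following. WordCountMapper and WordCountReducer are hypothetical class names (sketched later in this article), and the org.apache.hadoop.mapreduce API of recent Hadoop releases is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver sketch for the WordCount job used as the running example.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // sketched in step 2 below
            job.setReducerClass(WordCountReducer.class);  // sketched after the map-side merge
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(3);                     // 3 reduce tasks, as assumed above
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The number of map tasks is not set here; it follows from the number of input splits, as described in step 1 below.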
Let's look at the map side first, as shown in the figure:
 
I have divided the whole map-side flow into four steps. Each map task has a memory buffer that stores the map's output. When the buffer is nearly full, its contents need to be spilled to a temporary file on disk. When the entire map task is finished, all the temporary files produced by that map task are merged into the final, official output file, which then waits for the reduce tasks to pull its data.
Of course, each step here may contain several sub-steps and details, which I will explain one by one:
1. When a map task executes, its input data comes from blocks on HDFS. In MapReduce terms, the map task reads a split; the relationship between splits and blocks may be many-to-one, and by default it is one-to-one. In the WordCount example, assume the map's input is a string such as "aaa".
2. After the mapper runs, we know its output is a key/value pair like this: the key is "aaa" and the value is 1. At this point the map side only does a +1 operation; the per-key results will be merged in the reduce task. We also know that this job has 3 reduce tasks, so the next thing to decide is which reduce task the current "aaa" should be sent to.
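A minimal mapper sketch that produces exactly these ("aaa", 1) pairs might look like this (class and field names are illustrative; the standard org.apache.hadoop.mapreduce.Mapper API is assumed):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token in the input line, e.g. ("aaa", 1).
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // each write goes into the map-side buffer described below
            }
        }
    }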
3. MapReduce provides the Partitioner interface. Its job is to decide, based on the key (or value) and the number of reduce tasks, which reduce task should ultimately process the current output pair. By default the key is hashed and then taken modulo the number of reduce tasks. This default behavior only spreads the load evenly across the reducers; if you have specific partitioning requirements, you can implement your own Partitioner and set it on the job.
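For reference, the default HashPartitioner does roughly what the sketch below shows: hash the key, mask off the sign bit, and take the result modulo the number of reduce tasks. A custom partitioner follows the same pattern and is registered with job.setPartitionerClass(...); the class name here is illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Roughly the default behavior: hash the key and take it modulo the
    // number of reduce tasks, so keys spread evenly across reducers.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }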
4. In our example, the Partitioner returns 0 for "aaa", meaning this pair should be handled by the first reducer. Next the data is written into the memory buffer. The buffer's role is to collect map results in batches and reduce the impact of disk I/O. The key/value pair and the partition result are all written into the buffer; of course, the key and value are serialized into byte arrays before being written.
5. The entire memory buffer is a byte array; I have not studied its byte indexing and key/value storage structure in detail. If any reader has looked into it, please describe it in more detail.
This memory buffer has a size limit, 100 MB by default. When the map task produces a lot of output, it can overflow the buffer, so the buffered data must be temporarily written to disk and the buffer then reused. The process of writing data from memory to disk is called a spill. The spill is performed by a separate thread so that it does not block the thread writing map results into the buffer; otherwise map output would stall whenever a spill started. The buffer therefore has a spill threshold, spill.percent, which defaults to 0.8: when the buffered data reaches the threshold (buffer size * spill percent = 100 MB * 0.8 = 80 MB), the spill thread starts and locks those 80 MB to write them out, while the map task keeps writing its output into the remaining 20 MB. The two do not interfere with each other.
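Both the buffer size and the spill threshold are ordinary job configuration properties. A hedged sketch, simply restating the defaults described above (property names vary by Hadoop version; io.sort.mb and io.sort.spill.percent are the classic names, with the newer equivalents noted in comments):

    import org.apache.hadoop.conf.Configuration;

    // Sketch: the map-side sort buffer settings discussed above.
    public class MapSideBufferConfig {
        public static Configuration create() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 100);             // in-memory buffer size, in MB (classic key)
            conf.set("io.sort.spill.percent", "0.80");  // spill starts at 100 MB * 0.8 = 80 MB
            // Newer releases use mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent.
            return conf;
        }
    }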
When the spill thread starts, it sorts the keys in this 80 MB of space. Sorting is the default behavior of the MapReduce model, and here the sort is over the serialized bytes.
We can also see that since the map task's output must be sent to different reduce tasks, and the memory buffer does not merge data destined for the same reduce task, this grouping has to appear in the disk file. The official diagram likewise shows that a spill file written to disk groups the values for the different reduce ends together. So an important detail of the spill process is that if many key/value pairs need to be sent to the same reduce task, they are concatenated together within one partition, which reduces the number of partition-related index records.
When the data for a given reduce task is grouped together, some of it may look like this: "aaa"/1, "aaa"/1. For WordCount we simply count how often each word occurs, so if a key such as "aaa" appears many times in the results of a single map task, their values should be merged together. This process is sometimes loosely called reduce, and also called combine. In MapReduce terminology, however, reduce refers to the computation the reduce side performs after fetching data from multiple map tasks; the informal map-side merging of data can only be called combine. In fact, as you may know, MapReduce treats the Combiner as equivalent to the Reducer.
If the client has set a Combiner, now is the time to use it: the values of key/value pairs that share the same key are added together, which reduces the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it may be applied multiple times throughout the model. In which scenarios can a combiner be used? From this analysis, the combiner's output becomes the reducer's input, and the combiner must not change the final result. So in my view, a combiner should only be used when the reduce's input key/value types exactly match its output key/value types and the operation does not affect the final result, such as accumulation or taking a maximum. Combiners must be used with care: used well, they help job efficiency; used badly, they affect the final result of the reduce.
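For WordCount the reducer itself satisfies these conditions, since summing partial counts does not change the final result, so it can double as the combiner. A hedged sketch of the extra driver setup, reusing the hypothetical WordCountReducer class:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: the extra lines a WordCount driver needs to enable the combiner.
    // Reusing the reducer is safe here because summing is associative and commutative.
    public final class CombinerSetup {
        public static void applyTo(Job job) {
            job.setCombinerClass(WordCountReducer.class);
            // Map output types must match the combiner's input types.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
        }
    }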
Each spill generates a spill file on disk. If the map output is really large, many spills occur and many spill files accumulate on disk. When the map task is truly finished, the data remaining in the memory buffer is also spilled to disk, forming one more spill file, so there will be at least one such file on disk (if the map output is very small, only one spill file is produced when the map finishes). Because the final output must be a single file, the spill files need to be merged together; this process is called merge. What does merge do? As in the previous example, "aaa" is read with a value of 5 from one spill file and with a value of 8 from another. Because they have the same key, they must be merged into a group. What is a group? For "aaa" it looks like this: {"aaa", [5, 8, 2, ...]}, where the values in the array come from different spill files and are then added together. Note that because merge combines multiple spill files into one, the same key may appear in several of them; if the client has set a combiner, the combiner is used here as well to merge values with the same key.
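The {"aaa", [5, 8, 2, ...]} grouping is exactly the shape the reducer (and a combiner) consumes. A minimal reducer sketch that sums such a group, matching the hypothetical WordCountReducer name used in the driver above:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives a group such as {"aaa", [5, 8, 2, ...]} and sums the values.
    // Because its input and output types match, it can also serve as the combiner.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();  // add up counts coming from different spill/map outputs
            }
            total.set(sum);
            context.write(key, total);
        }
    }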
At this point, all the work on the map side is finished, and the resulting file sits in a local directory that the TaskTracker can reach. Each reduce task continuously asks the JobTracker via RPC whether map tasks have completed; once a reduce task is notified that the map task on some TaskTracker has finished, the second half of the shuffle process starts.
To put it simply, the reduce task's job is to continuously pull the final results of every map task in the current job, merge the data pulled from different places, and eventually form one file that becomes the reduce task's input. See:


As with the map-side details, the reduce-side shuffle can be summarized in the three points shown in the figure. The precondition for the reduce task to start copying data is that it learns from the JobTracker which map tasks have finished. Everything that happens before the Reducer actually runs is pulling data, merging it, and repeating. As before, I will describe the reduce-side shuffle details in steps:
1. The copy phase simply pulls data. The reduce process starts some data-copy threads (Fetchers), which request the output files of the map tasks over HTTP from the TaskTrackers that ran them. Since the map tasks have already finished, these files are managed by the TaskTrackers on local disk.
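The number of these copy threads is configurable. A hedged sketch (mapred.reduce.parallel.copies is the classic property name and 5 the usual default; newer releases use mapreduce.reduce.shuffle.parallelcopies):

    import org.apache.hadoop.conf.Configuration;

    // Sketch: raising the number of parallel fetcher threads on the reduce side.
    public class FetcherConfig {
        public static Configuration create() {
            Configuration conf = new Configuration();
            conf.setInt("mapred.reduce.parallel.copies", 10);  // classic key (default 5)
            // Newer releases: conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
            return conf;
        }
    }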
2. The merge phase. The merge here is like the merge on the map side, except that the array now holds values copied from different map outputs. The copied data is first placed into a memory buffer; the buffer size here is more flexible than on the map side because it is set as a fraction of the JVM heap. Since the Reducer does not run during the shuffle phase, most of the heap should be given over to the shuffle. Note that the merge takes three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is not enabled by default, which surprises some people. When the amount of data in memory reaches a threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if a combiner is configured it is applied here too, and many spill files are generated on disk. This second merge form keeps running until no more data arrives from the map side, and then the third form, the disk-to-disk merge, runs to produce the final file.
3. The Reducer's input file. The spill files are merged continuously until a "final file" is produced. Why the quotation marks? Because this file may live on disk or in memory. We would of course prefer it to stay in memory and serve directly as the Reducer's input, but by default it is written to disk. Once the Reducer's input file is determined, the whole shuffle is over; the Reducer then runs and writes its results to HDFS.
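The preference for memory versus disk mentioned above is governed by a few reduce-side shuffle properties. A hedged sketch using the classic property names (newer releases use the mapreduce.reduce.* equivalents noted in comments); the values shown are believed to be the usual defaults:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: reduce-side shuffle memory settings discussed above.
    public class ReduceShuffleConfig {
        public static Configuration create() {
            Configuration conf = new Configuration();
            // Fraction of the heap used to hold copied map outputs during shuffle.
            conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
            // Usage level at which the memory-to-disk merge starts.
            conf.set("mapred.job.shuffle.merge.percent", "0.66");
            // Fraction of the heap that may still hold map outputs when the Reducer starts.
            // The default of 0.0 is why the "final file" normally ends up on disk.
            conf.set("mapred.job.reduce.input.buffer.percent", "0.0");
            // Newer releases: mapreduce.reduce.shuffle.input.buffer.percent,
            // mapreduce.reduce.shuffle.merge.percent, mapreduce.reduce.input.buffer.percent.
            return conf;
        }
    }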
