MapReduce: Detailed introduction to Shuffle's execution process

Source: Internet
Author: User
Tags shuffle

The shuffle process is the heart of MapReduce, sometimes called the place where the magic happens. To understand MapReduce, you must understand the shuffle. I have read a lot of material on it, but each time I came away confused: I could never piece together the overall logic, and the more I read, the more muddled I became. Some time ago, while tuning the performance of MapReduce jobs at work, I had to dig into the code to study how MapReduce operates, and that led me to the shuffle. Because the material I had read before frustrated me, I have tried my best to explain the shuffle clearly here, so that anyone who wants to understand how it works can get something out of it. If you have any questions or suggestions about this article, please leave a comment. Thank you!

The ordinary meaning of shuffle is to mix up or randomize, and many readers will be familiar with the Collections.shuffle(List) method in the Java API, which randomly permutes the elements of the List passed to it. If you don't know what shuffle means in MapReduce, take a look at this picture:


This is the official illustration of the shuffle process. But I am fairly sure you cannot understand the shuffle from this diagram alone, because it differs considerably from what actually happens and its details are muddled. I'll describe what actually happens below; for now you only need to know the rough scope of the shuffle: how to transfer the output of the map tasks to the reduce side efficiently. Put another way, the shuffle describes the path data takes from map task output to reduce task input.

In a cluster environment such as Hadoop, most map tasks and reduce tasks run on different nodes, so in many cases a reduce task has to pull map task results from other nodes across the network. If the cluster is running many jobs, the normal execution of tasks consumes a great deal of the cluster's internal network resources. This consumption is normal and we cannot limit it; what we can do is minimize unnecessary consumption. Within a node, disk IO also has a significant effect on job completion time compared with memory. At the most basic level, our expectations of the shuffle process are:
Pull all of the data from the map task side to the reduce side.
Use as little bandwidth as possible when pulling data across nodes.
Reduce the impact of disk IO on task execution.
OK, at this point you can stop and think: if you were designing the shuffle yourself, what would your design goals be? The main things I would want to optimize are reducing the amount of data pulled and using memory instead of disk wherever possible.
My analysis is based on the Hadoop 0.21.0 source code; if the shuffle process you know differs, please don't hesitate to point it out. I'll use WordCount as the example and assume it has 8 map tasks and 3 reduce tasks. As you can see, the shuffle spans both the map side and the reduce side, so I'll describe it in two parts.
Let's look at the map side first, as shown in the figure:


The figure above may show the operation of a single map task. Compare it with the left half of the official diagram and you'll find many inconsistencies. The official diagram does not clearly state at which stage partition, sort, and combiner actually take effect. I drew this diagram to make the whole process clear, from the map's data input to the point where the map side has all of its data ready.

I divide the whole process into four steps. Put simply, each map task has a memory buffer that stores the map output. When the buffer is nearly full, its data is written to disk as a temporary file. When the whole map task finishes, all the temporary files it produced on disk are merged into the final output file, which then waits for the reduce tasks to pull its data.

Of course, each step here may involve several sub-steps and details, which I'll explain one by one:
1, When a map task executes, its input data comes from HDFS blocks; in MapReduce terms, though, the map task only reads a split. The correspondence between splits and blocks may be many-to-one; by default it is one-to-one. In the WordCount example, assume the input to the map is strings such as "AAA".

2, After the mapper runs, we know its output is a key/value pair like this: the key is "AAA" and the value is 1. Because the map side only emits a count of 1 for each occurrence, the results are summed in the reduce task. We already know this job has 3 reduce tasks, so now it is time to decide which reduce task the current "AAA" should go to.
MapReduce provides the Partitioner interface, which decides, based on the key or value and the number of reduce tasks, which reduce task should ultimately process the current output pair. By default, the key is hashed and the hash is taken modulo the number of reduce tasks. This default only aims to spread the load evenly across the reducers; if you have your own partitioning requirements, you can write a custom Partitioner and set it on the job.
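
The default rule described above (hash the key, then take it modulo the number of reduce tasks) can be sketched in a few lines. This is only an illustration: the class and method names are made up, and it hashes a plain java.lang.String rather than a Hadoop key type, so the partition numbers it produces are not necessarily the ones a real job would compute.

```java
// Minimal sketch of hash partitioning; names here are hypothetical.
class HashPartitionSketch {

    // Returns a partition index in the range [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // result stays non-negative even when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // the article's example job has 3 reduce tasks
        for (String key : new String[] {"AAA", "hello", "world"}) {
            System.out.println(key + " -> reduce " + partitionFor(key, numReduceTasks));
        }
    }
}
```

The mask matters because the remainder operator in Java keeps the sign of its left operand; without it, a key with a negative hash would be assigned a negative partition number.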

In our example, "AAA" returns 0 from the partitioner, meaning this pair should be handled by the first reducer. Next, the data is written to the memory buffer, whose purpose is to collect map results in batches and so reduce the impact of disk IO. Our key/value pair and its partition result are both written to the buffer. Of course, the key and value are serialized into byte arrays before being written.

The entire memory buffer is a byte array. I have not studied its byte indexing or how key/value pairs are laid out in it. If any reader has studied this, please describe it in more detail.
  
3, The memory buffer has a size limit, 100MB by default. When a map task produces a lot of output, it could exhaust the buffer, so under certain conditions the buffered data has to be written temporarily to disk so the buffer can be reused. This process of writing data from memory to disk is called spill, which literally means to overflow. The spill is done by a separate thread and does not affect the thread that writes map results into the buffer. Map output should not be blocked when the spill thread starts, so the buffer has a spill ratio, spill.percent. This ratio defaults to 0.8; that is, when the data in the buffer reaches the threshold (buffer size * spill percent = 100MB * 0.8 = 80MB), the spill thread starts, locks that 80MB of memory, and performs the spill. The map task can keep writing its output into the remaining 20MB of memory, and the two do not interfere with each other.
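
The threshold arithmetic above can be written out as a small sketch. The class and method names are hypothetical, and the two numbers are simply the defaults quoted in the text, not values read from a real job configuration.

```java
// Sketch of the spill threshold calculation; names are hypothetical.
class SpillThresholdSketch {

    // Number of buffered bytes at which the spill (overflow) thread would start.
    static long spillThresholdBytes(long bufferBytes, double spillPercent) {
        return (long) (bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long bufferBytes = 100 * mb;  // default buffer size quoted in the text
        double spillPercent = 0.8;    // default spill ratio quoted in the text

        long threshold = spillThresholdBytes(bufferBytes, spillPercent);
        long remaining = bufferBytes - threshold;

        System.out.println("spill starts at " + (threshold / mb) + "MB");
        System.out.println("still writable:  " + (remaining / mb) + "MB");
    }
}
```

Raising the ratio leaves less room for the map thread to keep writing while a spill is in progress; lowering it triggers spills earlier and so produces more, smaller spill files.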

When the spill thread starts, it needs to sort the keys in this 80MB space. Sorting is the default behavior of the MapReduce model, and the sort here operates on the serialized bytes.

Think about it: the output of a map task has to be sent to different reduce sides, and the memory buffer does not group together the data destined for the same reduce side, so this grouping must show up in the disk file. The official diagram also shows that a spill file written to disk groups the values bound for different reduce sides. So an important detail of the spill process is that if many key/value pairs need to be sent to the same reduce side, they are concatenated together, which reduces the number of partition-related index records.
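
One way to get both properties described above (keys sorted as serialized bytes, and records for the same reduce side stored next to each other) is to order the buffered records by partition first and then by key bytes. The sketch below works under that assumption; the Rec class, the comparator, and the partition numbers in main are all made up for illustration and say nothing about how the real buffer is laid out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of ordering buffered records before a spill; all names are hypothetical.
class SpillSortSketch {

    // Hypothetical buffered record: target partition plus serialized key and value.
    static final class Rec {
        final int partition;
        final byte[] key;
        final byte[] value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key.getBytes(StandardCharsets.UTF_8);
            this.value = value.getBytes(StandardCharsets.UTF_8);
        }

        String keyText() {
            return new String(key, StandardCharsets.UTF_8);
        }
    }

    // Lexicographic comparison of raw bytes, treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Assumed spill order: partition first, then serialized key.
    static final Comparator<Rec> SPILL_ORDER = (x, y) -> {
        if (x.partition != y.partition) {
            return Integer.compare(x.partition, y.partition);
        }
        return compareBytes(x.key, y.key);
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<Rec> sortForSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        sorted.sort(SPILL_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Rec> buffered = new ArrayList<>();
        buffered.add(new Rec(2, "CCC", "1")); // partition numbers are arbitrary here
        buffered.add(new Rec(0, "BBB", "1"));
        buffered.add(new Rec(1, "DDD", "1"));
        buffered.add(new Rec(0, "AAA", "1"));
        for (Rec r : sortForSpill(buffered)) {
            System.out.println("partition " + r.partition + "  key " + r.keyText());
        }
    }
}
```

With the records in this order, everything bound for one reduce side forms a single contiguous run in the spill file, so one index record per partition is enough to locate it.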

When merging the data for each reduce side, some data might look like this: "AAA"/1, "AAA"/1. In the WordCount example, which simply counts word occurrences, if the same map task's results contain many occurrences of a key like "AAA", we should merge their values together. This process is called reduce, and also combine. In MapReduce terminology, however, reduce refers only to the process in which the reduce side fetches data from multiple map tasks and performs its computation. Any other informal merging of data can only count as combine. In fact, as you know, MapReduce treats a Combiner as equivalent to a Reducer.

If the client has set a Combiner, now is the time to use it. It adds up the values of key/value pairs with the same key, reducing the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it is used multiple times throughout the model. In which scenarios can a Combiner be used? From this analysis, the output of the Combiner is the input of the Reducer, and the Combiner must never change the final result. So in my view, a Combiner should only be used when the reducer's input key/value types are exactly the same as its output key/value types and applying it does not affect the final result, for example summation or taking a maximum. Use the Combiner with care: used well, it improves job efficiency; used badly, it corrupts the final result of the reduce.
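
For WordCount, the combine step amounts to adding up the values that share a key before they are written out. The following is a minimal stand-alone sketch of that idea using ordinary Java collections; it does not use the Hadoop Reducer API, and the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a WordCount-style combine step; names are hypothetical.
class CombinerSketch {

    // Small helper to build a key/value pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Sums the values of identical keys, so "AAA"/1, "AAA"/1 becomes "AAA"/2.
    static Map<String, Integer> combineSum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
            Arrays.asList(pair("AAA", 1), pair("BBB", 1), pair("AAA", 1));
        System.out.println(combineSum(mapOutput));
    }
}
```

Summation is safe here because adding partial sums gives the same total as adding every value at once. A mean is the usual counter-example: averaging partial averages generally does not give the average of all the values.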

4, Each spill produces a spill file on disk. If the map output is really large and spills happen many times, there will be multiple spill files on disk. When the map task finally completes, all remaining data in the memory buffer is also spilled to disk as a spill file. So in the end there is at least one spill file on disk (if the map output is small, only one spill file is produced, when the map finishes). Because the final output must be a single file, the spill files need to be merged together, and this process is called merge. What does merge do? Continuing the earlier example, suppose "AAA" has the value 5 in one spill file and the value 8 in another; because they have the same key, they have to be merged into a group. What is a group? For "AAA" it looks like this: {"AAA", [5, 8, 2, ...]}, where the values in the array are read from different spill files and are then added together. Note that because merge combines multiple spill files into one file, the same key may appear more than once; in that case, if the client has set a Combiner, it is used to merge the identical keys.
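
The grouping described above can be sketched as a k-way merge over already sorted runs. In this illustration each spill (overflow) file is just an in-memory list sorted by key, and every name is hypothetical; the real merge works on files and partitions, which the sketch leaves out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of merging sorted spill runs into groups; names are hypothetical.
class SpillMergeSketch {

    // Read position inside one sorted run (a stand-in for one spill file).
    static final class Cursor {
        final List<Map.Entry<String, Integer>> run;
        int pos = 0;

        Cursor(List<Map.Entry<String, Integer>> run) {
            this.run = run;
        }

        Map.Entry<String, Integer> head() {
            return run.get(pos);
        }
    }

    static Map.Entry<String, Integer> entry(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Merges key-sorted runs and collects the values of equal keys into one group.
    // Keys come out in sorted order; the order of values inside a group is not fixed.
    static Map<String, List<Integer>> mergeAndGroup(List<List<Map.Entry<String, Integer>>> runs) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head().getKey().compareTo(b.head().getKey()));
        for (List<Map.Entry<String, Integer>> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll(); // run whose current key is smallest
            Map.Entry<String, Integer> e = c.head();
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // put the run back, now keyed by its next record
            }
        }
        return groups;
    }

    // Combiner-style step for WordCount: add up one group's values.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> spills = Arrays.asList(
            Arrays.asList(entry("AAA", 5), entry("BBB", 1)),
            Arrays.asList(entry("AAA", 8)),
            Arrays.asList(entry("AAA", 2), entry("CCC", 4)));
        mergeAndGroup(spills).forEach((k, v) -> System.out.println(k + " " + v + " -> " + sum(v)));
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, so it never needs to hold all of the data in memory at once.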

At this point, all the work on the map side is finished, and the final file is stored in a local directory that the TaskTracker can reach. Each reduce task continuously polls the JobTracker over RPC to learn whether map tasks have completed. Once a reduce task is notified that a map task on some TaskTracker has completed, the second half of the shuffle begins.

Put simply, the work of a reduce task before it executes is to keep pulling the final results of every map task in the current job, keep merging the data pulled from different places, and eventually form a single file as the input to the reduce task. See the figure:


As with the map-side details, the shuffle on the reduce side can be summarized by the three points marked on the figure. The precondition for a reduce task to copy data is that it learns from the JobTracker which map tasks have finished; I won't go into that process here, and interested readers can trace it themselves. Before the Reducer actually runs, all of its time is spent pulling data and merging, over and over. As before, I'll describe the reduce-side shuffle details in stages:

1, The copy phase, which simply pulls the data. The reduce process starts several data copy threads (Fetcher) that request the map tasks' output files over HTTP from the TaskTrackers where the map tasks ran. Because the map tasks have already finished, these files are managed by the TaskTracker on its local disk.

2, The merge phase. This merge is like the merge on the map side, except that what is being merged is the data copied from the different map sides. The copied data is first placed in a memory buffer, whose size is more flexible than on the map side because it is based on the JVM heap size setting; since the Reducer is not running during the shuffle phase, most of the memory should be given to the shuffle. It should be emphasized that the merge takes three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is not enabled by default, which is rather confusing. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if a Combiner has been set it is applied here too, and a number of spill files are then produced on disk. The second form of merge keeps running until there is no more data from the map side, and then the third form, disk to disk, runs to produce the final file.
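
The interplay of the second and third merge forms can be illustrated with a toy model that tracks only segment sizes. Everything in it is an assumption made for the sake of the example: the class name, the memory limit, the threshold fraction, and the choice to write leftover in-memory segments to disk before the final merge (which matches the default case where the final file ends up on disk).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reduce-side merge; it tracks sizes only and all names are hypothetical.
class ReduceMergeSketch {

    final long memoryLimitBytes;   // assumed share of the heap given to the shuffle
    final double mergeThreshold;   // assumed fraction of that share that triggers a merge
    final List<Long> inMemory = new ArrayList<>(); // sizes of copied segments held in memory
    final List<Long> onDisk = new ArrayList<>();   // sizes of files written by memory-to-disk merges

    ReduceMergeSketch(long memoryLimitBytes, double mergeThreshold) {
        this.memoryLimitBytes = memoryLimitBytes;
        this.mergeThreshold = mergeThreshold;
    }

    static long total(List<Long> sizes) {
        long sum = 0;
        for (long s : sizes) {
            sum += s;
        }
        return sum;
    }

    // Called when a copy thread has fetched one map output segment.
    void onSegmentCopied(long segmentBytes) {
        inMemory.add(segmentBytes);
        if (total(inMemory) >= memoryLimitBytes * mergeThreshold) {
            mergeMemoryToDisk();
        }
    }

    // Form 2, memory to disk: everything in memory becomes one file on disk.
    void mergeMemoryToDisk() {
        if (inMemory.isEmpty()) {
            return;
        }
        onDisk.add(total(inMemory));
        inMemory.clear();
    }

    // Form 3, disk to disk: run once no more map output is arriving.
    long finalMerge() {
        mergeMemoryToDisk(); // assumption: leftovers are written out first
        long finalFileBytes = total(onDisk);
        onDisk.clear();
        onDisk.add(finalFileBytes);
        return finalFileBytes;
    }

    public static void main(String[] args) {
        ReduceMergeSketch shuffle = new ReduceMergeSketch(100, 0.5);
        for (long size : new long[] {20, 20, 20, 20, 10}) {
            shuffle.onSegmentCopied(size);
        }
        System.out.println("files on disk before final merge: " + shuffle.onDisk.size());
        System.out.println("final file size: " + shuffle.finalMerge());
    }
}
```

The model shows why a low threshold produces many small files for the disk-to-disk merge to process, while a high one keeps more data in memory between merges.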

3, The Reducer's input file. After continuous merging, a "final file" is eventually produced. Why the quotation marks? Because this file may be on disk or it may be in memory. We would of course prefer it to be in memory, serving directly as input to the Reducer, but by default it is stored on disk. How to keep this file in memory is something I'll cover later when discussing performance optimization. Once the Reducer's input file is determined, the whole shuffle is finished. The Reducer then executes and writes its results to HDFS.

