http://langyu.iteye.com/blog/992916
The word shuffle literally means to mix things up, as with a deck of cards. In MapReduce, it describes how the output of a map task is transferred efficiently to the reduce side.
In a cluster environment such as Hadoop, most map tasks and reduce tasks run on different nodes, so in many cases a reduce task has to pull map task results from other nodes across the network. If the cluster is running many jobs, normal task execution consumes a great deal of network resources inside the cluster. This consumption is normal and cannot be restricted; what we can do is minimize the unnecessary part of it. Within a node, disk IO also has a significant effect on job completion time compared with memory.
At the most basic level, what we expect from the shuffle process is:
- Pull the data completely from the map side to the reduce side.
- As much as possible, reduce the unnecessary consumption of bandwidth when pulling data across nodes.
- Reduce the impact of disk IO on task execution.
My analysis is based on the Hadoop 0.21.0 source code; if the shuffle process you know differs from it, do not hesitate to point that out. I'll use WordCount as the example and assume it has 8 map tasks and 3 reduce tasks. The shuffle process spans both the map side and the reduce side, so I'll describe it in two parts.
Let's look at the map side first, as shown in the figure:
The official figure does not make clear at which stage partition, sort and combiner actually take effect. I drew this diagram to show the whole process, from the map input until all the data on the map side is ready.
I divide the whole process into four steps. Put simply, each map task has a memory buffer that holds the map output. When the buffer is almost full, its data is written to disk as a temporary file. When the whole map task ends, all the temporary files the task produced are merged into the final official output file, which then waits for the reduce tasks to pull it.
1. When a map task executes, its input data comes from HDFS blocks, although in MapReduce terms a map task only reads a split. The relationship between splits and blocks may be many-to-one; the default is one-to-one. In the WordCount example, assume the map input is a string such as "AAA".
2. After the mapper runs, its output is a key/value pair: the key is "AAA" and the value is 1. The map side only emits a count of 1 for each occurrence; the results are merged in the reduce task. We already know this job has 3 reduce tasks, so now is the time to decide which of them should handle the current "AAA".
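The map output for WordCount can be pictured with a minimal sketch. The function name wordcount_map is made up for illustration, and plain Python tuples stand in for Hadoop's serialized key/value pairs.

```python
def wordcount_map(line):
    """Emit one (word, 1) pair per whitespace-separated token (illustrative only)."""
    return [(word, 1) for word in line.split()]

pairs = wordcount_map("AAA BBB AAA")
print(pairs)  # [('AAA', 1), ('BBB', 1), ('AAA', 1)]
```

Note that the two "AAA" pairs are not added together here; that is left to the later stages.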
MapReduce provides the Partitioner interface, which decides, based on the key or value and the number of reduce tasks, which reduce task should ultimately process the current output pair. By default, the key is hashed and the hash is taken modulo the number of reduce tasks. The default scheme only aims to spread the processing load evenly across the reduce tasks; if you have your own partitioning requirements, you can write a custom Partitioner and set it on the job.
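The default rule, hash the key and take the remainder, can be sketched as follows. This is only an illustration under stated assumptions: java_string_hash mimics Java's String.hashCode() for plain strings, and both function names are invented here. Hadoop's own key types compute their hash codes in their own way, so the partition number for a given key in a real job may differ.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() as a signed 32-bit integer (assumes BMP characters)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def default_partition(key, num_reduce_tasks):
    """Hash the key, clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

print(default_partition("AAA", 3))  # 0
```

Clearing the sign bit before the modulo keeps the result non-negative even when the hash code itself is negative.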
In our example, the Partitioner returns 0 for "AAA", meaning that this pair should be handled by the first reduce task. Next, the data has to be written into the memory buffer, whose purpose is to collect map results in batches and so reduce the impact of disk IO. Both the key/value pair and its partition result are written to the buffer. Before being written, of course, the key and the value are serialized into byte arrays.
The whole memory buffer is a byte array. I have not studied its byte indexing or how key/value pairs are laid out in it; if anyone has, please describe it in more detail.
3. This memory buffer has a limited size, 100MB by default. When a map task produces a lot of output, the buffer could run out of space, so under certain conditions the data in the buffer has to be written temporarily to disk so that the buffer can be reused. This process of writing data from memory to disk is called spill, which translates literally as an overflow write; the name is intuitive enough. The spill is done by a separate thread and does not affect the thread that writes map results into the buffer. Map output should not be blocked while the spill thread is running, so the buffer has a spill ratio, spill.percent. It defaults to 0.8: when the data in the buffer reaches the threshold (buffer size * spill percent = 100MB * 0.8 = 80MB), the spill thread starts, locks those 80MB of memory, and performs the spill. The map task can keep writing its output into the remaining 20MB, and the two do not interfere with each other.
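The threshold arithmetic can be shown with a toy model. This sketch is single-threaded, whereas the real spill runs in its own thread; the class MapOutputBuffer and its fields are invented for illustration, and sizes are counted as the serialized bytes of key plus value only.

```python
class MapOutputBuffer:
    """Toy model of the map-side buffer: counts bytes and spills at a threshold."""

    def __init__(self, capacity_bytes, spill_percent=0.8):
        self.threshold = round(capacity_bytes * spill_percent)
        self.records = []   # (partition, key_bytes, value_bytes) still in memory
        self.used = 0       # bytes currently held in the buffer
        self.spills = []    # each entry stands for one spill file on disk

    def collect(self, partition, key, value):
        k, v = key.encode("utf-8"), str(value).encode("utf-8")
        self.records.append((partition, k, v))
        self.used += len(k) + len(v)
        if self.used >= self.threshold:
            self.spill()

    def spill(self):
        if self.records:
            self.spills.append(self.records)
            self.records, self.used = [], 0

MB = 1024 * 1024
print(MapOutputBuffer(100 * MB).threshold // MB)  # 80

buf = MapOutputBuffer(capacity_bytes=20)  # threshold: 16 bytes
for word in ["AAA", "BBB", "AAA", "CCC", "AAA"]:
    buf.collect(0, word, 1)               # each record takes 4 bytes
buf.spill()                               # flush the remainder when the map task ends
print([len(s) for s in buf.spills])       # [4, 1]
```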
When the spill thread starts, it has to sort the keys in this 80MB of space. Sorting is default behavior in the MapReduce model, and the sort here operates on the serialized bytes.
Consider this: the output of a map task has to be sent to different reduce tasks, and the memory buffer does not group together the data destined for the same reduce task, so this grouping has to show up in the disk file. The official figure also shows that a spill file written to disk has the values for each reduce task merged together. So an important detail of the spill process is that when many key/value pairs are destined for the same reduce task, they are concatenated together, which reduces the number of partition-related index records.
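One way to picture the layout of a spill file is the sketch below: records are sorted by partition and then by key bytes, and one index entry per partition records where that partition's run begins. The function write_spill and the (start, count) index format are hypothetical; the real on-disk format is not described in this article.

```python
def write_spill(records):
    """Sort records by (partition, key bytes) and build a per-partition index.

    records: list of (partition, key_bytes, value_bytes).
    Returns (ordered_records, index), where index maps partition -> (start, count).
    """
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    index = {}
    for pos, (partition, _key, _value) in enumerate(ordered):
        start, count = index.get(partition, (pos, 0))
        index[partition] = (start, count + 1)
    return ordered, index

records = [(1, b"BBB", b"1"), (0, b"AAA", b"1"), (2, b"CCC", b"1"), (0, b"AAA", b"1")]
ordered, index = write_spill(records)
print(index)  # {0: (0, 2), 1: (2, 1), 2: (3, 1)}
```

Because all pairs for one reduce task sit next to each other, a single index entry per partition is enough, however many pairs the partition holds.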
When the data for each reduce task is merged, some of it may look like this: "AAA"/1, "AAA"/1. WordCount simply counts how often each word occurs, so if the same map task produces many pairs with the same key, such as "AAA", their values should be merged into one. This process is called reduce, and also combine. In MapReduce terminology, however, reduce refers only to the computation that the reduce side performs on data fetched from multiple map tasks; any other, informal merging of data counts only as combine. In fact, as you know, MapReduce treats a Combiner as a kind of Reducer.
If the client has set a Combiner, this is when it is used: the values of key/value pairs with the same key are added together, reducing the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it is used several times throughout the model. In which scenarios can a Combiner be used? From the analysis here, the Combiner's output is the Reducer's input, and the Combiner must never change the final result. So in my view, a Combiner should only be used when the Reducer's input key/value types are exactly the same as its output key/value types and applying it does not affect the final result, as with summation or taking a maximum. Use the Combiner with care: used well, it improves job execution efficiency; used badly, it affects the final result of reduce.
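The condition that a Combiner must not change the final result can be checked with a small sketch. The helper names combine and mean are made up for illustration; the averaging case is added here as a counterexample and does not come from the original analysis.

```python
from itertools import groupby
from operator import itemgetter

def combine(sorted_pairs, combine_fn):
    """Collapse each run of equal keys in key-sorted pairs into one pair."""
    return [(key, combine_fn([value for _, value in group]))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]

def mean(values):
    return sum(values) / len(values)

spill = [("AAA", 1), ("AAA", 1), ("BBB", 1)]
print(combine(spill, sum))  # [('AAA', 2), ('BBB', 1)]

# Values for one key as seen by two different map tasks.
map_a, map_b = [1, 2, 3], [10]
# Safe: combining partial results gives the same answer as reducing everything at once.
print(sum([sum(map_a), sum(map_b)]) == sum(map_a + map_b))    # True
print(max([max(map_a), max(map_b)]) == max(map_a + map_b))    # True
# Unsafe: the mean of partial means is not the overall mean.
print(mean([mean(map_a), mean(map_b)]), mean(map_a + map_b))  # 6.0 4.0
```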
4. Each spill produces one spill file on disk. If the map output is really large, several spills occur and the disk ends up holding several spill files. When the map task finally completes, whatever data remains in the memory buffer is also spilled to disk as a spill file. So there is at least one spill file on disk in the end (if the map output is very small, only one is produced, when map execution completes). Because there is to be only one final file, the spill files have to be merged together, a process called merge. What does merge do? Continuing the earlier example, "AAA" has the value 5 in one spill file and 8 in another; because the key is the same, the values have to be merged into a group. What is a group? For "AAA" it looks like this: {"AAA", [5, 8, 2, ...]}, where the values in the array are read from different spill files and are then added together. Note that because merge combines multiple spill files into one, the same key may still appear more than once; in that case, if the client has set a Combiner, it is also used to merge identical keys.
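The grouping described in this step can be sketched with a k-way merge over key-sorted spill files. The function merge_spills is hypothetical, Python lists stand in for files on disk, and partitions are left out to keep the example short.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_spills(spills, combine_fn=None):
    """Merge key-sorted spill lists into one key-sorted list.

    Values of equal keys are grouped, e.g. ("AAA", [5, 8, 2]); if combine_fn
    is given, each group is reduced to a single value.
    """
    merged = heapq.merge(*spills, key=itemgetter(0))
    result = []
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        result.append((key, combine_fn(values) if combine_fn else values))
    return result

spills = [[("AAA", 5), ("BBB", 1)], [("AAA", 8)], [("AAA", 2), ("CCC", 4)]]
print(merge_spills(spills))       # [('AAA', [5, 8, 2]), ('BBB', [1]), ('CCC', [4])]
print(merge_spills(spills, sum))  # [('AAA', 15), ('BBB', 1), ('CCC', 4)]
```

Since every spill file is already sorted, the merge only ever needs to look at the current head of each file, so it never has to hold all the data in memory at once.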
At this point, all the work on the map side is done, and the final file is stored in a local directory that the TaskTracker can reach. Each reduce task keeps asking the JobTracker over RPC whether map tasks have completed; once a reduce task is notified that a map task on some TaskTracker has finished, the second half of the shuffle process begins.
Put simply, the job of the reduce task is to keep pulling the final output of every map task in the current job, keep merging the data pulled from different places, and eventually form a single file that serves as the reduce task's input. See the figure:
As with the map side, the reduce-side shuffle can be summarized by the three points marked in the figure. Before a reduce task copies data, it has to learn from the JobTracker which map tasks have finished; I will not go into that process here, and interested readers can follow it up themselves. Until the Reducer actually runs, all of its time is spent pulling data and merging, over and over. As before, I describe the reduce-side shuffle in stages:
1. Copy phase: simply pulling the data. The reduce process starts a number of data copy threads (Fetcher), which request the map task's output file over HTTP from the TaskTracker where the map task ran. Because the map task has already finished, these files are managed by the TaskTracker on its local disk.
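The copy phase can be imitated with a small thread pool. Everything in this sketch is a stand-in: MAP_OUTPUTS replaces the files held by the tasktrackers, fetch replaces an HTTP request, and the names and data are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the map outputs held by each tasktracker:
# map task id -> {partition number -> key-sorted (key, value) pairs}.
MAP_OUTPUTS = {
    "map_0": {0: [("AAA", 2)], 1: [("BBB", 1)]},
    "map_1": {0: [("AAA", 3)], 1: []},
    "map_2": {0: [], 1: [("BBB", 4)]},
}

def fetch(map_id, partition):
    """Stand-in for one HTTP request for a single partition of one map output."""
    return MAP_OUTPUTS[map_id][partition]

def copy_phase(completed_maps, partition, num_fetchers=2):
    """Fetch this reduce task's partition from every completed map task."""
    with ThreadPoolExecutor(max_workers=num_fetchers) as pool:
        # Executor.map returns results in the order of completed_maps.
        return list(pool.map(lambda m: fetch(m, partition), completed_maps))

segments = copy_phase(["map_0", "map_1", "map_2"], partition=0)
print(segments)  # [[('AAA', 2)], [('AAA', 3)], []]
```

Each reduce task asks only for its own partition, which is why the map side keeps the data for each reduce task together.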
2. Merge phase. The merge here is like the merge on the map side, except that the values in the array are copied from different map tasks. The copied data first goes into a memory buffer. The buffer size here is more flexible than on the map side: it is set as a proportion of the JVM heap size, and because the Reducer does not run during the shuffle phase, most of the memory should be given to shuffle. It should be emphasized that merge takes three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is not enabled by default, which can be confusing. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if a Combiner has been set, it is applied here too, and many spill files are produced on disk. The second form of merge keeps running until there is no more data from the map side, and then the third form, the disk-to-disk merge, starts and produces the final file.
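The interplay of the memory-to-disk and disk-to-disk merges can be sketched as below. This is a simplification under stated assumptions: the class ReduceMerger is invented, the threshold counts records rather than bytes, lists stand in for disk files, the memory-to-memory form and the Combiner are left out, and whatever is still in memory at the end is simply flushed before the final merge.

```python
import heapq
from operator import itemgetter

def merge_sorted(segments):
    """Merge key-sorted segments into one key-sorted list."""
    return list(heapq.merge(*segments, key=itemgetter(0)))

class ReduceMerger:
    def __init__(self, memory_limit_records):
        self.memory_limit = memory_limit_records
        self.in_memory = []  # fetched segments still held in memory
        self.on_disk = []    # merged runs standing in for spill files

    def add_segment(self, segment):
        self.in_memory.append(segment)
        if sum(len(s) for s in self.in_memory) >= self.memory_limit:
            self._memory_to_disk()

    def _memory_to_disk(self):
        if self.in_memory:
            self.on_disk.append(merge_sorted(self.in_memory))
            self.in_memory = []

    def finish(self):
        """All map outputs are copied: flush memory, then merge disk to disk."""
        self._memory_to_disk()
        return merge_sorted(self.on_disk)

merger = ReduceMerger(memory_limit_records=3)
for segment in [[("AAA", 2), ("CCC", 1)], [("AAA", 3)], [("BBB", 4)], [("AAA", 1), ("BBB", 1)]]:
    merger.add_segment(segment)
final = merger.finish()
print(len(merger.on_disk))       # 2
print([key for key, _ in final]) # ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'CCC']
```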
3. Reducer input file. After continuous merging, a "final file" is eventually produced. Why the quotation marks? Because this file may be on disk or in memory. We would of course prefer it to be in memory, where it can serve directly as the Reducer's input, but by default it is stored on disk. How to keep this file in memory is something I will cover later, in the performance tuning part. Once the Reducer's input file is determined, the whole shuffle is finally over. The Reducer then executes and writes its results to HDFS.
Detailed shuffle process (reprint)