The shuffle process is the heart of MapReduce, and is sometimes called the place where the magic happens. To understand MapReduce, you have to understand shuffle. I have read a lot of material on it, but each time I came away in a fog: I could not piece together the overall logic, and the more I read the more confused I became. A while ago, at work, I was tuning the performance of MapReduce jobs and had to dig into the code to study how MapReduce operates, and that is what led me to shuffle. Because the material I had read earlier frustrated me, I have tried my best here to explain shuffle clearly, so that everyone who wants to understand how it works can get something out of this. If you have questions or suggestions about this article, please leave a comment below. Thank you!
In everyday use, "shuffle" means to mix things up or put them in random order. Many of you are probably most familiar with the Collections.shuffle(List) method in the Java API, which randomly permutes the elements of the List passed to it. If you don't know what shuffle is in MapReduce, take a look at this picture:
This is the official illustration of the shuffle process. I am fairly sure, though, that you cannot really understand shuffle from this picture alone, because it differs quite a bit from what actually happens and its details are rather muddled. I will describe what really happens in detail later, so for now you only need to know the rough scope of shuffle: it starts at the output of the map side and ends when the data has been delivered to the reduce side. Put another way, shuffle describes the path the data takes from the output of a map task to the input of a reduce task.
In a cluster environment such as Hadoop, most map tasks and reduce tasks run on different nodes, so in many cases a reduce task has to pull map task results from other nodes across the network. If the cluster is running many jobs, the normal execution of tasks consumes a great deal of the cluster's internal network resources. This consumption is normal and we cannot forbid it; what we can do is cut unnecessary consumption as far as possible. Within a node, disk IO also has a considerable impact on job completion time compared with memory. From these basic requirements, what we expect of the shuffle process includes:
(1): Pull the data from the map task side to the reduce side completely.
(2): When pulling data across nodes, consume as little unnecessary bandwidth as possible.
(3): Reduce the impact of disk IO on task execution.
OK, at this point you can stop and think: if you were designing this shuffle process yourself, what would your goals be? As I see it, the main things that can be optimized are reducing the amount of data pulled and using memory instead of disk wherever possible.
My analysis is based on the Hadoop 0.21.0 source code. If it differs from the shuffle process you know, don't hesitate to point it out. I'll use WordCount as the example and assume it has 8 map tasks and 3 reduce tasks. As the figure above shows, the shuffle process spans both the map side and the reduce side, so I will cover it in two parts.
Let's look at the map side first, as shown in this figure:
The figure shows what may happen inside a single map task. Compare it with the left half of the official figure and you will find many inconsistencies. The official figure does not clearly show at which stage partition, sort, and combiner do their work. I drew this diagram to make the whole process clear, from the map task's input data to the point where the map-side output is ready.
I divide the process into four steps. Put simply, each map task has a memory buffer that stores the output of the map() function. When the buffer is almost full, its contents are written to disk as a temporary file. When the whole map task finishes, all the temporary files it produced on disk are merged into the final output file, which then waits for the reduce tasks to pull it. Each of these steps contains further steps and details, which I explain one by one below:
1. When a map task runs, its input data comes from HDFS blocks, although in MapReduce terms a map task reads only a split. The relationship between splits and blocks may be many-to-one; by default it is one-to-one. In the WordCount example, assume the input to the map task is a string such as "AAA".
2. After the mapper runs, its output is a key/value pair: the key is "AAA" and the value is 1. The map side only emits a count of 1 for each word; the result sets are merged later in the reduce task. We already know this job has 3 reduce tasks, so which of them should the current "AAA" go to? That has to be decided now.
MapReduce provides the Partitioner interface, which decides, based on the key or value and the number of reduce tasks, which reduce task should ultimately process the current output pair. By default, the key is hashed and the hash is taken modulo the number of reduce tasks. This default only aims to spread the processing load evenly across the reducers. If you have your own partitioning requirements, you can write a custom Partitioner and set it on the job.
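The default rule described above, hash the key and take it modulo the number of reduce tasks, can be sketched in a few lines. This is only an illustration: the class name HashPartitionSketch is made up, and it assumes plain String keys hashed with String.hashCode(), whereas Hadoop hashes its own serialized key types, so the partition numbers in a real job may differ.

```java
// Sketch of hash partitioning: hash the key, then take it modulo the reducer count.
// Assumption: plain String keys and String.hashCode(); not Hadoop's own key types.
class HashPartitionSketch {

    static int partition(String key, int numReduceTasks) {
        // Clear the sign bit so a negative hash code cannot yield a negative partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3; // the example job in this article has 3 reduce tasks
        for (String word : new String[] {"AAA", "BBB", "CCC"}) {
            System.out.println(word + " -> reducer " + partition(word, reducers));
        }
    }
}
```

Under these assumptions the result is always in the range 0 to numReduceTasks - 1, and the same key always goes to the same reducer, which is the property shuffle relies on.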
In our example, the partitioner returns 0 for "AAA", meaning this pair should be handled by the first reducer. Next, the data is written to the memory buffer, whose job is to collect the map results in batches. Both the key/value pair and the partition result are written to the buffer. Before being written, of course, the key and value are serialized into byte arrays.
The whole memory buffer is a byte array. I have not studied its byte indexes or how it lays out key/value pairs. If any reader has looked into this, please describe it in more detail.
3. The memory buffer has a size limit, 100MB by default. When a map task produces a lot of output, the buffer could overflow, so under certain conditions the data in the buffer has to be written temporarily to disk so that the buffer can be reused. This process of writing data from memory to disk is called spill, which literally means to overflow, so the name is fairly intuitive. The spill is done by a separate thread and does not affect the thread that writes map results into the buffer. Map output should not be blocked while the spill thread runs, so the buffer has a spill ratio, spill.percent. The ratio defaults to 0.8: when the data in the buffer reaches the threshold (buffer size * spill percent = 100MB * 0.8 = 80MB), the spill thread starts, locks those 80MB of memory, and carries out the spill. Meanwhile the map task can keep writing output into the remaining 20MB without interference.
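The threshold arithmetic above can be checked with a small sketch. It is a simplification under stated assumptions: the class SpillThresholdSketch and its fields are hypothetical, a byte counter stands in for the real buffer, and the spill happens inline, whereas the real spill runs in a separate thread while the map keeps writing into the remaining space.

```java
// Sketch of the spill trigger: spill once buffered bytes reach capacity * spillPercent.
// Assumption: a byte counter stands in for the buffer and the spill is done inline.
class SpillThresholdSketch {
    final long capacityBytes;
    final long thresholdBytes;
    long usedBytes = 0;
    int spillCount = 0;

    SpillThresholdSketch(long capacityBytes, double spillPercent) {
        this.capacityBytes = capacityBytes;
        this.thresholdBytes = (long) (capacityBytes * spillPercent);
    }

    // Record one serialized key/value pair of the given size.
    void collect(int recordBytes) {
        usedBytes += recordBytes;
        if (usedBytes >= thresholdBytes) {
            spillCount++;  // stand-in for "sort, then write a spill file"
            usedBytes = 0; // the spilled region becomes reusable
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        SpillThresholdSketch buffer = new SpillThresholdSketch(100 * mb, 0.8);
        System.out.println("spill threshold (MB): " + buffer.thresholdBytes / mb);
        for (int i = 0; i < 200; i++) {
            buffer.collect((int) mb); // pretend every record is 1MB
        }
        System.out.println("spills: " + buffer.spillCount);
    }
}
```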
When the spill thread starts, it sorts the keys in that 80MB of space. Sorting is default behavior in the MapReduce model, and the sort here operates on the serialized bytes.
Pause here for a moment. A map task's output has to go to different reduce sides, and the memory buffer does not merge the data bound for the same reduce side, so that merging must show up in the disk file. The official figure also shows that a spill file written to disk groups together the values destined for each reduce side. So an important detail of the spill process is that when many key/value pairs are bound for the same reduce side, they are concatenated together, which reduces the number of partition-related index records.
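One way to picture the sort and the per-reducer grouping together is to order the buffered records by partition first and by key second, so that everything bound for one reduce side ends up contiguous. The sketch below assumes exactly that ordering; the names SpillSortSketch and Rec are hypothetical, and it compares String keys for readability even though the real sort works on serialized bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: order buffered records by partition, then by key, before spilling.
// Assumption: String keys stand in for the serialized key bytes.
class SpillSortSketch {

    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    static List<Rec> sortForSpill(List<Rec> buffered) {
        List<Rec> sorted = new ArrayList<>(buffered);
        // Records for the same reduce side become contiguous, sorted by key inside it.
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<Rec> buffered = new ArrayList<>();
        buffered.add(new Rec(1, "BBB", 1));
        buffered.add(new Rec(0, "CCC", 1));
        buffered.add(new Rec(1, "AAA", 1));
        buffered.add(new Rec(0, "AAA", 1));
        for (Rec r : sortForSpill(buffered)) {
            System.out.println(r.partition + " " + r.key + " " + r.value);
        }
    }
}
```

With the records laid out this way, a spill file needs only one index entry per partition rather than one per record.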
When the data for each reduce side is merged, some of it may look like this: "AAA"/1, "AAA"/1. For WordCount, which simply counts word occurrences, if the same map task produces many results with the same key, such as "AAA", we should merge their values into one. This process is called reduce, and also combine. In MapReduce terminology, however, reduce refers only to the computation the reduce side performs on data fetched from multiple map tasks. Any other, informal merging of data counts only as combine. In fact, as we all know, MapReduce treats a Combiner as equivalent to a Reducer.
If the client has set a Combiner, now is when it is used. It adds up the values of key/value pairs that share a key, reducing the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it is used several times throughout the model. In which scenarios can a Combiner be used? From this analysis, the Combiner's output is the Reducer's input, and the Combiner must never change the final result. So in my view a Combiner should only be used where the reduce input key/value types are exactly the same as its output key/value types and where applying it does not affect the final result, for example summation or taking a maximum. Use the Combiner with care: used well, it helps job execution efficiency; used badly, it changes the final reduce result.
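For WordCount, the combine step described above amounts to summing the values of pairs that share a key before they are written to disk. A minimal sketch follows, assuming String keys and int counts held in parallel arrays; the class CombinerSketch is illustrative and is not Hadoop's Combiner API.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a WordCount-style combine: add up the values of pairs that share a key.
// Assumption: keys and values arrive as parallel arrays; this is not Hadoop's API.
class CombinerSketch {

    static Map<String, Integer> combine(String[] keys, int[] values) {
        Map<String, Integer> combined = new TreeMap<>(); // keeps keys in sorted order
        for (int i = 0; i < keys.length; i++) {
            combined.merge(keys[i], values[i], Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] keys = {"AAA", "BBB", "AAA"};
        int[] values = {1, 1, 1};
        // Three pairs go in; two pairs come out, so less data is spilled.
        System.out.println(combine(keys, values));
    }
}
```

Summation is safe here because adding partial sums gives the same total as adding the original values. An average computed the same way would not be, which is the kind of case the caution above is about.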
4. Each spill produces one spill file on disk. If the map output is really large and several spills occur, there will be a corresponding number of spill files on disk. When the map task actually completes, whatever remains in the memory buffer is also spilled to disk as a spill file. So in the end there is at least one spill file on disk (if the map output is small, only one is produced, when map execution completes). Because there can be only one final output file, the spill files need to be merged together, and this process is called merge. What does merge do? Continuing the earlier example, suppose the value of "AAA" read from one spill file is 5 and the value read from another is 8. Because they share the same key, they are merged into a group. What is a group? For "AAA" it looks like this: {"AAA", [5, 8, 2, ...]}, where the values in the array were read from different spill files and are then added up. Note that because merge combines multiple spill files into one file, the same key may appear more than once; in that case, if the client has set a Combiner, it is also used here to merge entries with the same key.
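The grouping described above, {"AAA", [5, 8, 2, ...]} followed by adding the values up, can be sketched as follows. This is a simplified picture: each spill file is modeled as an in-memory sorted map, the names SpillMergeSketch, group, and sum are hypothetical, and the real merge streams sorted files from disk instead of loading them whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of merging spill files: group the values of each key, then add them up.
// Assumption: every spill file is modeled as a small in-memory sorted map.
class SpillMergeSketch {

    // Build {"AAA", [5, 8, 2]}-style groups from several spill files.
    static Map<String, List<Integer>> group(List<Map<String, Integer>> spills) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map<String, Integer> spill : spills) {
            for (Map.Entry<String, Integer> e : spill.entrySet()) {
                groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return groups;
    }

    // The combine step applied during merge: one total per key.
    static Map<String, Integer> sum(Map<String, List<Integer>> groups) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            totals.put(e.getKey(), total);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> spills = new ArrayList<>();
        for (int count : new int[] {5, 8, 2}) {
            Map<String, Integer> spill = new TreeMap<>();
            spill.put("AAA", count);
            spills.add(spill);
        }
        Map<String, List<Integer>> groups = group(spills);
        System.out.println(groups);
        System.out.println(sum(groups));
    }
}
```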
At this point all the work on the map side is finished, and the final file is stored in a local directory that the TaskTracker can reach. Each reduce task keeps polling the JobTracker over RPC to find out whether map tasks have completed. Once a reduce task is notified that a map task on some TaskTracker has finished, the second half of the shuffle process begins.
Put simply, the work of a reduce task before the reduce itself runs is to keep pulling the final output of every map task in the current job, keep merging the data pulled from different places, and eventually form one file as the input to the reduce task. See the figure:
As with the map-side details, the shuffle process on the reduce side can be summarized in the three points marked in the figure. A reduce task can copy data only after it has learned from the JobTracker which map tasks have finished. Until the reducer actually runs, it spends all its time pulling data and merging, over and over. As before, I describe the reduce-side shuffle details step by step.
1. Copy phase: simply pulling the data. The reduce process starts several data copy threads (Fetcher), which request the map task output files over HTTP from the TaskTracker where each map task ran. Because the map task has already finished, these files are managed by the TaskTracker on its local disk.
2. Merge phase. The merge here is like the merge on the map side, except that the data being merged was copied from different map sides. Copied data first goes into a memory buffer. The buffer size here is more flexible than on the map side: it is based on the JVM heap size setting, and because the reducer does not run during the shuffle phase, most of the memory should be given to shuffle. It is worth emphasizing that merge comes in three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is not enabled by default, which can be confusing. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. Like the map side, this is also a spill process; if you have set a Combiner, it is applied here too, and many spill files are then produced on disk. The memory-to-disk merge keeps running until there is no more map-side data to fetch, and then the third form, the disk-to-disk merge, starts and produces the final file.
3. Reducer input file. After repeated merging, a "final file" is eventually produced. Why the quotation marks? Because this file may live on disk or in memory. We would of course prefer it to stay in memory and serve directly as the reducer's input, but by default it is stored on disk. How to keep this file in memory is something I will cover when I discuss performance tuning. Once the reducer's input file is settled, the whole shuffle is finally over. Then the reducer runs and writes its results to HDFS.
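The merge stage described above combines several already-sorted inputs into one sorted output. A common way to do that is a k-way merge driven by a priority queue, sketched below. This shows the general technique and is not a transcription of Hadoop's merger: the names SortedRunMergeSketch and Cursor are hypothetical, and in-memory lists of keys stand in for fetched map outputs or on-disk files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a k-way merge of sorted runs using a priority queue.
// Assumption: each run is an in-memory sorted list of keys, not a real file.
class SortedRunMergeSketch {

    // Read position inside one sorted run.
    static final class Cursor {
        final List<String> run;
        int pos = 0;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    static List<String> merge(List<List<String>> runs) {
        // The heap always exposes the run whose current key is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // re-insert so the heap sees the run's next key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = new ArrayList<>();
        runs.add(Arrays.asList("AAA", "CCC"));
        runs.add(Arrays.asList("AAA", "BBB"));
        runs.add(Arrays.asList("DDD"));
        System.out.println(merge(runs));
    }
}
```

Because every input is already sorted, the merge only ever looks at the head of each run, which is why it can work on files far larger than memory.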
That is the whole shuffle process. There are many details and I have skipped a lot of them, trying only to make the key points clear. I may well have misunderstood or poorly expressed some things, so please do not hesitate to correct me. I hope to keep improving and revising this article so that it is easy to follow and gives readers a picture of every aspect of shuffle. As for the specific implementation, if you are interested you can explore it yourself; if that is not convenient, leave me a comment and I will look into it and reply.
Detailed description of the MapReduce shuffle process