The shuffle process is the core of MapReduce; it has even been called the place where the miracle happens. To understand MapReduce, you must understand shuffle. I have read a lot of material on it, but every time I did, I found it hard to piece together the overall logic and only got more confused. A while ago, while tuning the performance of MapReduce jobs, I had to dig into the code to study how MapReduce actually runs, and that is when I finally sorted out shuffle. Remembering how frustrated I was when I read material I could not follow, I have tried here to explain shuffle as plainly as possible, so that anyone who wants to know how it works can take something away. If you spot any problems in this article or have suggestions, please leave a comment at the end. Thank you!
Shuffle normally means to mix or rearrange. You may be more familiar with Collections.shuffle(List) in the Java API, which randomly rearranges the elements of the list passed to it. If you do not know what shuffle means in MapReduce, take a look at this figure:
This is the official description of the shuffle process. I am fairly sure, however, that this image alone will not help you understand shuffle, because it differs quite a bit from the facts and the details are out of order. I will describe what actually happens in detail later; for now you only need to know the rough scope of shuffle: how to transfer the output of the map tasks to the reduce side effectively. Shuffle describes the whole process from the output of a map task to the input of a reduce task.
In a cluster environment such as Hadoop, most map tasks and reduce tasks run on different nodes, so in many cases a reduce task has to pull map output across the network from other nodes. If the cluster is running many jobs, the normal execution of tasks can consume a lot of network resources inside the cluster. This kind of network consumption is normal; we cannot eliminate it, but we can minimize unnecessary consumption. In addition, within a node, disk I/O has a considerable impact on job completion time compared with memory. From these basic requirements, our expectations for the shuffle process are:
- Pull data completely from the map task side to the reduce side.
- When pulling data across nodes, minimize unnecessary bandwidth consumption.
- Reduce the impact of disk I/O on task execution.
OK, at this point you can stop and think: if you were designing this shuffle process yourself, what would your design goals be? What I would want to optimize is mainly to reduce the amount of data pulled and to use memory rather than disk as much as possible.
My analysis is based on the source code of Hadoop 0.21.0; if it differs from the shuffle process as you know it, please do not hesitate to point that out. I will take WordCount as an example and assume it has eight map tasks and three reduce tasks. Since the shuffle process spans both the map side and the reduce side, I will cover it in two parts below.
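To ground the running example, here is a minimal WordCount driver sketch against the org.apache.hadoop.mapreduce API. It is not taken from the original article; the class names WordCountMapper and WordCountReducer are placeholders for classes sketched further below, and the only detail that matters for this discussion is that the number of reduce tasks is set to 3 explicitly, while the number of map tasks (8 here) follows from the number of input splits.

```java
// Minimal WordCount driver sketch (illustrative; class names are placeholders).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched later in the article
        job.setReducerClass(WordCountReducer.class);   // sketched later in the article
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(3);                      // three reduce tasks, as assumed above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```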
First, let's look at what happens on the map side, as shown in the figure below:
The figure may represent the running state of a single map task. Comparing it with the left half of the official image, you will find many inconsistencies. The official chart does not clearly explain at which stage the partition, sort, and combiner act. I drew this picture in the hope of giving you a clear view of the entire process from map data input to the map-side data being ready.
I have divided the entire flow into four steps. Put simply: each map task has a memory buffer that stores the map's output; when the buffer is nearly full, its data needs to be written to disk as a temporary file; after the whole map task has finished, all the temporary files it produced on disk are merged into the final official output file, which then waits for the reduce tasks to come and pull data.
Of course, each step here may contain multiple sub-steps and details. I will explain them one by one:
1. When a map task is executed, its input data comes from an HDFS block. Of course, in MapReduce terms, a map task reads a split. The relationship between splits and blocks may be many-to-one; the default is one-to-one. In the WordCount example, assume that the map's input data are strings like "aaa".
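As a minimal sketch (not code from the article) of the WordCount mapper used in this example: each map task processes one split, reads it record by record, and for a word like "aaa" emits the pair ("aaa", 1).

```java
// Minimal WordCount mapper sketch: emits (word, 1) for every token in a line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("aaa", 1)
        }
    }
}
```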
2. After the mapper runs, we know that its output is a key/value pair like this: the key is "aaa" and the value is 1. The map side only does the "add 1" operation; the result set is merged later in the reduce task. We also know that this job has three reduce tasks, so which reducer should the current "aaa" be handed to? That has to be decided now.
MapReduce provides the Partitioner interface, whose job is to decide, based on the key, the value, and the number of reduce tasks, which reduce task should process the current output pair. By default, the key is hashed and taken modulo the number of reduce tasks. The default modulo method is only there to spread the load evenly across the reducers; if you have your own requirements for partitioning, you can write a custom Partitioner and set it on the job.
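As a rough illustration of the default behavior just described (hash of the key, modulo the number of reduce tasks), here is a partitioner sketch that mirrors Hadoop's HashPartitioner; the class name is illustrative, and a custom implementation would be registered with job.setPartitionerClass(...).

```java
// Sketch of hash-mod partitioning, mirroring the default HashPartitioner behavior.
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```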
In our example, "aaa" returns 0 after passing through the Partitioner, which means this pair should be handled by the first reducer. Next, the data needs to be written into the memory buffer. The buffer's purpose is to collect map results in batches and reduce the impact of disk I/O. Both the key/value pair and the partition result are written into the buffer. Before being written, the key and value are serialized into byte arrays.
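A small sketch, under the assumption that the key and value are standard Writable types, of what "serialized into byte arrays" means: each Writable writes itself into a byte buffer through its write(DataOutput) method. This illustrates the idea only; it is not Hadoop's actual map output buffer code.

```java
// Illustration of Writable serialization: key and value become bytes in a buffer.
import java.io.IOException;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class SerializeSketch {
    public static void main(String[] args) throws IOException {
        Text key = new Text("aaa");
        IntWritable value = new IntWritable(1);

        DataOutputBuffer out = new DataOutputBuffer();
        key.write(out);     // key -> bytes
        value.write(out);   // value -> bytes, appended after the key

        System.out.println("serialized length = " + out.getLength());
    }
}
```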
The entire memory buffer is a byte array. I have not studied its byte indexing or its key/value storage structure in detail; if any reader has, please share a rough description of it.
3. The memory buffer has a size limit, 100 MB by default. When the map task produces a lot of output, memory could overflow, so under certain conditions the data in the buffer must be written temporarily to disk and the buffer reused. This process of writing data from memory to disk is called spill, which translates literally as "overflow write", an intuitive name. Spilling is done by a separate thread and does not block the thread that writes map results into the buffer. Since the spill thread must not stop the map's output when it starts, the buffer has a spill ratio, spill.percent, which defaults to 0.8. That is, when the data in the buffer reaches the threshold (buffer size * spill percent = 100 MB * 0.8 = 80 MB), the spill thread starts, locks those 80 MB of memory, and performs the spill, while the map task keeps writing its output into the remaining 20 MB; the two do not interfere with each other.
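For reference, a hedged sketch of how these two knobs can be adjusted on a job. The property names io.sort.mb and io.sort.spill.percent are the legacy names used around the 0.2x releases discussed here; later versions renamed them, so treat the names as assumptions to verify against your Hadoop version.

```java
// Hedged sketch of the spill-related settings; property names are legacy 0.2x names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillSettings {
    public static Job configure(Job job) {
        Configuration conf = job.getConfiguration();
        conf.setInt("io.sort.mb", 200);                // in-memory buffer: 200 MB instead of the default 100 MB
        conf.setFloat("io.sort.spill.percent", 0.80f); // start spilling when the buffer is 80% full
        return job;
    }
}
```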
When the spill thread starts, it sorts the keys within those 80 MB of space (sort). Sorting is the default behavior of the MapReduce framework, and the sorting here operates on the serialized bytes.
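Sorting on serialized bytes is done through a RawComparator for the key type, and a different ordering can be plugged in with job.setSortComparatorClass(...). Purely as an illustrative assumption, here is a minimal comparator that reverses the natural order of Text keys:

```java
// Sketch of a custom sort comparator for the map-side sort (illustrative only).
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ReverseTextComparator extends WritableComparator {
    public ReverseTextComparator() {
        super(Text.class, true);   // true: create key instances so they can be deserialized and compared
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -((Text) a).compareTo((Text) b);   // invert the natural order
    }
}
// Registered on the job with: job.setSortComparatorClass(ReverseTextComparator.class);
```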
It is worth pausing here to think: the output of a map task has to be sent to different reducers, and the memory buffer does not merge data destined for the same reducer, so that merging has to show up in the disk files. The official figure also shows that the spill files written to disk are organized by their target reducers. So an important detail of the spill process is that if many key/value pairs need to be sent to the same reducer, those pairs are concatenated together, which cuts down the partition-related index records.
While concatenating the data for each reducer, some of the data may look like "aaa"/1, "aaa"/1. In the WordCount example we simply count how many times each word appears. If the result of a single map task contains many keys that, like "aaa", occur multiple times, we should merge their values together. This process is sometimes also called reduce, but in MapReduce terminology reduce refers only to the stage in which the reduce side fetches data from multiple map tasks and computes on it. Outside of that, merging data can only be called combining in an informal sense. In fact, as you may know, MapReduce treats the Combiner as equivalent to a Reducer.
If the client has set a Combiner, now is the time to use it: add up the values of the key/value pairs that share the same key to reduce the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it is applied multiple times throughout the model. In which scenarios can a Combiner be used? From this analysis, the Combiner's output becomes the Reducer's input, and the Combiner must never change the final computation result. So in my view, a Combiner should only be used when the reduce operation's input and output key/value types are exactly the same and the final result is unaffected, for example summation or taking a maximum. The Combiner must be used with care: used well, it improves job execution efficiency; used badly, it can affect the final result of the reduce.
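As an illustration of a safe combiner for this example (not code from the article), the WordCount combiner can simply be a reducer that sums counts. Because summation is associative and commutative and the input and output types match, running it zero, one, or many times does not change the final result.

```java
// WordCount combiner sketch: sums counts per key on the map side.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();                    // e.g. ("aaa", 1), ("aaa", 1) -> ("aaa", 2)
        }
        context.write(word, new IntWritable(sum));
    }
}
// Registered on the job with: job.setCombinerClass(WordCountCombiner.class);
```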
4. Each spill generates a spill file on disk. If the map's output is really large, there will be several spills and several spill files on disk. When the map task finishes, the remaining data in the memory buffer is also written to disk as a spill file, so the disk ends up with at least one such file (if the map's output is very small, only one spill file is generated by the time the map finishes). Because the final result must be a single file, these spill files need to be merged together; this process is called merge. What does merge look like? As in the earlier example, suppose "aaa" has a value of 5 when read from one map task and a value of 8 when read from another. Because they have the same key, merge has to combine them into a group. What is a group? For "aaa" it looks like this: {"aaa", [5, 8, 2, …]}, where the values in the array are read from different spill files, and these values are then added together. Note that because merge combines multiple spill files into one file, the same key may appear across them; if the client has set a Combiner, it is applied during this process as well to merge identical keys.
At this point, all the work on the map side is done, and the final file it generated sits in a local directory that the TaskTracker can access. Each reduce task continuously asks the JobTracker via RPC whether map tasks have completed; once a reduce task is notified that the map task on some TaskTracker has finished, the second half of shuffle begins.
Put simply, the work of a reduce task before it actually runs is to keep pulling the final results of every map task of the current job, repeatedly merge the data pulled from different places, and finally form one file that becomes the reduce task's input. See the figure:
Like the detailed map-side figure, the shuffle process on the reduce side can be summarized by the three points marked in the figure. The precondition for the reduce side to copy data is that it learns from the JobTracker which map tasks have finished; I will not go into that process here, but you can look into it if you are interested. Before the reducer actually runs, all of its time is spent pulling data and merging it, over and over again. As before, I will describe the shuffle details of the reduce side in numbered segments:
1. The copy phase: simply pulling data. The reduce process starts some data-copy threads (fetchers) that request, over HTTP, the map tasks' output files from the TaskTrackers on which those map tasks ran. Since the map tasks have already finished, these files are managed by the TaskTracker on its local disk.
2. The merge phase. The merge action here is like the merge on the map side, except that the values stored in the arrays come from copies from different map tasks. The copied data is first put into a memory buffer, whose size is more flexible than on the map side: it is based on the JVM heap size, because the reducer does not run during the shuffle stage, so the vast majority of the memory should be given to shuffle. Merge comes in three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is disabled by default, which is a bit confusing, right? When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if a Combiner has been set, it is applied here too, and many spill files are generated on disk. This second form of merge keeps running until there is no more data coming from the map side, and then the third form, the disk-to-disk merge, starts to produce the final file.
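A hedged sketch of the reduce-side knobs hinted at above: the number of copy threads and the thresholds that control when the memory-to-disk merge starts. The property names are the legacy ones from the 0.2x line and the values shown are only commonly cited defaults (apart from the raised copy-thread count), so verify them against your version before relying on them.

```java
// Hedged sketch of reduce-side shuffle settings; property names are legacy 0.2x names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceShuffleSettings {
    public static void configure(Job job) {
        Configuration conf = job.getConfiguration();
        conf.setInt("mapred.reduce.parallel.copies", 10);                 // number of copy (fetcher) threads
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);  // share of reducer heap used to hold copied map output
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);         // buffer fill level that triggers the memory-to-disk merge
    }
}
```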
3. The reducer's input file. After continuous merging, a "final file" is produced. Why the quotation marks? Because this file may sit on disk or in memory. For us, of course, we would like it to stay in memory and serve directly as the reducer's input, but by default the file is on disk. As for how to make this file stay in memory, I will talk about that later when discussing performance tuning. Once the reducer's input file has been settled, the entire shuffle is finished. Then the reducer runs and puts its result on HDFS.
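To close the loop on the running example, here is a minimal reducer sketch (again an illustration, not the article's code): for each key it receives the merged group, such as {"aaa", [5, 8, 2, …]}, sums the values, and writes the result, which the output format then stores on HDFS.

```java
// WordCount reducer sketch: sums the grouped values for each word.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();                         // e.g. 5 + 8 + 2 + ...
        }
        context.write(word, new IntWritable(total));  // written to HDFS by the output format
    }
}
```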
That is the entire shuffle process. I have skipped many details and only tried to make the main points clear. Of course, there may be plenty of problems in my understanding or in how I have expressed things; I hope to keep improving and revising this article so that it is easy to understand and so that, after reading it, you have a grasp of every part of shuffle. As for the concrete implementation details, you can explore them yourself if you are interested; if that is inconvenient, leave me a message and I will look into them and report back.
Original address: http://langyu.iteye.com/blog/992916
MapReduce: The Shuffle Process in Detail