Job Flow: Shuffle detailed

Source: Internet
Author: User
Tags: shuffle

This article continues the job-flow series (following the Mapper class analysis). MapReduce guarantees that every reducer's input is sorted by key; the process that moves data from the map output to the reducer input is called the shuffle.

Map Side

1). Spill (overflow write). Each map() call writes its output into a circular in-memory buffer (100 MB by default, mapreduce.task.io.sort.mb). When the buffer reaches the spill threshold of 0.8 (mapreduce.map.sort.spill.percent), a background thread starts spilling the buffered data to a directory on the local disk (mapreduce.cluster.local.dir). The spill thread first locks the filled 80 MB and performs the spill-related processing before writing; meanwhile, the map output continues to be written into the remaining 20 MB, so the two do not interfere. If the buffer fills up while a spill is in progress, the map output blocks until the spill completes.
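The spill trigger can be sketched with a toy model (sizes in MB; the real buffer is a byte-level circular buffer inside Hadoop's MapTask, and the real spill runs in a background thread, so this only illustrates the 80% rule):

```java
import java.util.*;

// Toy model of the map-side spill trigger. The class name and the
// synchronous "spill" are illustrative, not Hadoop internals.
public class SpillBuffer {
    static final int CAPACITY = 100;          // mapreduce.task.io.sort.mb (MB)
    static final double SPILL_PERCENT = 0.8;  // mapreduce.map.sort.spill.percent

    int used = 0;                             // MB currently buffered
    final List<Integer> spills = new ArrayList<>(); // sizes of spill files written

    void write(int mb) {
        used += mb;
        if (used >= CAPACITY * SPILL_PERCENT) {
            // In Hadoop a background thread locks the filled region and writes
            // it to disk while the map keeps writing; here we drain synchronously.
            spills.add(used);
            used = 0;
        }
    }

    public static void main(String[] args) {
        SpillBuffer buf = new SpillBuffer();
        for (int i = 0; i < 10; i++) buf.write(30); // 300 MB of map output
        System.out.println(buf.spills.size() + " spill files, "
                + buf.used + " MB still buffered");
    }
}
```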

2). Partition and sort. Before the spill thread writes to disk, it performs the related operations: first, the in-memory data is divided into partitions, one per reducer that the map output will ultimately be sent to; then, within each partition, the records are sorted by key. If a combiner is configured, it runs on the sorted output. Only after these steps does the spill thread begin writing to disk.

Note: compressing the map output as it is written to disk not only speeds up the disk write and saves disk space, but also reduces the amount of data transferred to the reduce side. Compression is off by default; to enable it, set mapreduce.map.output.compress to true.

The system default is HashPartitioner: it simply hashes the key and takes the result modulo the number of reduce tasks, so which reducer a given key goes to is fixed in advance. As a result, the data within a single reducer is sorted, but the data across reducers is not globally ordered. To get a totally ordered result: ① use only one reducer, or ② use TotalOrderPartitioner.
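The default partitioning rule is small enough to sketch directly (this mirrors the hash-and-modulo rule described above; the class and method names here are illustrative, not Hadoop's own):

```java
// Minimal sketch of hash partitioning: hash the key, mask the sign bit so a
// negative hashCode still yields a valid index, then take modulo numReduceTasks.
public class HashPartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 3));
        }
    }
}
```

Because the mapping depends only on the key and the reducer count, the same key always lands on the same reducer, which is exactly why per-reducer output is sorted but the global order across reducers is arbitrary.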

3). Merge. Each spill produces a new spill file, so as the map keeps writing results to disk, a series of roughly 80 MB spill files accumulates. Before the map phase completes, all spill files are merged into a single partitioned and sorted map output file; this merge sorts at the byte-stream level. The mapreduce.task.io.sort.factor property controls the maximum number of spill files merged in one pass (default 10). If a combiner is configured, it also runs on the merged file.

Note: the merge never compares keys across different partitions; only keys within the same partition are sorted and merged.

Merge algorithm: the key/value pairs within each spill file are sorted, but the files are unordered relative to each other, so this is the classic multi-way merge of multiple sorted files. First, the smallest key/value of each spill file to be merged is placed into an in-memory min-heap; then the minimum is repeatedly extracted from the heap and written to the merged output file, with the next record from that spill file pushed onto the heap. This is very similar to the scan algorithm in HBase.

4). Map-side summary:

  1. The partition of each map output record is determined before it is written to the memory buffer. You can implement a custom partitioner by extending the Partitioner class, routing the records you want grouped together to the same reducer.
  2. The map continues to produce output during a spill, so tuning the buffer-related parameters is one of the focal points of MapReduce tuning.
  3. Sorting is the default behavior of MapReduce. The in-memory sort compares structured objects via their compareTo() method; the merge-stage sort operates on serialized byte arrays and calls the comparator's compare() method.
  4. Combining takes place during both the spill and merge phases. The combiner pre-aggregates the map results by key, reducing the data transferred between map and reduce. Note, however, that not every operation is combiner-safe; computing averages is the classic counterexample.
  5. If the combiner already implements the reduce() logic, why does the reducer phase still run reduce()? Answer: the combiner only processes the intermediate map results on its own node, whereas the reducer collects the map results from every node and then processes them as a whole.
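Point 4's caveat about averages can be made concrete (the numbers here are illustrative): averaging per-mapper averages in a combiner loses the counts, so the result differs from the true average.

```java
import java.util.*;

// Demonstrates why a naive averaging combiner is incorrect, and the standard
// fix: combine (sum, count) pairs and divide only once, in the reducer.
public class AvgCombinerPitfall {
    static double avg(List<Double> xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.size();
    }

    public static void main(String[] args) {
        List<Double> mapper1 = Arrays.asList(1.0, 2.0, 3.0); // local avg = 2.0
        List<Double> mapper2 = Arrays.asList(10.0);          // local avg = 10.0

        // Wrong: averaging the per-mapper averages ignores how many values
        // each mapper saw.
        double wrong = avg(Arrays.asList(avg(mapper1), avg(mapper2)));

        // Right: carry (sum, count) through the combiner, divide at the end.
        double right = (1.0 + 2.0 + 3.0 + 10.0) / 4;

        System.out.println("wrong=" + wrong + " right=" + right);
    }
}
```

Operations that are safe in a combiner are the associative, commutative ones such as sum, max, and min.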

How does reduce know which NodeManager to fetch each map output from?

a). When a map task completes successfully, it reports its status update to the MR ApplicationMaster (MR-AM) through the heartbeat mechanism, so the MR-AM for a given job knows the mapping between map tasks and their output locations. On the reduce side, a thread periodically asks the MR-AM for map output locations until it has obtained the locations of all map outputs.

b). Because a reducer may fail, the MR-AM does not delete a map output from disk as soon as the first reducer has retrieved it. Instead, it waits until the entire MR job completes before deleting the map outputs.

Reduce Side

5). HTTP requests. The map output file is stored on the local disk of the NodeManager node that ran the map task. Reducers copy map intermediate results from each NodeManager over HTTP, and each NodeManager serves these requests through a Jetty server, so you can tune the Jetty server's worker-thread count (mapreduce.tasktracker.http.threads, default 40). This setting applies to the whole node, not to each individual map task. On large clusters running large jobs, it can be raised as needed.

6). Copy phase. The reduce task now needs its partition's data, which is spread across the intermediate results of many map tasks on the cluster. Since map tasks finish at different times, the reduce task starts copying a map task's output as soon as that task completes; this is the copy phase. By default a reduce task uses 5 threads to copy data from the map side, controlled by mapreduce.reduce.shuffle.parallelcopies.

7). Sort stage. Map results are first copied into the reduce node's memory buffer (sized by mapreduce.reduce.shuffle.input.buffer.percent, default 0.70); when the buffer reaches its threshold (mapreduce.reduce.shuffle.merge.percent, default 0.66), the contents are merged and spilled to local disk. As the number of files on disk grows, the reduce task enters the sort phase, more accurately called the merge phase, since the sorting was already done on the map side and this phase merely merges the map outputs while preserving their order. The merge proceeds in rounds. For example, with 50 map outputs and a merge factor of 10 (mapreduce.task.io.sort.factor, default 10, the same property used in the map-side merge), the merge takes 5 rounds; each round merges 10 files into one, leaving 5 intermediate files at the end.
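Under the simplified model in the text, where each round merges exactly `factor` files into one, the number of intermediate files is just a ceiling division (Hadoop's real merge planner is slightly more subtle about the first round; this sketch mirrors only the text's model):

```java
// Merge-round arithmetic for the simplified model: files / factor, rounded up.
public class MergePasses {
    static int intermediateFiles(int files, int factor) {
        return (files + factor - 1) / factor; // ceil(files / factor)
    }

    public static void main(String[] args) {
        // The text's example: 50 map outputs, merge factor 10.
        System.out.println(intermediateFiles(50, 10) + " intermediate files");
    }
}
```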

Note: to be merged, compressed map outputs must first be decompressed in memory.

8). Reduce execution. In the final phase, the reduce phase, those 5 intermediate files are fed directly into the reduce() function, skipping a final round trip of merging to disk and reading the data back. The final merge can therefore draw from both in-memory and on-disk segments. In the reduce phase, reduce() is called once for each key in the sorted input. Its output is written directly to HDFS, and the local NodeManager node stores the first block replica.
