A Thorough Look at the MapReduce Shuffle Process

Introduction to the shuffle process of MapReduce

To shuffle originally means to mix, as in shuffling cards: taking a set of data with some regularity and turning it, as far as possible, into an irregular set, the more random the better. The shuffle in MapReduce is closer to the inverse of shuffling cards: it converts an irregular set of data into a set of data with a certain order.

Why does the MapReduce computing model need a shuffle process at all? As we know, the MapReduce model consists of two important stages: map, which filters and distributes the data, and reduce, which aggregates and merges it. The input of reduce comes from the output of map; the data must go through shuffle to get from one to the other.

The entire process from map output to reduce input can broadly be called shuffle. Shuffle spans both the map and reduce ends: it includes the spill process on the map side and the copy and sort processes on the reduce side, as shown in the figure:


Spill Process

The spill process includes steps such as collect (output), sort, spill (write-out), and merge, as shown in the figure:

Collect

Each map task continuously outputs its records as key/value pairs into a circular data structure built in memory. A circular structure is used to make more efficient use of memory and to hold as much data in memory as possible.

This data structure is in fact a byte array called Kvbuffer, as the name suggests. It holds not only record data but also some index data; the region holding the index data has the alias Kvmeta, an IntBuffer view (using the platform's native byte order) worn over a slice of the Kvbuffer. The data region and the index region are two adjacent, non-overlapping areas of the Kvbuffer, separated by a dividing point. The dividing point is not fixed; it is updated after every spill. The initial dividing point is 0: record data is stored growing upward, while index data is stored growing downward, as shown in the figure:

Kvbuffer's data-write pointer, bufindex, only ever grows upward. For example, bufindex starts at 0; after an int-typed key is written it grows to 4, and after an int-typed value is written it grows to 8.

The index describes a key/value pair inside the Kvbuffer. It is a quadruple of four ints: the starting position of the value, the starting position of the key, the partition value, and the length of the value. Kvmeta's write pointer, kvindex, jumps down four "slots" each time, and the quadruple is then filled in slot by slot. For example, kvindex starts at -4; after the first pair is written, position (kvindex+0) holds the starting position of the value, (kvindex+1) the starting position of the key, (kvindex+2) the partition value, and (kvindex+3) the length of the value. Kvindex then jumps down to -8; after the second pair and its index are written, kvindex jumps to -12, and so on.
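
The layout just described can be illustrated with a toy sketch (this is not Hadoop's actual code; in the real implementation both regions live inside one byte array, whereas here a dict stands in for the metadata region to keep the idea visible):

```python
# Toy sketch of the Kvbuffer/Kvmeta bookkeeping. Record bytes grow
# upward from 0; each record's index quadruple occupies 4 int "slots"
# written at decreasing kvindex positions.
class KvBufferSketch:
    def __init__(self):
        self.data = bytearray()   # record bytes, growing upward
        self.meta = {}            # kvindex -> [valstart, keystart, partition, vallen]
        self.kvindex = -4         # jumps down 4 slots per record

    def collect(self, partition, key, value):
        keystart = len(self.data)
        self.data += key
        valstart = len(self.data)
        self.data += value
        self.meta[self.kvindex] = [valstart, keystart, partition, len(value)]
        self.kvindex -= 4

buf = KvBufferSketch()
buf.collect(0, b"k1", b"v1")
buf.collect(1, b"key2", b"value2")
# the first quadruple sits at kvindex -4, the second at -8
```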

Although the size of the Kvbuffer can be set by a parameter, it is only so large, and as record data and index entries keep accumulating, the Kvbuffer is bound to run out of room someday. What then? Flush the data from memory to disk, then continue writing new data into memory. The process of writing Kvbuffer data out to disk is called spill, a fitting name: when the data in memory is full, it overflows (spills) to the much larger space on disk.

The trigger condition for a spill, that is, how full the Kvbuffer gets before spilling starts, deserves attention. If the Kvbuffer is used to the last byte, with spilling starting only when not a seam is left, the map task has to wait for the spill to finish and free up space before it can continue writing. If, instead, spilling starts when the Kvbuffer is only full to a certain extent, say 80%, the map task can keep writing data while the spill runs; and if the spill is fast enough, the map may never need to worry about free space. Weighing the two, the latter approach is generally chosen.

This important spill work is carried out by the spill thread, which gets down to business as soon as it receives the "command" from the map task. The job it is handed is called SortAndSpill; so it is not just a spill, a somewhat controversial sort comes before the spill.

Sort

First, the data in the Kvbuffer is sorted in ascending order on two keys, partition value and key, moving only the index data. The result is that the entries in Kvmeta are gathered together by partition, and within each partition are ordered by key.

Spill

The spill thread creates a disk file for this round of spilling: it rotates through all the local directories to find one with enough space, then creates a file there with a name like "spill12.out". The spill thread writes the data into this file partition by partition according to the sorted Kvmeta: once one partition's data has been written out, the next partition follows sequentially, until all partitions have been traversed. The data corresponding to one partition within the file is called a segment.
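
The sort-then-spill flow can be sketched, well outside Hadoop's real code, as sorting the index entries by (partition, key) and then grouping each partition's run into one segment:

```python
from itertools import groupby

# Toy sketch of SortAndSpill: sort index entries on (partition, key),
# then each partition's contiguous run becomes one segment.
index = [
    (1, b"banana", b"v0"),
    (0, b"cherry", b"v1"),
    (1, b"apple", b"v2"),
    (0, b"apple", b"v3"),
]
index.sort(key=lambda e: (e[0], e[1]))  # only the index entries move

segments = {p: [(k, v) for _, k, v in run]
            for p, run in groupby(index, key=lambda e: e[0])}
# segments[0] and segments[1] are each ordered by key
```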

All the partitions' data is placed in this single file. Although it is stored sequentially, how do we directly know where a given partition's data starts within the file? The mighty index makes another appearance. A triple records the index of each partition's data in the file: the starting position, the raw data length, and the compressed data length; one triple per partition. These index triples are kept in memory; if memory runs short, subsequent index data has to be written to a disk file: again rotate through all the local directories to find one with enough space, and create there a file with a name like "spill12.out.index". This file stores not only the index data but also CRC32 checksum data. (spill12.out.index is not necessarily created on disk; if it fits in memory, the default 1 MB of space, it stays there. And even when it is created on disk, it is not necessarily in the same directory as the spill12.out file.)
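
A minimal sketch of how such a triple turns into a direct seek (the numbers are made up; one triple per partition, and the compressed length is what is actually read from the out file):

```python
# partition -> (start, raw length, compressed length) in spillN.out
spill_index = {
    0: (0, 4096, 1500),
    1: (1500, 8192, 3000),
}

def segment_byte_range(partition):
    # Jump straight to a partition's segment without scanning the file.
    start, raw_len, comp_len = spill_index[partition]
    return start, start + comp_len
```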

Each spill produces at least one out file, and sometimes an index file as well; the spill number is stamped into the file names. The correspondence between index files and data files is shown in the following figure:

While the spill thread is in full swing on its SortAndSpill job, the map task does not pause; it carries on outputting data as before. The map keeps writing data into the Kvbuffer, and here comes the problem: if bufindex simply keeps growing upward and kvindex keeps growing downward from where they were, the two will soon meet, and then restarting or shuffling memory around is troublesome and not advisable. So should the pointers carry on from their previous starting positions, or find another way? The map takes the midpoint of the remaining free space in the Kvbuffer and makes that position the new dividing point: the bufindex pointer moves to this dividing point, and kvindex moves to 16 bytes below it (position -16 relative to the dividing point). The two can then harmoniously store data along their established trajectories, and once the spill completes and space is freed, no further adjustment is needed. The shifting of the dividing point is shown in the following figure:
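
The repositioning itself is just arithmetic; a small sketch (names illustrative, not Hadoop's identifiers):

```python
METASIZE = 16  # one quadruple = 4 ints = 16 bytes

def reset_boundary(bufindex, kvmeta_floor):
    # bufindex: next free byte for record data (grows upward)
    # kvmeta_floor: lowest byte already taken by metadata (grows downward)
    # The midpoint of the free gap becomes the new dividing point:
    # data resumes growing up from it, metadata down from one slot below.
    equator = bufindex + (kvmeta_floor - bufindex) // 2
    return equator, equator - METASIZE  # new bufindex, new kvindex
```

For example, with data written up to byte 400 and metadata down to byte 800, the new dividing point is 600, and metadata resumes at 584.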

The map task always writes its output data to disk. Even if the output is small enough to fit entirely in memory, it is still flushed to disk at the end.

Merge

If a map task's output volume is very large, several spills may occur, producing many out files and index files distributed across different disks. Finally, the merge process that combines these files takes the stage.

How does the merge process know where the spill files were produced? It scans all the local directories for the spill files and stores their paths in an array. How does it know the spill index information? Right, the index files are likewise scanned from all the local directories, and the index information is stored in a list. Here is a point worth wondering about: why not simply keep this information in memory during the earlier spill process, why bother with this extra scan? Especially for the spill index data: it was written to disk only once memory ran low, and now it has to be read back from disk into memory again. The reason for this seemingly superfluous step is that by this point the Kvbuffer, that large chunk of memory, is no longer needed and can be reclaimed, so there is memory to spare for this data. (For "tyrants" with plenty of memory, spending memory to save these two I/O steps is worth considering.)

Then a file called file.out and a file called file.out.index are created for the merge process, to store the final output and the final index.

The merge output proceeds partition by partition. For a given partition, all index information corresponding to that partition is queried from the index list, and each corresponding segment is inserted into a segment list. That is, each partition corresponds to a segment list recording, for that partition's data in every spill file, the file name, starting position, length, and so on.

All the segments corresponding to the partition are then merged, with the goal of merging them into a single segment. When a partition has many segments, the merge is done in batches: first a batch of segments is taken from the segment list and placed into a min-heap keyed on key; the smallest key is then repeatedly extracted from the heap and output to a temporary file. This merges one batch of segments into a temporary segment, which is added back into the segment list; then a second batch is taken from the list and merged into another temporary segment, which is added back in turn, over and over, until the remaining segments form a single batch that is output to the final file.
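
The batched min-heap merge can be sketched in a few lines (here `heapq.merge` plays the role of the min-heap, in-memory lists stand in for segment files, and `factor` stands in for the merge-factor parameter):

```python
import heapq

def batched_merge(segments, factor=10):
    # Merge at most `factor` sorted segments at a time into a temporary
    # segment, put it back into the list, and repeat until one final
    # merge pass remains.
    segments = [list(s) for s in segments]
    while len(segments) > factor:
        batch, segments = segments[:factor], segments[factor:]
        segments.append(list(heapq.merge(*batch)))
    return list(heapq.merge(*segments))
```

For instance, `batched_merge([[1, 4], [2, 5], [3, 6], [0, 7]], factor=2)` merges two segments at a time and still produces one globally sorted output.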

The final index data is likewise written out to the index file.


This concludes the shuffle process on the map end.

Copy

The reduce task pulls the data it needs from each map task over HTTP. Each node runs a resident HTTP server, one of whose services is to respond to reduce's requests for map data. When an HTTP request for a map output arrives, the HTTP server reads the portion of the corresponding map output file that belongs to the requesting reduce, and streams it back to the reduce over the network.

When the reduce task pulls the data corresponding to one map, it writes the data straight into memory if it fits there. As reduce pulls data from map after map, each map's data occupies its own piece of memory; when the in-memory map data reaches a certain share of the space, an in-memory merge is started, and the data in memory is merged and written out to a file on disk.

If a map's data cannot fit in memory, it is written directly to disk: a file is created in a local directory, and the data is read from the HTTP stream and written to disk using a 64 KB buffer. Each map's data pulled this way gets its own file; when the number of files reaches a certain threshold, a disk-file merge is started and the files are merged into a single file.
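
The memory-versus-disk decision just described can be sketched roughly as follows (the threshold names are illustrative, not exact Hadoop configuration keys):

```python
def placement(map_output_bytes, shuffle_mem_free, max_single_shuffle):
    # A fetched map output lands in memory only if it fits both the
    # currently free shuffle buffer and the per-fetch size limit;
    # otherwise it is streamed straight to a disk file.
    if map_output_bytes <= min(shuffle_mem_free, max_single_shuffle):
        return "memory"
    return "disk"
```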

Some maps' data is small enough to stay in memory, while other maps' data is large enough to need disk, so the data the reduce task ends up pulling is partly in memory and partly on disk, and finally undergoes a global merge.


Merge Sort

The merge process used here is the same as the one used on the map end. The map output data is already ordered, so the merge amounts to one round of merge sorting; the so-called reduce-side sort is in fact this merge process. In general the reduce side copies and sorts at the same time: the copy and sort phases overlap rather than being completely separate.

The shuffle process on the reduce side ends here.

Original: HTTP://WWW.CSDN.NET/ARTICLE/2014-05-19/2819831-TDW-SHUFFLE/1
