2. Algorithm Ideas
When adapting a traditional serial sorting algorithm into a parallel one, the natural approach is divide and conquer: split the data to be sorted into M blocks (for example, by hashing), have each map task partially sort one block, and then have a single reduce task merge everything into a total order. This design achieves a high degree of parallelism in the map stage, but there is no parallelism at all in the reduce stage.
To improve parallelism in the reduce stage, the TeraSort job refines this algorithm. In the map stage, each map task partitions its data into R blocks (R being the number of reduce tasks) such that every key in block i is smaller than every key in block i+1. In the reduce stage, reduce task i fetches and sorts block i from every map task. As a result, the output of reduce task i is entirely smaller than that of reduce task i+1, so concatenating the outputs of reduce tasks 1 through R yields the final sorted result. This design is clearly more efficient than the first one, but it is harder to implement and must solve two technical difficulties: (1) how does each map task determine the boundaries of its R blocks? (2) Given a record, how can the block it belongs to be determined quickly? The answers are, respectively, sampling and a trie tree.
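The partitioning idea behind the second difficulty can be sketched in a few lines (illustrative code, not the Hadoop source): given R-1 sorted split points, a record's block index can be found by binary search. The trie described in Section 3 only speeds up this same lookup.

```java
import java.util.Arrays;

// Illustrative sketch (not the Hadoop source): with R-1 sorted split
// points, each key maps to one of R blocks via binary search, so that
// every key in block i sorts before every key in block i+1.
public class RangePartitionSketch {

    // Returns the block index (0 .. splitPoints.length) for the key.
    // Keys equal to a split point go into the higher-numbered block.
    public static int partition(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1;
        // the insertion point is exactly the block index we want.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        String[] splits = {"abd", "bcd", "mnk"}; // R = 4 blocks
        System.out.println(partition("abc", splits)); // 0
        System.out.println(partition("abg", splits)); // 1
        System.out.println(partition("mnz", splits)); // 3
    }
}
```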
3. TeraSort Algorithm
3.1 TeraSort Algorithm Flow
The Hadoop TeraSort sorting algorithm consists of three steps: sampling -> map tasks tag the data records -> reduce tasks perform local sorting.
Data sampling is performed on the JobClient. First, some records are extracted from the input data and sorted, then divided into R blocks; the upper and lower bounds of each block (called "split points") are recorded and saved to the distributed cache.
In the map stage, each map task first reads the split points from the distributed cache and builds a trie tree over them (a two-level trie whose leaf nodes store the corresponding reduce task numbers). It then starts processing its data: for each record, it looks up the reduce task number in the trie tree and tags the record with it.
In the reduce stage, each reduce task fetches its corresponding data block from every map task and sorts it locally. Finally, the outputs of the reduce tasks, taken in order of reduce task number, form the final sorted result.
3.2 Key Points of the TeraSort Algorithm
(1) Sampling
Hadoop ships with several data samplers, including IntervalSampler, RandomSampler, and SplitSampler (for details, see org.apache.hadoop.mapred.lib).
Number of records to sample: sampleSize = conf.getLong("terasort.partitions.sample", 100000);
Number of splits to sample from: samples = Math.min(10, splits.length); (splits is the array of all input splits.)
Number of records extracted from each split: recordsPerSample = sampleSize / samples;
The sampled data is fully sorted, and the resulting "split points" are written to the file _partition.lst, which is stored in the distributed cache.
Example: suppose the sampled data is b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk.
After sorting, the order is abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr.
If the number of reduce tasks is 4, the split points are abd, bcd, and mnk.
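The selection of split points from the sorted sample can be sketched as follows (illustrative code, not the Hadoop source; here a split point is taken as the last key of each of the first R-1 equal-sized chunks, a convention that reproduces the example above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not the Hadoop source): sort the sampled keys and
// take R-1 evenly spaced keys as split points.
public class SplitPointSketch {

    public static List<String> pickSplitPoints(List<String> sampled, int numReduces) {
        List<String> sorted = new ArrayList<>(sampled);
        Collections.sort(sorted);
        // Chunk the sorted sample into numReduces pieces of size ceil(n/R)
        // and take the last key of each of the first R-1 chunks.
        int chunk = (sorted.size() + numReduces - 1) / numReduces;
        List<String> splitPoints = new ArrayList<>();
        for (int i = 1; i < numReduces; i++) {
            // Clamp in case the sample is barely larger than R.
            splitPoints.add(sorted.get(Math.min(i * chunk - 1, sorted.size() - 1)));
        }
        return splitPoints;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "b", "abc", "abd", "bcd", "abcd", "efg", "hii", "afd", "rrr", "mnk");
        System.out.println(pickSplitPoints(sample, 4)); // [abd, bcd, mnk]
    }
}
```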
(2) Map tasks tag the data records
Each map task reads the split points from the file _partition.lst and builds a trie tree (assume a two-level trie, i.e., one organized by the first two bytes of each key).
The map task then reads records from its split and, for each record, looks up the corresponding reduce task number in the trie tree. For example, abg falls to the second reduce task and mnz to the fourth.
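Such a two-byte trie can be sketched as below (illustrative only; the actual trie lives in the TeraSort example code, and the per-prefix candidate ranges here are this sketch's own device, as is the assumption of single-byte key characters). The trie narrows a key down to the few split points that share its two-byte prefix, so the final comparison touches at most a handful of split points instead of all R-1:

```java
// Illustrative sketch of a two-level (two-byte) trie for partition lookup
// (not Hadoop's implementation; assumes single-byte characters in keys).
// For every two-byte prefix we precompute the range of split points a key
// with that prefix could fall among; a lookup then compares the key
// against only that handful of split points.
public class TwoByteTrieSketch {
    private final String[] splitPoints; // sorted, R-1 entries
    private final int[] lo = new int[65536];
    private final int[] hi = new int[65536];

    public TwoByteTrieSketch(String[] splitPoints) {
        this.splitPoints = splitPoints;
        for (int p = 0; p < 65536; p++) {
            String prefix = "" + (char) (p >> 8) + (char) (p & 0xFF);
            int l = 0;
            // Split points wholly below the prefix are below every such key.
            while (l < splitPoints.length && splitPoints[l].compareTo(prefix) < 0) l++;
            int h = l;
            // Split points whose own two-byte prefix sorts above ours are
            // above every such key; only prefix-sharing ones stay in range.
            while (h < splitPoints.length && prefixOf(splitPoints[h]).compareTo(prefix) <= 0) h++;
            lo[p] = l;
            hi[p] = h;
        }
    }

    private static String prefixOf(String s) {
        return s.length() >= 2 ? s.substring(0, 2) : s;
    }

    // Keys equal to a split point go into the higher-numbered block.
    public int partition(String key) {
        int a = key.isEmpty() ? 0 : key.charAt(0) & 0xFF;
        int b = key.length() < 2 ? 0 : key.charAt(1) & 0xFF;
        int p = (a << 8) | b;
        int part = lo[p];
        while (part < hi[p] && key.compareTo(splitPoints[part]) >= 0) part++;
        return part;
    }

    public static void main(String[] args) {
        TwoByteTrieSketch trie = new TwoByteTrieSketch(new String[]{"abd", "bcd", "mnk"});
        System.out.println(trie.partition("abg")); // 1 (second reduce task)
        System.out.println(trie.partition("mnz")); // 3 (fourth reduce task)
    }
}
```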
(3) Reduce tasks perform local sorting
Each reduce task sorts its data locally, and the outputs, taken in reduce-task order, form the final result.
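The end-to-end property, that locally sorted partitions concatenated in reduce-task order form a globally sorted list, can be demonstrated in miniature (illustrative code only; the convention that keys equal to a split point go rightward is this sketch's assumption):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Miniature end-to-end demonstration (illustrative only): range-partition
// the records, sort each partition locally (the "reduce" step), and
// concatenate the partitions in order; the result is globally sorted.
public class TeraSortMiniDemo {

    public static List<String> sortViaPartitions(List<String> records, String[] splitPoints) {
        int numReduces = splitPoints.length + 1;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) partitions.add(new ArrayList<>());
        // "Map" step: route each record to its partition by binary search.
        for (String r : records) {
            int idx = Arrays.binarySearch(splitPoints, r);
            partitions.get(idx >= 0 ? idx + 1 : -idx - 1).add(r);
        }
        // "Reduce" step: each partition is sorted independently, then the
        // outputs are concatenated in reduce-task order.
        List<String> out = new ArrayList<>();
        for (List<String> part : partitions) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
            "b", "abc", "abd", "bcd", "abcd", "efg", "hii", "afd", "rrr", "mnk");
        System.out.println(sortViaPartitions(data, new String[]{"abd", "bcd", "mnk"}));
        // [abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr]
    }
}
```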
Original article; when reposting, please note: reposted from Dong's blog.
Link: http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
Author: Dong, http://dongxicheng.org/about/
A collection of this blog's articles: http://dongxicheng.org/recommend/