[Hadoop reprint] TeraSort



1. Overview

Sorting 1 TB of data is commonly used to measure the data processing capability of a distributed data processing framework. TeraSort is Hadoop's sorting job: in 2008, Hadoop took first place in the 1 TB sort benchmark with a time of 209 seconds. How is TeraSort implemented in Hadoop? This article analyzes the TeraSort job from the perspective of algorithm design.

2. Algorithm idea

When we want to turn a traditional serial sorting algorithm into a parallel one, the usual instinct is a divide-and-conquer strategy: split the data to be sorted into M blocks (for example, by hashing), let each map task partially sort one block, and then let a single reduce task merge everything into a total order. This design achieves a high degree of parallelism in the map stage, but there is no parallelism at all in the reduce stage.

To improve the degree of parallelism in the reduce stage, the TeraSort job refines the preceding algorithm. In the map stage, each map task divides its data into R blocks (R is the number of reduce tasks), guaranteeing that every record in the i-th block (i > 0) is smaller than every record in the (i+1)-th block. In the reduce stage, the i-th reduce task processes (sorts) the i-th block of every map task, so that the output of the i-th reduce task is entirely smaller than that of the (i+1)-th; concatenating the sorted outputs of reduce tasks 1 through R produces the final result. This design is clearly more efficient than the first one, but it is harder to implement, because two technical difficulties must be solved: first, how to determine the boundaries of the R data blocks for each map task; second, how to quickly determine which block a given record belongs to. The answers are, respectively, sampling and a trie.
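To make the partitioning idea concrete, here is a minimal sketch (illustrative code, not taken from the Hadoop source; the class name RangePartitionSketch and the use of plain Java Strings are assumptions): given the R - 1 sorted split points, a binary search assigns each key to one of R partitions so that every key in partition i sorts before every key in partition i + 1.

import java.util.Arrays;

public class RangePartitionSketch {
    private final String[] splitPoints; // R - 1 split points, sorted ascending

    public RangePartitionSketch(String[] splitPoints) {
        this.splitPoints = splitPoints;
    }

    // Returns the partition (reduce task index) in [0, R - 1] for a key.
    // A key equal to a split point goes to the higher partition.
    public int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        RangePartitionSketch p =
            new RangePartitionSketch(new String[] {"abd", "bcd", "mnk"});
        System.out.println(p.getPartition("abg")); // 1, i.e. the second reduce task
        System.out.println(p.getPartition("mnz")); // 3, i.e. the fourth reduce task
    }
}

The binary search costs O(log R) string comparisons per record; the trie described below trades a little memory for a lookup that touches only the first two bytes of most keys.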

3. The TeraSort algorithm

3.1 TeraSort algorithm flow

The Hadoop TeraSort algorithm consists of three steps: sampling -> map tasks tag data records -> reduce tasks perform local sorting.

Data sampling is performed on the JobClient. First, some records are extracted from the input data and sorted, then divided into R blocks; the upper and lower bounds of each block (called "split points") are determined and saved to the distributed cache.

In the map stage, each map task first reads the split points from the distributed cache and builds a trie (a two-level trie whose leaf nodes store the numbers of the corresponding reduce tasks). It then starts processing its data: for each record, it looks up the number of the target reduce task in the trie and tags the record with it.

In the reduce stage, each reduce task fetches its corresponding block from every map task and sorts it locally; the outputs of the reduce tasks, taken in order of reduce task number, form the final sorted result.

3.2 Key points of the TeraSort algorithm

(1) Sampling

Hadoop ships with several data samplers, including IntervalSampler, RandomSampler, and SplitSampler (see org.apache.hadoop.mapred.lib for details).
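For reference, the generic total-order sorting machinery that ships with Hadoop 0.20 can be wired up roughly as follows. This is a hedged sketch, not TeraSort's own code: TeraSort does its sampling inside its own TeraInputFormat, and the partition-file path /tmp/_partition.lst below is a made-up example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderSortSketch {
    public static void configure(JobConf job) throws Exception {
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Where the sorted split points are stored (TeraSort calls its
        // equivalent file _partition.lst); hypothetical path.
        TotalOrderPartitioner.setPartitionFile(job, new Path("/tmp/_partition.lst"));

        // Sample at most 100000 records from at most 10 splits, mirroring
        // the defaults quoted below.
        InputSampler.Sampler<Text, Text> sampler =
            new InputSampler.RandomSampler<Text, Text>(0.1, 100000, 10);
        InputSampler.writePartitionFile(job, sampler);
    }
}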

Number of records to sample: sampleSize = conf.getLong("terasort.partitions.sample", 100000);

Number of splits sampled: samples = Math.min(10, splits.length); (splits is the array of all input splits.)

Number of records drawn from each split: recordsPerSample = sampleSize / samples;

The sampled records are fully sorted, and the resulting "split points" are written to the file _partition.lst, which is stored in the distributed cache.

Example: suppose the sampled data is b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk.

After sorting, we get abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr.

If the number of reduce tasks is 4, the split points are abd, bcd, and mnk.
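The split points are taken at evenly spaced positions in the sorted sample. Below is a minimal illustrative sketch (the class and method names are assumptions, and the exact index arithmetic and rounding differ between Hadoop versions, which is why it does not land on exactly the same third boundary as the example above):

import java.util.Arrays;

public class SplitPointSketch {
    // Picks R - 1 split points at evenly spaced positions in a sorted sample.
    public static String[] choose(String[] sortedSample, int numReduces) {
        String[] points = new String[numReduces - 1];
        for (int i = 1; i < numReduces; i++) {
            points[i - 1] = sortedSample[i * sortedSample.length / numReduces];
        }
        return points;
    }

    public static void main(String[] args) {
        String[] sample = {"abc", "abcd", "abd", "afd", "b",
                           "bcd", "efg", "hii", "mnk", "rrr"};
        // With this rounding the boundaries come out as abd, bcd, hii;
        // the article's example arrives at abd, bcd, mnk.
        System.out.println(Arrays.toString(choose(sample, 4)));
    }
}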

(2) Map tasks tag data records

Each map task reads the split points from the file _partition.lst and builds a trie (assumed here to be a two-level trie, i.e. organized by the first two bytes of each key).

The map task then reads records from its input split and, for each record, looks up the corresponding reduce task number in the trie. For example, abg maps to the second reduce task and mnz to the fourth.
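The real implementation builds the trie out of inner and leaf node objects; the sketch below flattens the same two-level idea into a 128 x 128 table indexed by the first two bytes of a key (the class name and logic are illustrative assumptions, and keys are assumed to be ASCII and at least two bytes long; TeraSort keys are 10 bytes). When a split point shares a two-byte prefix with a key, the prefix alone cannot decide the partition, so the sketch falls back to comparing full keys, much as the real trie falls back to comparisons at its leaves.

public class TwoLevelTrieSketch {
    private final String[] splitPoints;           // R - 1 sorted split points
    private final int[][] table = new int[128][128];

    public TwoLevelTrieSketch(String[] splitPoints) {
        this.splitPoints = splitPoints;
        for (int a = 0; a < 128; a++) {
            for (int b = 0; b < 128; b++) {
                String prefix = "" + (char) a + (char) b;
                // If a split point starts with this prefix, keys sharing it
                // straddle a partition boundary: mark for a full comparison.
                boolean ambiguous = false;
                for (String s : splitPoints) {
                    if (s.startsWith(prefix)) {
                        ambiguous = true;
                    }
                }
                table[a][b] = ambiguous ? -1 : countLessOrEqual(prefix);
            }
        }
    }

    // Partition of a key = number of split points less than or equal to it
    // (a key equal to a split point goes to the higher partition).
    private int countLessOrEqual(String key) {
        int n = 0;
        for (String s : splitPoints) {
            if (s.compareTo(key) <= 0) {
                n++;
            }
        }
        return n;
    }

    public int getPartition(String key) {
        int cached = table[key.charAt(0)][key.charAt(1)];
        return cached >= 0 ? cached : countLessOrEqual(key);
    }

    public static void main(String[] args) {
        TwoLevelTrieSketch trie =
            new TwoLevelTrieSketch(new String[] {"abd", "bcd", "mnk"});
        System.out.println(trie.getPartition("abg")); // 1 -> second reduce task
        System.out.println(trie.getPartition("mnz")); // 3 -> fourth reduce task
    }
}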

 

(3) Reduce tasks perform local sorting

Each reduce task performs local sorting and outputs results in sequence.

4. References

(1) Hadoop's 1 TB sorting with TeraSort:

http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html

(2) Hadoop-0.20.2 code

(3) http://sortbenchmark.org/

Original article. When reprinting, please note that it is reposted from Dong's blog.

Link: http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/

Author: Dong, http://dongxicheng.org/about/

A collection of articles from this blog: http://dongxicheng.org/recommend/

 

 
