TeraSort algorithm analysis in Hadoop


1. Overview
Sorting 1 TB of data is a standard way to measure the data processing capability of a distributed data processing framework. TeraSort is a sort job that ships with Hadoop; in 2008, Hadoop won first place in the 1 TB sort benchmark, finishing in 209 seconds. So how is TeraSort implemented in Hadoop? This article analyzes the TeraSort job from the angle of algorithm design.
2. Algorithm idea

When we want to turn a traditional serial sorting algorithm into a parallel one, the usual strategy is divide and conquer: split the data to be sorted into m blocks (for example, by hashing), have each map task sort one block locally, and then have a single reduce task merge all the data into one total order. This design achieves a high degree of parallelism in the map phase, but there is no parallelism at all in the reduce phase.
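The naive scheme above can be sketched as follows. This is illustrative code, not Hadoop's implementation; the class and method names are invented for the example. The point is that the final merge runs in a single task, no matter how large m is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not Hadoop code) of the naive parallel sort:
// hash-split the records into m blocks, sort each block in the "map"
// phase, then let a single "reduce" step merge everything.
public class NaiveParallelSort {
    public static List<String> sort(List<String> records, int m) {
        // "Map" phase: hash each record into one of m blocks, then sort
        // each block locally (each local sort could run in parallel).
        List<List<String>> blocks = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            blocks.add(new ArrayList<>());
        }
        for (String r : records) {
            blocks.get(Math.floorMod(r.hashCode(), m)).add(r);
        }
        for (List<String> block : blocks) {
            Collections.sort(block);
        }
        // "Reduce" phase: one task performs an m-way merge of the sorted
        // blocks -- this single sequential step is the bottleneck.
        List<String> out = new ArrayList<>();
        int[] pos = new int[m];
        for (int done = 0; done < records.size(); done++) {
            int best = -1;
            for (int i = 0; i < m; i++) {
                if (pos[i] < blocks.get(i).size()
                        && (best < 0 || blocks.get(i).get(pos[i])
                                .compareTo(blocks.get(best).get(pos[best])) < 0)) {
                    best = i;
                }
            }
            out.add(blocks.get(best).get(pos[best]++));
        }
        return out;
    }
}
```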


To improve parallelism in the reduce phase, the TeraSort job refines the algorithm above. In the map phase, each map task partitions its data into R blocks (R is the number of reduce tasks), where every key in block i is smaller than every key in block i+1. In the reduce phase, reduce task i sorts block i from all map tasks, so the output of reduce task i is entirely smaller than that of reduce task i+1; concatenating the outputs of reduce tasks 1 through R in order yields the final sorted result. This design is clearly more efficient than the first one, but it is harder to implement, because two technical difficulties must be solved: first, how to determine the boundaries of the R blocks for each map task's data; second, given a record, how to quickly determine which block it belongs to. The answers are "sampling" and a "trie", respectively.
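The range-partitioning idea can be sketched with a simple binary search over the split points. This is a minimal illustration (not Hadoop's actual TotalOrderPartitioner; the class name is invented): given the R-1 sorted split points, a key belongs to reduce task i (0-based) exactly when it is at least split point i-1 and below split point i.

```java
import java.util.Arrays;

// Minimal sketch of range partitioning: the number of split points that
// are <= a key is exactly that key's 0-based reduce task number.
public class RangePartitioner {
    private final String[] splitPoints; // sorted, length R - 1

    public RangePartitioner(String[] splitPoints) {
        this.splitPoints = splitPoints.clone();
    }

    // Returns the 0-based reduce task number for a key.
    public int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1;
        // the insertion point is exactly the number of split points < key.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

With split points {abd, bcd, mnk} and R = 4, keys below abd go to task 0, keys in [abd, bcd) to task 1, and so on.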


3. TeraSort algorithm
3.1 TeraSort algorithm flow
Hadoop's TeraSort sorting algorithm consists of three steps: sampling -> map tasks tag each record with its reduce task number -> reduce tasks perform a local sort.

Data sampling takes place on the JobClient side: first extract a portion of the input data and sort it, then divide it into R blocks, identify the upper and lower bounds of each block (called "split points"), and save the split points to the distributed cache.
In the map phase, each map task first reads the split points from the distributed cache and builds a trie (a two-level trie whose leaf nodes hold reduce task numbers). It then starts processing the data: for each record, it looks up in the trie the number of the reduce task the record belongs to, and saves the record tagged with that number.
In the reduce phase, each reduce task reads its corresponding data from every map task and sorts it locally; finally, the reduce tasks' results are output in order of reduce task number.
3.2 TeraSort algorithm key points
(1) Sampling
Hadoop ships with several data sampling tools, including IntervalSampler, RandomSampler, and SplitSampler (see org.apache.hadoop.mapred.lib for details).
Number of records to sample: sampleSize = conf.getLong("terasort.partitions.sample", 100000);
Number of input splits selected for sampling: samples = Math.min(10, splits.length); where splits is the array of all input splits. Number of records sampled per split: recordsPerSample = sampleSize / samples;
The sampled data is fully sorted, and the resulting "split points" are written to the file _partition.lst, which is stored in the distributed cache.
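A simplified sketch of how split points can be derived from the sorted sample (the real job writes them to _partition.lst; this hypothetical version just picks R-1 evenly spaced sample elements as block boundaries, and assumes the sample has at least R elements):

```java
import java.util.Arrays;

// Illustrative split-point selection: sort the sample, then take the
// element roughly i/R of the way through it as boundary i.
public class SplitPointChooser {
    public static String[] choose(String[] sample, int numReduceTasks) {
        String[] sorted = sample.clone();
        Arrays.sort(sorted);
        String[] splitPoints = new String[numReduceTasks - 1];
        float stepSize = sorted.length / (float) numReduceTasks;
        for (int i = 1; i < numReduceTasks; i++) {
            // Boundary i sits roughly i/R of the way through the sample.
            splitPoints[i - 1] = sorted[Math.round(stepSize * i)];
        }
        return splitPoints;
    }
}
```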
Example: suppose the sampled data is b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk. After sorting we get: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr. If the number of reduce tasks is 4, the split points are: abd, bcd, mnk.
(2) Map tasks tag the data records
Each map task reads the split points from the file _partition.lst and builds the trie (assumed here to be a 2-level trie, i.e., organized using the first two bytes of each key).

The map task reads data from its input split and finds the reduce task number for each record through the trie. For example: abg corresponds to the second reduce task, and mnz to the fourth.
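The two-level trie lookup can be sketched as follows. This is a simplified illustration in the spirit of TeraSort's partitioner, not its actual code (the class name is invented, and single-byte characters such as ASCII keys are assumed). A 256x256 table maps a key's first two bytes straight to a reduce task number; any prefix that some split point shares is marked ambiguous (-1) and resolved by a full binary search, so the common case is a single table lookup.

```java
import java.util.Arrays;

// Sketch of a two-level trie: precompute, for every possible 2-byte
// prefix, how many split points lie strictly below it. If no split point
// shares the prefix, that count is the partition for every key with the
// prefix; otherwise the entry is -1 and we fall back to binary search.
public class TwoLevelTrie {
    private final String[] splitPoints; // sorted, length R - 1
    private final int[] table = new int[256 * 256];

    public TwoLevelTrie(String[] splitPoints) {
        this.splitPoints = splitPoints.clone();
        for (int idx = 0; idx < table.length; idx++) {
            int below = 0, equal = 0;
            for (String sp : this.splitPoints) {
                int spIdx = prefixIndex(sp);
                if (spIdx < idx) below++;
                else if (spIdx == idx) equal++;
            }
            table[idx] = (equal == 0) ? below : -1;
        }
    }

    // Returns the 0-based reduce task number for a key.
    public int getPartition(String key) {
        int partition = table[prefixIndex(key)];
        if (partition >= 0) return partition;
        // Ambiguous prefix: fall back to binary search over the split points.
        int pos = Arrays.binarySearch(splitPoints, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // First two bytes of the key, packed into one table index
    // (missing bytes are treated as 0).
    private static int prefixIndex(String key) {
        int b1 = key.length() > 0 ? key.charAt(0) & 0xFF : 0;
        int b2 = key.length() > 1 ? key.charAt(1) & 0xFF : 0;
        return (b1 << 8) | b2;
    }
}
```

With the split points abd, bcd, mnk from the example above, abg lands in the second reduce task (index 1) and mnz in the fourth (index 3), matching the text.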


(3) Reduce tasks sort locally
Each reduce task sorts its data locally and then outputs the result.
4. References
(1) 1 TB sorting with TeraSort for Hadoop: http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
(2) Hadoop-0.20.2 source code
(3) http://sortbenchmark.org
