TeraSort algorithm analysis in Hadoop


1. Overview
Sorting 1 TB of data is a standard way to measure the data processing capability of a distributed data processing framework. TeraSort is a sort job that ships with Hadoop; in 2008, Hadoop won first place in the 1 TB sort benchmark, finishing in 209 seconds. So how is TeraSort implemented in Hadoop? This article analyzes the TeraSort job from the angle of algorithm design.
2. Algorithm idea

When we want to turn a traditional serial sorting algorithm into a parallel one, the usual strategy is divide and conquer: split the data to be sorted into m blocks (for example, by hashing), have each map task sort one block locally, and then have a single reduce task merge all the data into one total order. This design achieves a high degree of parallelism in the map phase, but there is no parallelism at all in the reduce phase.
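The naive scheme above can be sketched as follows. This is illustrative code, not Hadoop's implementation; the class and method names are invented for the example. The point is that the final merge runs in a single task, no matter how large m is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not Hadoop code) of the naive parallel sort:
// hash-split the records into m blocks, sort each block in the "map"
// phase, then let a single "reduce" step merge everything.
public class NaiveParallelSort {
    public static List<String> sort(List<String> records, int m) {
        // "Map" phase: hash each record into one of m blocks, then sort
        // each block locally (each local sort could run in parallel).
        List<List<String>> blocks = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            blocks.add(new ArrayList<>());
        }
        for (String r : records) {
            blocks.get(Math.floorMod(r.hashCode(), m)).add(r);
        }
        for (List<String> block : blocks) {
            Collections.sort(block);
        }
        // "Reduce" phase: one task performs an m-way merge of the sorted
        // blocks -- this single sequential step is the bottleneck.
        List<String> out = new ArrayList<>();
        int[] pos = new int[m];
        for (int done = 0; done < records.size(); done++) {
            int best = -1;
            for (int i = 0; i < m; i++) {
                if (pos[i] < blocks.get(i).size()
                        && (best < 0 || blocks.get(i).get(pos[i])
                                .compareTo(blocks.get(best).get(pos[best])) < 0)) {
                    best = i;
                }
            }
            out.add(blocks.get(best).get(pos[best]++));
        }
        return out;
    }
}
```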


To improve parallelism in the reduce phase, the TeraSort job refines the algorithm above. In the map phase, each map task partitions its data into R blocks (R is the number of reduce tasks), where every key in block i is smaller than every key in block i+1. In the reduce phase, reduce task i sorts block i from all map tasks, so the output of reduce task i is entirely smaller than that of reduce task i+1; concatenating the outputs of reduce tasks 1 through R in order yields the final sorted result. This design is clearly more efficient than the first one, but it is harder to implement, because two technical difficulties must be solved: first, how to determine the boundaries of the R blocks for each map task's data; second, given a record, how to quickly determine which block it belongs to. The answers are "sampling" and a "trie", respectively.
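The range-partitioning idea can be sketched with a simple binary search over the split points. This is a minimal illustration (not Hadoop's actual TotalOrderPartitioner; the class name is invented): given the R-1 sorted split points, a key belongs to reduce task i (0-based) exactly when it is at least split point i-1 and below split point i.

```java
import java.util.Arrays;

// Minimal sketch of range partitioning: the number of split points that
// are <= a key is exactly that key's 0-based reduce task number.
public class RangePartitioner {
    private final String[] splitPoints; // sorted, length R - 1

    public RangePartitioner(String[] splitPoints) {
        this.splitPoints = splitPoints.clone();
    }

    // Returns the 0-based reduce task number for a key.
    public int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1;
        // the insertion point is exactly the number of split points < key.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

With split points {abd, bcd, mnk} and R = 4, keys below abd go to task 0, keys in [abd, bcd) to task 1, and so on.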


3. TeraSort algorithm
3.1 TeraSort algorithm flow
Hadoop's TeraSort sorting algorithm consists of three steps: sampling -> map tasks tag each record with its reduce task number -> reduce tasks perform a local sort.

Data sampling takes place on the JobClient side: first extract a portion of the input data and sort it, then divide it into R blocks, identify the upper and lower bounds of each block (called "split points"), and save the split points to the distributed cache.
In the map phase, each map task first reads the split points from the distributed cache and builds a trie (a two-level trie whose leaf nodes hold reduce task numbers). It then starts processing the data: for each record, it looks up in the trie the number of the reduce task the record belongs to, and saves the record tagged with that number.
In the reduce phase, each reduce task reads its corresponding data from every map task and sorts it locally; finally, the reduce tasks' results are output in order of reduce task number.
3.2 TeraSort algorithm key points
(1) Sampling
Hadoop ships with several data sampling tools, including IntervalSampler, RandomSampler, and SplitSampler (see org.apache.hadoop.mapred.lib for details).
Number of records to sample: sampleSize = conf.getLong("terasort.partitions.sample", 100000);
Number of input splits selected for sampling: samples = Math.min(10, splits.length); where splits is the array of all input splits. Number of records sampled per split: recordsPerSample = sampleSize / samples;
The sampled data is fully sorted, and the resulting "split points" are written to the file _partition.lst, which is stored in the distributed cache.
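A simplified sketch of how split points can be derived from the sorted sample (the real job writes them to _partition.lst; this hypothetical version just picks R-1 evenly spaced sample elements as block boundaries, and assumes the sample has at least R elements):

```java
import java.util.Arrays;

// Illustrative split-point selection: sort the sample, then take the
// element roughly i/R of the way through it as boundary i.
public class SplitPointChooser {
    public static String[] choose(String[] sample, int numReduceTasks) {
        String[] sorted = sample.clone();
        Arrays.sort(sorted);
        String[] splitPoints = new String[numReduceTasks - 1];
        float stepSize = sorted.length / (float) numReduceTasks;
        for (int i = 1; i < numReduceTasks; i++) {
            // Boundary i sits roughly i/R of the way through the sample.
            splitPoints[i - 1] = sorted[Math.round(stepSize * i)];
        }
        return splitPoints;
    }
}
```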
Example: suppose the sampled data is b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk. After sorting we get: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr. If the number of reduce tasks is 4, the split points are: abd, bcd, mnk.
(2) Map tasks tag the data records
Each map task reads the split points from the file _partition.lst and builds the trie (assumed here to be a 2-level trie, i.e., organized using the first two bytes of each key).

The map task reads data from its input split and finds the reduce task number for each record through the trie. For example: abg corresponds to the second reduce task, and mnz to the fourth.
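The two-level trie lookup can be sketched as follows. This is a simplified illustration in the spirit of TeraSort's partitioner, not its actual code (the class name is invented, and single-byte characters such as ASCII keys are assumed). A 256x256 table maps a key's first two bytes straight to a reduce task number; any prefix that some split point shares is marked ambiguous (-1) and resolved by a full binary search, so the common case is a single table lookup.

```java
import java.util.Arrays;

// Sketch of a two-level trie: precompute, for every possible 2-byte
// prefix, how many split points lie strictly below it. If no split point
// shares the prefix, that count is the partition for every key with the
// prefix; otherwise the entry is -1 and we fall back to binary search.
public class TwoLevelTrie {
    private final String[] splitPoints; // sorted, length R - 1
    private final int[] table = new int[256 * 256];

    public TwoLevelTrie(String[] splitPoints) {
        this.splitPoints = splitPoints.clone();
        for (int idx = 0; idx < table.length; idx++) {
            int below = 0, equal = 0;
            for (String sp : this.splitPoints) {
                int spIdx = prefixIndex(sp);
                if (spIdx < idx) below++;
                else if (spIdx == idx) equal++;
            }
            table[idx] = (equal == 0) ? below : -1;
        }
    }

    // Returns the 0-based reduce task number for a key.
    public int getPartition(String key) {
        int partition = table[prefixIndex(key)];
        if (partition >= 0) return partition;
        // Ambiguous prefix: fall back to binary search over the split points.
        int pos = Arrays.binarySearch(splitPoints, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // First two bytes of the key, packed into one table index
    // (missing bytes are treated as 0).
    private static int prefixIndex(String key) {
        int b1 = key.length() > 0 ? key.charAt(0) & 0xFF : 0;
        int b2 = key.length() > 1 ? key.charAt(1) & 0xFF : 0;
        return (b1 << 8) | b2;
    }
}
```

With the split points abd, bcd, mnk from the example above, abg lands in the second reduce task (index 1) and mnz in the fourth (index 3), matching the text.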


(3) Reduce tasks sort locally
Each reduce task sorts its data locally and then outputs the result.
4. References
(1) 1 TB sorting with TeraSort for Hadoop: http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
(2) Hadoop-0.20.2 source code
(3) http://sortbenchmark.org
