Hadoop TeraSort Benchmark Experiment



Author: zhankunlin
Date: 2011-4-1
Key Words: hadoop, TeraSort

< one > TeraSort introduction

Sorting 1 TB of data is commonly used to measure the data processing capability of a distributed data processing framework. TeraSort is a sort job that ships with Hadoop. In 2008, Hadoop won first place in the 1 TB sort benchmark, taking 209 seconds.

< two > Related Materials

Hadoop MapReduce scalability test: http://cloud.csdn.net/a/20100901/278934.html
Implementing Hadoop Map/Reduce TeraSort with MPI: http://emonkey.blog.sohu.com/166546157.html
TeraSort algorithm analysis in Hadoop: http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
Hadoop's 1 TB sort, TeraSort: http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark: http://sortbenchmark.org/
Trie tree: http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html

< three > Experiment

(0) Source location
/local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/src/examples/org/apache/hadoop/examples/terasort

(1) First, run TeraGen to generate the data

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop jar hadoop-0.20.1-examples.jar teragen 1000000 terasort/1000000-input

View the generated data

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop fs -ls /user/root/terasort/1000000-input
Found 3 items
drwxr-xr-x   - root supergroup          0 2011-03-31 16:21 /user/root/terasort/1000000-input/_logs
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00000
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00001

Two data files are generated, each 50,000,000 B (50 MB).

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10 terasort/1000000-input
This generates two files of 500 B each, for a total of 1000 B, about 1 KB.

Each generated row is 100 B. The parameter 10 means that 10 rows are generated, 1000 B in total; 1,000,000 rows amount to 100,000,000 B = 100 MB.

TeraGen uses two map tasks to generate the data. Each map writes one file, so the 1,000,000-row run produces two files totaling 100 MB, 50 MB each.
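The size arithmetic can be checked with a short sketch. It assumes the fixed 100-byte row described here and an even split of rows across the map tasks; the function name `teragen_sizes` is made up for illustration and is not part of Hadoop.

```python
ROW_BYTES = 100  # each TeraGen row is 100 bytes, as stated in the text


def teragen_sizes(rows, maps=2):
    """Return (total_bytes, bytes_per_map_file) for a TeraGen run.

    Assumes the rows divide evenly among the map tasks.
    """
    total = rows * ROW_BYTES
    return total, total // maps


if __name__ == "__main__":
    for rows in (10, 1_000_000):
        total, per_file = teragen_sizes(rows)
        print(f"{rows} rows -> {total} B total, {per_file} B per file")
```

For 1,000,000 rows this gives 100,000,000 B in total and 50,000,000 B per file, which matches the `part-00000` and `part-00001` sizes in the listing.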

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input

This produces 1 GB of data. Because an HDFS data block is 64 MB, the data is divided into 16 blocks, so there are 16 map tasks when TeraSort runs.
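The block count can be estimated with a simplified sketch. It assumes a 64 MiB HDFS block size (the Hadoop 0.20 default), one input split per block, and two input files of 500,000,000 B each, as a two-map TeraGen run of 10,000,000 rows would write. Hadoop's real split computation takes more parameters into account, and `count_splits` is a hypothetical helper, not a Hadoop API.

```python
import math

BLOCK_BYTES = 64 * 1024 * 1024  # assumed HDFS block size (64 MiB)


def count_splits(file_sizes, block_bytes=BLOCK_BYTES):
    """Estimate map tasks as one split per (possibly partial) block of each file."""
    return sum(math.ceil(size / block_bytes) for size in file_sizes)


if __name__ == "__main__":
    files = [500_000_000, 500_000_000]  # two TeraGen output files, 1 GB in total
    print(count_splits(files))
```

Each 500,000,000 B file spans 8 blocks (7 full and 1 partial), so the two files give 16 splits, consistent with the 16 map tasks reported by the TeraSort job.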

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
Generating 10000000 using 2 maps with step of 5000000
11/04/01 17:02:46 INFO mapred.JobClient: Running job: job_201103311423_0005
11/04/01 17:02:47 INFO mapred.JobClient:  map 0% reduce 0%
11/04/01 17:03:00 INFO mapred.JobClient:  map 19% reduce 0%
11/04/01 17:03:01 INFO mapred.JobClient:  map 41% reduce 0%
11/04/01 17:03:03 INFO mapred.JobClient:  map 52% reduce 0%
11/04/01 17:03:04 INFO mapred.JobClient:  map 63% reduce 0%
11/04/01 17:03:06 INFO mapred.JobClient:  map 74% reduce 0%
11/04/01 17:03:10 INFO mapred.JobClient:  map 91% reduce 0%
11/04/01 17:03:12 INFO mapred.JobClient:  map 100% reduce 0%
11/04/01 17:03:14 INFO mapred.JobClient: Job complete: job_201103311423_0005
11/04/01 17:03:14 INFO mapred.JobClient: Counters: 6
11/04/01 17:03:14 INFO mapred.JobClient:   Job Counters
11/04/01 17:03:14 INFO mapred.JobClient:     Launched map tasks=2
11/04/01 17:03:14 INFO mapred.JobClient:   FileSystemCounters
11/04/01 17:03:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/04/01 17:03:14 INFO mapred.JobClient:   Map-Reduce Framework
11/04/01 17:03:14 INFO mapred.JobClient:     Map input records=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Spilled Records=0
11/04/01 17:03:14 INFO mapred.JobClient:     Map input bytes=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Map output records=10000000
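As a sanity check, the counters in this output can be parsed to confirm the 100-byte row size. The sketch below is an illustration only: it copies two counter lines from the log into a list, and `parse_counters` is a hypothetical helper, not a Hadoop API.

```python
def parse_counters(lines):
    """Collect 'name=value' pairs from job log lines into a dict of ints."""
    counters = {}
    for line in lines:
        if "=" not in line:
            continue
        name, _, value = line.rpartition("=")
        # Keep only the counter name after the last ':' of the log prefix.
        counters[name.rsplit(":", 1)[-1].strip().lower()] = int(value)
    return counters


SAMPLE = [
    "11/04/01 17:03:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000",
    "11/04/01 17:03:14 INFO mapred.JobClient:     Map output records=10000000",
]

if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    # Bytes written divided by rows written gives the row size.
    print(c["hdfs_bytes_written"] // c["map output records"])
```

Dividing the 1,000,000,000 bytes written by the 10,000,000 output records gives 100 bytes per row.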


(2) Run TeraSort to sort the data

Running the TeraSort program launches 16 map tasks.

[root@gd38 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar terasort terasort/1G-input terasort/1G-output

11/03/31 17:12:49 INFO terasort.TeraSort: starting
11/03/31 17:12:49 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/31 17:13:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/03/31 17:13:05 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/03/31 17:13:05 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/03/31 17:13:06 INFO mapred.JobClient: Running job: job_201103311423_0006
11/03/31 17:13:07 INFO mapred.JobClient:  map 0% reduce 0%
11/03/31 17:13:20 INFO mapred.JobClient:  map 12% reduce 0%
11/03/31 17:13:21 INFO mapred.JobClient:  map 37% reduce 0%
11/03/31 17:13:29 INFO mapred.JobClient:  map 50% reduce 2%
11/03/31 17:13:30 INFO mapred.JobClient:  map 75% reduce 2%
11/03/31 17:13:32 INFO mapred.JobClient:  map 75% reduce 12%
11/03/31 17:13:36 INFO mapred.JobClient:  map 87% reduce 12%
11/03/31 17:13:38 INFO mapred.JobClient:  map 100% reduce 12%
11/03/31 17:13:41 INFO mapred.JobClient:  map 100% reduce 25%
11/03/31 17:13:44 INFO mapred.JobClient:  map 100% reduce 31%
11/03/31 17:13:53 INFO mapred.JobClient:  map 100% reduce 33%
11/03/31 17:14:02 INFO mapred.JobClient:  map 100% reduce 68%
11/03/31 17:14:05 INFO mapred.JobClient:  map 100% reduce 71%
11/03/31 17:14:08 INFO mapred.JobClient:  map 100% reduce 75%
11/03/31 17:14:11 INFO mapred.JobClient:  map 100% reduce 79%
11/03/31 17:14:14 INFO mapred.JobClient:  map 100% reduce 82%
11/03/31 17:14:17 INFO mapred.JobClient:  map 100% reduce 86%
11/03/31 17:14:20 INFO mapred.JobClient:  map 100% reduce 90%
11/03/31 17:14:23 INFO mapred.JobClient:  map 100% reduce 93%
11/03/31 17:14:26 INFO mapred.JobClient:  map 100% reduce 97%
11/03/31 17:14:32 INFO mapred.JobClient:  map 100% reduce 100%
11/03/31 17:14:34 INFO mapred.JobClient: Job complete: job_201103311423_0006
11/03/31 17:14:34 INFO mapred.JobClient: Counters: 18
11/03/31 17:14:34 INFO mapred.JobClient:   Job Counters
11/03/31 17:14:34 INFO mapred.JobClient:     Launched reduce tasks=1
11/03/31 17:14:34 INFO mapred.JobClient:     Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:     Data-local map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:   FileSystemCounters
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:   Map-Reduce Framework
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input groups=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine output records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map input records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient:     Map output bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Map input bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine input records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input records=10000000
11/03/31 17:14:34 INFO terasort.TeraSort: done
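For a benchmark, the figure of interest is the wall-clock time. The following is a minimal sketch, assuming the first and last timestamps of this log mark the start and end of the job; `elapsed_seconds` is a hypothetical helper, not part of Hadoop.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # timestamp layout used in the job log


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps that start two log lines."""
    start = datetime.strptime(first_line[:17], LOG_TIME_FORMAT)
    end = datetime.strptime(last_line[:17], LOG_TIME_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    secs = elapsed_seconds(
        "11/03/31 17:12:49 INFO terasort.TeraSort: starting",
        "11/03/31 17:14:34 INFO terasort.TeraSort: done",
    )
    sorted_bytes = 1_000_000_000  # HDFS_BYTES_WRITTEN from the counters
    print(f"{secs:.0f} s, {sorted_bytes / secs / 1e6:.1f} MB/s")
```

For this run the job takes 105 seconds to sort 1 GB, roughly 9.5 MB/s, with a single reduce task.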

The job completes and the data is sorted; the output is still 1 GB.

[root@gd38 hadoop-0.20.1]# bin/hadoop fs -ls terasort/1G-output
Found 2 items
drwxr-xr-x   - root supergroup          0 2011-03-31 17:13 /user/root/terasort/1G-output/_logs
-rw-r--r--   1 root supergroup 1000000000 2011-03-31 17:13 /user/root/terasort/1G-output/part-00000
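The listing shows the output size but not whether the rows are actually in order. The Hadoop examples jar includes a `teravalidate` program for that purpose. Independently of it, the idea can be sketched as below, assuming each 100-byte row starts with a 10-byte key and that a copy of the output is available locally as bytes. `is_sorted` and `make_row` are hypothetical helpers, and the rows built here are synthetic, not real TeraGen data.

```python
ROW_BYTES = 100  # row size stated earlier in this article
KEY_BYTES = 10   # assumed key width: the first 10 bytes of each row


def is_sorted(data):
    """Return True if consecutive rows are in non-decreasing key order."""
    if len(data) % ROW_BYTES != 0:
        raise ValueError("data is not a whole number of rows")
    previous = None
    for offset in range(0, len(data), ROW_BYTES):
        key = data[offset:offset + KEY_BYTES]
        if previous is not None and key < previous:
            return False
        previous = key
    return True


def make_row(key):
    """Build a synthetic 100-byte row for testing (not real TeraGen output)."""
    return key.ljust(ROW_BYTES - 2, b".") + b"\r\n"


if __name__ == "__main__":
    rows = [make_row(b"AAAAAAAAAA"), make_row(b"BBBBBBBBBB")]
    print(is_sorted(b"".join(rows)))            # rows in key order
    print(is_sorted(b"".join(reversed(rows))))  # rows out of order
```

With a single reduce task there is only one output file, so checking that one file is enough; with several reducers the last key of each part file would also have to be compared with the first key of the next.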
