Author: zhankunlin
Date: 2011-4-1
Key words: Hadoop, TeraSort
<1> TeraSort Introduction
Sorting 1 TB of data is commonly used to measure the data-processing capability of a distributed data-processing framework. TeraSort is a sort job that ships with Hadoop; in 2008 Hadoop took first place in the 1 TB sort benchmark with a time of 209 seconds.
<2> Related Materials
Hadoop MapReduce scalability test: http://cloud.csdn.net/a/20100901/278934.html
Implementing Hadoop map/reduce TeraSort with MPI: http://emonkey.blog.sohu.com/166546157.html
Analysis of the TeraSort algorithm in Hadoop: http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
1TB sort (TeraSort) for Hadoop: http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark: http://sortbenchmark.org/
Trie tree: http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html
<3> Experiment
(0) Source location
/local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/src/examples/org/apache/hadoop/examples/terasort
(1) First, run TeraGen to generate the data
[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop jar hadoop-0.20.1-examples.jar teragen 1000000 terasort/1000000-input
View the generated data
[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop fs -ls /user/root/terasort/1000000-input
Found 3 items
drwxr-xr-x   - root supergroup          0 2011-03-31 16:21 /user/root/terasort/1000000-input/_logs
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00000
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00001
Two data files are generated, each 50,000,000 B in size.
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10 terasort/1000000-input
This would generate two 500 B files, 1,000 B = 1 KB in total.
Each generated row is 100 B, so the parameter 10 means 10 rows and 1,000 B in total; 1,000,000 rows give 100,000,000 B = 100 MB.
TeraGen uses two map tasks to generate the data, and each map writes one file; for the 1,000,000-row run above the two files therefore hold 100 MB in total, 50,000,000 B each.
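As a quick sanity check on these sizes, here is a minimal sketch in plain Java; it assumes the standard 100 B TeraGen row and the two map tasks used here, and is only illustrative arithmetic, not Hadoop code.

// Recompute the expected TeraGen output sizes, assuming 100-byte rows
// and 2 map tasks (each map writes one part file).
public class TeragenSizes {
    public static void main(String[] args) {
        long rowBytes = 100;                      // each generated row is 100 B
        int maps = 2;                             // teragen ran with 2 maps here
        for (long rows : new long[] {10L, 1_000_000L}) {
            long total = rows * rowBytes;         // total bytes written to HDFS
            System.out.printf("%,d rows -> %,d B total, %,d B per part file%n",
                              rows, total, total / maps);
        }
    }
}

For 1,000,000 rows this prints 100,000,000 B total and 50,000,000 B per part file, matching the listing above.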
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
This generates 1 GB of data in two part files of 500,000,000 B each. With the default 64 MB HDFS block size, each file occupies 8 blocks, so the input is divided into 16 blocks in total and TeraSort runs with 16 map tasks.
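That block count is easy to verify; a minimal sketch, assuming the default 64 MB block size of Hadoop 0.20.1 and one input split per block of each file:

// Expected number of TeraSort map tasks for the 1 GB teragen output:
// splits are computed per input file, so each 500,000,000 B part file
// contributes ceil(500,000,000 / 67,108,864) = 8 blocks.
public class SplitCount {
    public static void main(String[] args) {
        long perFileBytes = 5_000_000L * 100;                              // 500,000,000 B per part file
        long blockBytes   = 64L * 1024 * 1024;                             // default dfs.block.size
        long blocksPerFile = (perFileBytes + blockBytes - 1) / blockBytes; // ceiling division = 8
        System.out.println(2 * blocksPerFile + " map tasks");              // prints "16 map tasks"
    }
}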
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
Generating 10000000 using 2 maps with step of 5000000
11/04/01 17:02:46 INFO mapred.JobClient: Running job: job_201103311423_0005
11/04/01 17:02:47 INFO mapred.JobClient:  map 0% reduce 0%
11/04/01 17:03:00 INFO mapred.JobClient:  map 19% reduce 0%
11/04/01 17:03:01 INFO mapred.JobClient:  map 41% reduce 0%
11/04/01 17:03:03 INFO mapred.JobClient:  map 52% reduce 0%
11/04/01 17:03:04 INFO mapred.JobClient:  map 63% reduce 0%
11/04/01 17:03:06 INFO mapred.JobClient:  map 74% reduce 0%
11/04/01 17:03:10 INFO mapred.JobClient:  map 91% reduce 0%
11/04/01 17:03:12 INFO mapred.JobClient:  map 100% reduce 0%
11/04/01 17:03:14 INFO mapred.JobClient: Job complete: job_201103311423_0005
11/04/01 17:03:14 INFO mapred.JobClient: Counters: 6
11/04/01 17:03:14 INFO mapred.JobClient:   Job Counters
11/04/01 17:03:14 INFO mapred.JobClient:     Launched map tasks=2
11/04/01 17:03:14 INFO mapred.JobClient:   FileSystemCounters
11/04/01 17:03:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/04/01 17:03:14 INFO mapred.JobClient:   Map-Reduce Framework
11/04/01 17:03:14 INFO mapred.JobClient:     Map input records=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Spilled Records=0
11/04/01 17:03:14 INFO mapred.JobClient:     Map input bytes=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Map output records=10000000
(2) Run TeraSort to sort the data
Running the TeraSort program launches 16 map tasks.
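Before the job starts, TeraSort samples the input keys to choose the reduce-partition boundaries; that is what the "Making 1 from 100000 records" and "Step size is 100000.0" lines in the output below refer to (with only one reduce task here, no cut point is actually needed). The following is a simplified sketch of the idea only, not the exact code in TeraInputFormat.writePartitionFile:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PartitionSketch {
    // Pick (reduces - 1) cut points from a sorted sample of keys; map output
    // with key < cuts[0] goes to reduce 0, and so on, giving a total order.
    static List<String> cutPoints(List<String> sampledKeys, int reduces) {
        Collections.sort(sampledKeys);                        // sort the sampled keys
        double step = sampledKeys.size() / (double) reduces;  // the "step size" in the log
        List<String> cuts = new ArrayList<String>();
        for (int i = 1; i < reduces; i++) {
            int idx = Math.min(sampledKeys.size() - 1, (int) Math.round(step * i));
            cuts.add(sampledKeys.get(idx));
        }
        return cuts;   // with reduces = 1 this list is empty, as in this run
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<String>(Arrays.asList("pear", "apple", "kiwi", "plum"));
        System.out.println(cutPoints(sample, 2));  // one cut point splitting the key space
    }
}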
[root@gd38 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar terasort terasort/1G-input terasort/1G-output
11/03/31 17:12:49 INFO terasort.TeraSort: starting
11/03/31 17:12:49 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/31 17:13:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/03/31 17:13:05 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/03/31 17:13:05 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/03/31 17:13:06 INFO mapred.JobClient: Running job: job_201103311423_0006
11/03/31 17:13:07 INFO mapred.JobClient:  map 0% reduce 0%
11/03/31 17:13:20 INFO mapred.JobClient:  map 12% reduce 0%
11/03/31 17:13:21 INFO mapred.JobClient:  map 37% reduce 0%
11/03/31 17:13:29 INFO mapred.JobClient:  map 50% reduce 2%
11/03/31 17:13:30 INFO mapred.JobClient:  map 75% reduce 2%
11/03/31 17:13:32 INFO mapred.JobClient:  map 75% reduce 12%
11/03/31 17:13:36 INFO mapred.JobClient:  map 87% reduce 12%
11/03/31 17:13:38 INFO mapred.JobClient:  map 100% reduce 12%
11/03/31 17:13:41 INFO mapred.JobClient:  map 100% reduce 25%
11/03/31 17:13:44 INFO mapred.JobClient:  map 100% reduce 31%
11/03/31 17:13:53 INFO mapred.JobClient:  map 100% reduce 33%
11/03/31 17:14:02 INFO mapred.JobClient:  map 100% reduce 68%
11/03/31 17:14:05 INFO mapred.JobClient:  map 100% reduce 71%
11/03/31 17:14:08 INFO mapred.JobClient:  map 100% reduce 75%
11/03/31 17:14:11 INFO mapred.JobClient:  map 100% reduce 79%
11/03/31 17:14:14 INFO mapred.JobClient:  map 100% reduce 82%
11/03/31 17:14:17 INFO mapred.JobClient:  map 100% reduce 86%
11/03/31 17:14:20 INFO mapred.JobClient:  map 100% reduce 90%
11/03/31 17:14:23 INFO mapred.JobClient:  map 100% reduce 93%
11/03/31 17:14:26 INFO mapred.JobClient:  map 100% reduce 97%
11/03/31 17:14:32 INFO mapred.JobClient:  map 100% reduce 100%
11/03/31 17:14:34 INFO mapred.JobClient: Job complete: job_201103311423_0006
11/03/31 17:14:34 INFO mapred.JobClient: Counters: 18
11/03/31 17:14:34 INFO mapred.JobClient:   Job Counters
11/03/31 17:14:34 INFO mapred.JobClient:     Launched reduce tasks=1
11/03/31 17:14:34 INFO mapred.JobClient:     Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:     Data-local map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:   FileSystemCounters
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:   Map-Reduce Framework
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input groups=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine output records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map input records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient:     Map output bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Map input bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine input records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input records=10000000
11/03/31 17:14:34 INFO terasort.TeraSort: done
Execution is complete and the data is sorted; the output is still 1 GB:
[root@gd38 hadoop-0.20.1]# bin/hadoop fs -ls terasort/1G-output
Found 2 items
drwxr-xr-x   - root supergroup          0 2011-03-31 17:13 /user/root/terasort/1G-output/_logs
-rw-r--r--   1 root supergroup 1000000000 2011-03-31 17:13 /user/root/terasort/1G-output/part-00000