Sorting of Massive Data on the Hadoop Platform


Yahoo! researchers used Hadoop to complete the Jim Gray sort benchmark, which actually comprises several related benchmarks, each with its own rules. All of the sort benchmarks measure the time taken to sort different numbers of records. Each record is 100 bytes: the first 10 bytes are the key and the remaining 90 bytes are the value. MinuteSort compares how much data can be sorted within one minute, while GraySort compares the sorting rate (TB/minute) achieved when sorting large-scale data (at least 100 TB). The benchmark rules are as follows:

  • The input data must exactly match the data produced by the official data generator;
  • The input data must not be in the operating system's file cache when a run starts; under Linux this means the memory must be used for other content between runs;
  • The input and output data must not be compressed;
  • The output must not overwrite the input;
  • The output files must be written to disk;
  • A 128-bit checksum, formed by summing the CRC32 of every key/value pair, must be computed for both the input and the output, and the two totals must match (see the sketch after this list);
  • If the output is split across multiple files, they must be totally ordered, that is, concatenating them must yield a fully sorted result;
  • The time taken to start up and distribute the programs across the cluster must be included in the measured time;
  • Any sampling must also be included in the measured time.
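
To make the checksum rule concrete, here is a minimal plain-Java sketch (with made-up names, not part of the benchmark tooling) that sums the CRC32 of every 100-byte record into a single 128-bit total; the totals for the input and the output must match:

```java
import java.math.BigInteger;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Each record is 100 bytes: a 10-byte key followed by a 90-byte value.
    static final int RECORD_LEN = 100;

    // Sum the CRC32 of every 100-byte record into one running total.
    static BigInteger checksum(byte[] data) {
        BigInteger total = BigInteger.ZERO;
        CRC32 crc = new CRC32();
        for (int off = 0; off + RECORD_LEN <= data.length; off += RECORD_LEN) {
            crc.reset();
            crc.update(data, off, RECORD_LEN);
            // getValue() returns the unsigned 32-bit CRC in a long
            total = total.add(BigInteger.valueOf(crc.getValue()));
        }
        return total; // far below 2^128, so it fits the 128-bit requirement
    }
}
```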

Yahoo! researchers used Hadoop to sort 1 TB of data in 62 seconds and 1 PB of data in 16.25 hours, as shown in Table 3-2, winning both the Daytona GraySort and MinuteSort categories.

Table 3-2 Data scale and sorting time

Data size (bytes)          Number of nodes   Number of replicas   Time
500,000,000,000            1406              1                    59 seconds
1,000,000,000,000          1460              1                    62 seconds
100,000,000,000,000        3452              2                    173 minutes
1,000,000,000,000,000      3658              2                    975 minutes
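
GraySort is scored in TB per minute, so the rates implied by Table 3-2 can be computed directly; a small sketch of that arithmetic (the figures are taken from the table, nothing else is assumed):

```java
public class SortRate {
    public static void main(String[] args) {
        // {bytes sorted, elapsed minutes} taken from Table 3-2
        double[][] runs = {
            {500e9, 59.0 / 60.0},   // 500 GB in 59 seconds
            {1e12, 62.0 / 60.0},    // 1 TB in 62 seconds
            {100e12, 173.0},        // 100 TB in 173 minutes
            {1e15, 975.0},          // 1 PB in 975 minutes (16.25 hours)
        };
        for (double[] run : runs) {
            double terabytes = run[0] / 1e12;
            System.out.printf("%.1f TB sorted at %.2f TB/minute%n",
                    terabytes, terabytes / run[1]);
        }
    }
}
```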

The following material, compiled from the benchmark's official website (http://sortbenchmark.org/), describes how Hadoop was used for the sort.

Yahoo! researchers wrote three Hadoop applications to sort the TB-scale data:

  • TeraGen is a map/reduce program that generates the data;
  • TeraSort samples the data and uses map/reduce to sort it;
  • TeraValidate is a map/reduce program that verifies that the output data is sorted.

TeraGen generates the data. It lays the data out in rows and divides the rows among the map tasks according to the number of tasks; each map task generates the data for its assigned range of rows. In the end, TeraGen used 1,800 tasks to generate a total of 10 billion rows of data on HDFS, with a storage block size of 512 MB.
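
The row-range assignment can be sketched as follows; this is a simplified plain-Java illustration of the idea (class and method names are made up), not TeraGen's actual code:

```java
public class RowRangeSketch {
    // Split totalRows into numTasks contiguous ranges; task i generates the
    // rows in [ranges[i][0], ranges[i][1]). Giving every map task its own
    // row range lets it generate its share of the data independently.
    static long[][] split(long totalRows, int numTasks) {
        long[][] ranges = new long[numTasks][2];
        long base = totalRows / numTasks;
        long remainder = totalRows % numTasks;
        long start = 0;
        for (int i = 0; i < numTasks; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The figures from the text: 10 billion rows over 1,800 map tasks.
        long[][] ranges = split(10_000_000_000L, 1800);
        System.out.println("task 0 generates rows " + ranges[0][0]
                + " to " + ranges[0][1]);
    }
}
```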

TeraSort is a standard map/reduce sort program, except that it uses a custom partitioner. The partitioner uses a sorted list of N-1 sampled keys to define the key range assigned to each reduce task: a record whose key satisfies sample[i-1] <= key < sample[i] is sent to the i-th reduce task. This guarantees that all output from the i-th reduce task sorts before the output of the (i+1)-th reduce task. To speed up partitioning, the partitioner builds a two-level trie over the sampled keys. Before the job is submitted, TeraSort samples the input data and writes the sampled keys to HDFS. Custom input and output formats are specified so that the three applications read and write the data correctly. The replication factor of the reduce output is set to 1 instead of the default 3, because the output does not need to be copied to multiple nodes. The job was configured with 1,800 map tasks and 1,800 reduce tasks, and the task heaps were given ample space so that the intermediate data would not spill to disk. The sampler used 100,000 keys to determine the reduce task boundaries; as Figure 3-9 shows, the resulting distribution is not perfect.
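
The partitioning rule can be sketched with a binary search over the sorted sample keys; TeraSort actually builds the two-level trie mentioned above for speed, which is omitted from this simplified, hypothetical sketch:

```java
public class SamplePartitionerSketch {
    private final byte[][] splitPoints; // the N-1 sorted sample keys

    SamplePartitionerSketch(byte[][] sortedSampleKeys) {
        this.splitPoints = sortedSampleKeys;
    }

    // The reduce index for a key is the number of sample keys that are
    // <= the key, so records with sample[i-1] <= key < sample[i] all land
    // in reduce i, and reduce i's output sorts before reduce i+1's.
    int getPartition(byte[] key) {
        int lo = 0, hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(splitPoints[mid], key) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Unsigned lexicographic comparison of the 10-byte keys.
    private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```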

TeraValidate ensures that the output data is globally sorted. It launches one map task for each file in the output directory (see Figure 3-10). Each map task checks that every key is greater than or equal to the previous key and outputs the file's minimum and maximum keys to the reduce task. The reduce task then checks that the minimum key of the i-th file is greater than the maximum key of the (i-1)-th file; if not, it produces an error report.
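
A minimal illustration of the two-level check, with the per-file check standing in for the map side and the boundary check for the reduce side (plain Java with made-up names, not TeraValidate itself):

```java
public class OrderCheckSketch {
    // "Map" side: verify that one output file is internally sorted and
    // return its first and last keys for the boundary check.
    static String[] checkFile(String[] keysInFile) {
        for (int i = 1; i < keysInFile.length; i++) {
            if (keysInFile[i].compareTo(keysInFile[i - 1]) < 0) {
                throw new IllegalStateException("file not sorted at record " + i);
            }
        }
        return new String[] {keysInFile[0], keysInFile[keysInFile.length - 1]};
    }

    // "Reduce" side: the minimum key of file i must not be smaller than the
    // maximum key of file i-1, otherwise the concatenation is not sorted.
    static void checkBoundaries(String[][] minMaxPerFile) {
        for (int i = 1; i < minMaxPerFile.length; i++) {
            if (minMaxPerFile[i][0].compareTo(minMaxPerFile[i - 1][1]) < 0) {
                throw new IllegalStateException(
                        "files " + (i - 1) + " and " + i + " overlap");
            }
        }
    }
}
```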

 

Figure 3-9 Output size and completion time distribution of reduce tasks

The preceding applications ran on a cluster built by Yahoo!, configured as follows:

  • 910 nodes;
  • four dual-core Intel Xeon processors per node;
  • four SATA hard disks per node;
  • 8 GB of memory per node;
  • a 1 gigabit Ethernet connection per node;
  • 40 nodes per rack;
  • an 8 gigabit Ethernet uplink from each rack to the core;
  • operating system: Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18);
  • JDK: Sun Java JDK 1.6.0_05-b13.

The entire sort completed in 209 seconds (3.48 minutes). Although the cluster has 910 nodes, its network core is shared with another cluster of 2,000 nodes, so the running time varies with the activity of that other cluster.

Yahoo! slightly modified the preceding map/reduce applications to comply with the new rules. The program is now divided into four parts:

  • TeraGen is the map/reduce program that generates the data;
  • TeraSort samples the data and uses map/reduce to sort it;
  • TeraSum is a map/reduce program that computes the CRC32 of every key/value pair and sums them into a 128-bit checksum;
  • TeraValidate is a map/reduce program that verifies that the output data is sorted and computes the sum of the checksums.

TeraGen and TeraSort are the same as described above. TeraValidate is also the same, except that it adds a task that computes the checksum sum of the output directory.

TeraSum computes the CRC32 checksum of every key/value pair. Each map task computes the checksum of its input and emits it, and a single reduce task then adds up the checksums produced by all of the maps. The program is used both to compute the checksum over every key/value pair in the input directory and to check the correctness of the sorted output.
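
The aggregation structure can be sketched as follows: each map emits the 128-bit sum of the CRC32s of its own records (computed as in the earlier checksum sketch), and a single reduce adds the partial sums (an illustrative sketch, not the actual TeraSum job):

```java
import java.math.BigInteger;
import java.util.List;

public class TeraSumAggregationSketch {
    // Each map task emits a single value: the sum of the CRC32s of the
    // records in its input split. One reduce task then adds the per-map
    // partial sums to produce the job's 128-bit checksum.
    static BigInteger reduce(List<BigInteger> perMapPartialSums) {
        BigInteger total = BigInteger.ZERO;
        for (BigInteger partial : perMapPartialSums) {
            total = total.add(partial);
        }
        return total;
    }
}
```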

 

Figure 3-10 Number of tasks in each stage

This benchmark ran on Yahoo!'s Hammer cluster. The details of the cluster are as follows:

  • nearly 3,800 nodes (in a cluster of this scale, some nodes are always down);
  • two dual-core Xeon processors per node;
  • four SATA hard disks per node;
  • 8 GB of memory per node (upgraded to 16 GB before the PB-scale sort);
  • a 1 gigabit Ethernet connection per node;
  • 40 nodes per rack;
  • an 8 gigabit Ethernet uplink from each rack to the core;
  • operating system: Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18);
  • JDK: Sun Java JDK 1.6.0_05-b13 and 1.6.0_13-b03 (32-bit and 64-bit).

For the large-scale sorts, the NameNode and JobTracker used 64-bit JVMs. Several changes were also made to the Hadoop platform used for the sorting tests, mainly the following:

  • The reducer side of Hadoop's shuffle stage was reimplemented; the redesign improves shuffle performance, removes a bottleneck, and makes the code easier to maintain and understand;
  • The new shuffle fetches the results of several maps from a node in one request instead of fetching one result at a time, avoiding redundant connections and transfer overhead;
  • The shuffle connection timeout is now configurable and can be reduced for small-scale sorts, because in some cases the shuffle stalls until the timeout expires, which increases task latency;
  • TCP no-delay was enabled and the ping frequency between the tasks and the TaskTracker was increased, to reduce the delay in detecting problems (see the snippet after this list);
  • Code was added to check the correctness of the data transferred during the shuffle, to prevent reduce task failures;
  • LZO is used to compress the map output; LZO reduces the data to roughly 45% of its original size;
  • During the shuffle, map outputs collected in memory are merged in memory straight into the reduce, which reduces the work left for the reduce phase;
  • The sampling process was reimplemented with multiple threads, and a simple partitioner was written that assumes the keys are evenly distributed;
  • On small clusters, the system was configured with a faster heartbeat between the TaskTrackers and the JobTracker to reduce latency (the default is 10 s per 1,000 nodes; 2 s per 1,000 nodes was used);
  • By default the JobTracker assigns tasks to TaskTrackers on a first-come, first-served basis. This greedy allocation does not distribute the data well. From a global perspective, assigning all of the maps at once would give a good distribution, but a global scheduling policy for every Hadoop program is very hard to implement, so a global scheduling policy was implemented only for TeraSort;
  • Hadoop 0.20 added setup and cleanup tasks, but they are not needed for the sort benchmark; disabling them reduces the latency of starting and ending a job;
  • Some hard-coded wait loops in the framework that are irrelevant to large jobs were removed, because they increase task latency;
  • The log level can now be set per job; reducing the logging from INFO to WARN greatly improves system performance, although it makes debugging and analysis harder;
  • The task-assignment code was optimized, although this work is not yet complete; currently a large amount of time is spent issuing RPC requests to the NameNode for the input files.
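
The TCP no-delay change mentioned in the list corresponds to disabling Nagle's algorithm on the socket; in plain Java that looks like the following (an illustrative snippet with a hypothetical address, not Hadoop's actual networking code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NoDelaySketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical host and port; in Hadoop this connection would be
        // between a running task and its TaskTracker.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("localhost", 50060), 5_000);
            // Disable Nagle's algorithm so small ping/status messages go out
            // immediately instead of being buffered, which shortens the time
            // needed to notice a problem.
            socket.setTcpNoDelay(true);
        }
    }
}
```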

Compared with the earlier tests, Hadoop has improved greatly and can run more tasks in less time. It is worth noting that large clusters and distributed applications must move a great deal of data, which causes large variations in execution time. However, as Hadoop improves it handles hardware failures better, and this variation becomes negligible. The time required to sort data at each scale is shown in Table 3-2.

Because small-scale runs demand lower latency and a faster network, only part of the cluster's nodes were used for them. The number of output replicas for the small-scale runs was set to 1; since the whole run is short and executes on a small cluster, a node failure is unlikely. In the large-scale runs node failures are unavoidable, so the replication factor was set to 2. HDFS guarantees that no data is lost when a node goes down, because the replicas are placed on different nodes.
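
Reducing the replication factor of the output is an HDFS-level setting; a hedged sketch of how a client could request it (the dfs.replication property and the FileSystem calls are standard HDFS APIs, but this is not the benchmark's own code, and the file path is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 2 output replicas instead of HDFS's default of 3, as in
        // the large-scale runs described above (the small-scale runs used 1).
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        // Alternatively, change the replication of an already written file;
        // the path here is hypothetical.
        fs.setReplication(new Path("/benchmarks/sort-output/part-00000"), (short) 2);
    }
}
```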

Yahoo! researchers analyzed how the number of running tasks, taken from the job status reported by the JobTracker, changed over time. Figure 3-11, Figure 3-12, Figure 3-13, and Figure 3-14 show the number of tasks at each point in time. Maps have only one phase, while reduces have three: shuffle, merge, and reduce. The shuffle transfers data from the maps, the merge was not needed during these tests, and the reduce phase performs the final aggregation and writes the results to HDFS. Comparing these figures with Figure 3-6 shows that tasks are now created much faster. In Figure 3-6, launching one task per heartbeat meant it took 40 seconds to start all of the tasks; now Hadoop can assign a full TaskTracker's worth of tasks in a single heartbeat. Reducing the overhead of task creation is very important.

Figure 3-11 Number of tasks over time for the 500 GB sort

 

Figure 3-12 Number of tasks over time for the 1 TB sort

 

Figure 3-13 Number of tasks over time for the 100 TB sort

 

Figure 3-14 Number of tasks over time for the 1 PB sort

When running at large scale, the amount of data each task transfers has a significant impact on performance. In the PB-scale sort, each map processed 15 GB of data instead of the default 128 MB, and each reduce handled 50 GB. At 1.5 GB per map, the run would have taken 40 hours, so increasing the block size, and with it the amount of work per task, is very important for throughput.
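
The effect of the split size can be seen from the task counts alone; a quick back-of-the-envelope sketch using the figures quoted above (the 40-hour estimate itself is not derived here):

```java
public class SplitSizeMath {
    public static void main(String[] args) {
        double petabyte = 1e15;           // total bytes sorted in the PB run
        double[] perMapBytes = {
            128 * 1024.0 * 1024.0,        // the default 128 MB split
            1.5e9,                        // 1.5 GB per map
            15e9,                         // 15 GB per map, as configured
        };
        for (double split : perMapBytes) {
            // Fewer, larger splits mean far fewer tasks to schedule and
            // start, which is where the throughput gain comes from.
            System.out.printf("%.3f GB per map -> about %,.0f map tasks%n",
                    split / 1e9, petabyte / split);
        }
    }
}
```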

----------------------------------

This article is excerpted from Chapter 3, "Sorting of Massive Data on the Hadoop Platform", of Hadoop in Action.

Author: Lu Jiaheng

Sample chapter download: http://download.csdn.net/detail/hzbooks/3704577

Official website of the book: http://datasearch.ruc.edu.cn/HadoopInAction/
