Sorting Massive Data on the Hadoop Platform (1)


Yahoo! researchers completed the Jim Gray benchmark sort using Hadoop. The benchmark comprises several related tests, each with its own rules. All of the sort tests measure the time taken to sort different numbers of records, where each record is 100 bytes: the first 10 bytes are the key and the remaining 90 bytes are the value. MinuteSort measures how much data can be sorted in under one minute, while GraySort measures the sort rate (TB/minute) when sorting large-scale data (at least 100 TB). The benchmark rules are as follows:

The input data must exactly match the data generated by the data generator;

At the start of the run, the input data must not be in the operating system's file cache. In a Linux environment, this requires the memory to have been used for other content beforehand, flushing the cache;

Input and output data are not compressed;

The output must not overwrite the input;

The output file must be stored on disk;

A CRC32 checksum must be calculated for every key/value pair of both the input and the output, accumulated into a 128-bit sum; the input and output checksums must, of course, be equal;

The output may be divided into multiple output files, but it must be totally ordered; that is, concatenating the output files in order must yield a fully sorted sequence;

The time to start up and distribute the program to the cluster must be included in the measured time;

Any sampling must also be included in the measured time.
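As a concrete illustration of the record format and checksum rule above, here is a minimal Python sketch (the function name and in-memory handling are illustrative, not part of the benchmark tooling) that sums the CRC32 of each 100-byte record into one running total:

```python
import zlib

RECORD_SIZE = 100  # 10-byte key + 90-byte value

def checksum_records(data: bytes) -> int:
    """Sum the CRC32 of every key/value record into one total.

    The benchmark accumulates these 32-bit CRCs into a 128-bit sum;
    Python integers are unbounded, so a plain int suffices here.
    """
    total = 0
    for offset in range(0, len(data), RECORD_SIZE):
        total += zlib.crc32(data[offset:offset + RECORD_SIZE])
    return total & ((1 << 128) - 1)  # keep to 128 bits, per the rules

# Sorting reorders the records but not the multiset of per-record CRCs,
# so the input and output checksums must match.
records = [bytes([i]) * RECORD_SIZE for i in (3, 1, 2)]
assert checksum_records(b"".join(records)) == checksum_records(b"".join(sorted(records)))
```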

Yahoo! researchers used Hadoop to sort 1 TB of data in 62 seconds and 1 PB of data in 16.25 hours, as shown in Table 3-2, winning the Daytona class GraySort and MinuteSort benchmarks.
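The GraySort rate implied by these figures can be checked with a line of arithmetic (treating 1 PB as 1024 TB):

```python
tb_sorted = 1024       # 1 PB expressed in TB
minutes = 16.25 * 60   # 16.25 hours
rate = tb_sorted / minutes
assert 1.0 < rate < 1.1  # roughly 1.05 TB/minute
```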

Table 3-2 Data Size and Sorting Time

The following content on sorting with Hadoop is organized from the benchmark's official website (http://sortbenchmark.org/).

Yahoo! researchers have written three Hadoop applications to sort terabytes of data:

TeraGen is a map/reduce program that generates the data;

TeraSort samples the data and uses map/reduce to sort it;

TeraValidate is a map/reduce program that verifies that the output data is sorted.

TeraGen generates data arranged in rows, assigning a range of rows to each map task based on the total number of tasks; each map task then produces the data for its assigned range. In the end, TeraGen used 1800 tasks to generate a total of 10 billion rows of data on HDFS, stored in 512 MB blocks.
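The row-range assignment described above can be sketched as follows (a simplified illustration, not Hadoop's actual code): each of the map tasks receives a contiguous, near-equal slice of the total row count.

```python
def assign_row_ranges(total_rows: int, num_tasks: int) -> list:
    """Split [0, total_rows) into num_tasks contiguous half-open ranges.

    Earlier tasks absorb the remainder, so no two ranges differ in
    size by more than one row.
    """
    base, extra = divmod(total_rows, num_tasks)
    ranges = []
    start = 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# With Yahoo!'s configuration: 10 billion rows over 1800 map tasks.
ranges = assign_row_ranges(10_000_000_000, 1800)
assert len(ranges) == 1800
assert ranges[0][0] == 0 and ranges[-1][1] == 10_000_000_000
```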

TeraSort is a standard map/reduce sort program, except for its custom partitioner. The program uses a sorted list of N-1 sampled keys to assign rows to the reduce tasks: a key in the range sample[i-1] <= key < sample[i] is sent to the i-th reduce task. This guarantees that every key output by the i-th reduce task is less than every key output by the (i+1)-th reduce task. To speed up partitioning, the partitioner builds a two-level trie index over the sample keys. TeraSort samples the input before the job is submitted and writes the resulting sample keys to HDFS. The input and output formats are specified so that all three applications read and write data correctly. The replication factor for the reduce output is set to 1 instead of the default 3, because the output data does not need to be replicated to multiple nodes. The job was configured with 1800 map tasks and 1800 reduce tasks, with enough heap space allocated to each task to prevent the intermediate data from spilling to disk. The sampler used 100,000 keys to determine the reduce task boundaries; as Figure 3-9 shows, the resulting distribution is not perfect.
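The partitioning scheme described above can be sketched in a few lines of Python. TeraSort itself accelerates the lookup with a two-level trie over the leading bytes of the keys; this sketch uses plain binary search, which produces the same partition and is enough to show the idea (the function names are illustrative, not Hadoop's API):

```python
import bisect

def make_partitioner(sample_keys: list):
    """Build a partitioner from N-1 sorted sample keys.

    A key in the range sample[i-1] <= key < sample[i] goes to reduce
    task i; keys below the first sample go to task 0, keys at or above
    the last sample go to task N-1.
    """
    boundaries = sorted(sample_keys)
    def partition(key: bytes) -> int:
        # bisect_right sends key == boundary[i] to task i+1,
        # matching the half-open ranges above.
        return bisect.bisect_right(boundaries, key)
    return partition

# Three sample keys split the key space into four reduce tasks.
partition = make_partitioner([b"d", b"m", b"t"])
assert partition(b"apple") == 0   # < "d"
assert partition(b"dog") == 1     # "d" <= key < "m"
assert partition(b"zebra") == 3   # >= "t"
```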

TeraValidate ensures that the output data is globally sorted. It assigns one map task to each file in the output directory (as shown in Figure 3-10). Each map task checks that every value is greater than or equal to the previous one, and outputs the file's maximum and minimum keys to the reduce task. The reduce task checks that the minimum key of the i-th file is greater than the maximum key of the (i-1)-th file, and generates an error report if it is not.
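The validation logic can be sketched as a single in-memory function (a simplified illustration of the checks described above, not Hadoop's code; files are modeled as lists of keys and assumed non-empty):

```python
def validate_partitions(files: list) -> list:
    """Check that output files are internally sorted and globally ordered.

    Mirrors TeraValidate: the 'map' step checks each file and reports
    its min and max key; the 'reduce' step checks that file i's minimum
    is greater than file i-1's maximum. Returns a list of error messages.
    """
    errors = []
    extremes = []  # (min_key, max_key) per file: the map output
    for idx, keys in enumerate(files):
        for prev, cur in zip(keys, keys[1:]):
            if cur < prev:
                errors.append(f"file {idx}: key out of order")
        extremes.append((keys[0], keys[-1]))
    for i in range(1, len(extremes)):
        if extremes[i][0] <= extremes[i - 1][1]:
            errors.append(f"file {i}: min <= max of file {i - 1}")
    return errors

# A correctly partitioned output passes; an overlapping one does not.
assert validate_partitions([[b"a", b"b"], [b"c", b"d"]]) == []
assert validate_partitions([[b"a", b"z"], [b"c", b"d"]]) != []
```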

Figure 3-9 Reduce task output size and completion time distribution


The above applications ran on a cluster built by Yahoo!, configured as follows:

910 nodes

Each node has four dual-core 2.0 GHz Intel Xeon processors;

Each node has 4 SATA hard drives;

Each node has 8GB of memory;

Each node has 1 gigabit of Ethernet bandwidth;

40 nodes per rack;

Each rack has 8 gigabits of Ethernet bandwidth to the core switch;

The operating system is Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18);

The JDK is Sun Java JDK 1.6.0_05-b13.

The entire sort completed in 209 seconds (3.48 minutes). Although the sort used 910 nodes, the network core was shared with a cluster of 2000 other nodes, so run times could vary depending on the activity of the other clusters.
