Massive data sorting on the Hadoop platform (2)

When using Hadoop for the GraySort benchmark, Yahoo!'s researchers slightly modified the map/reduce application described above to meet the new rules. It is divided into four parts:

TeraGen is the map/reduce program that generates the data;

TeraSort samples the data and uses map/reduce to sort it;

TeraSum is a map/reduce program that computes the CRC32 checksum of each key/value pair and adds the checksums into a 128-bit total;

TeraValidate is a map/reduce program that verifies the output data is sorted and computes the sum of the checksums.

TeraGen and TeraSort work as described above; TeraValidate is also unchanged except that it additionally computes the checksum total of the output directory.

TeraSum computes the CRC32 checksum of each key/value pair: each map task computes the checksums of its input and outputs their sum, and a single reduce task adds up the sums produced by all of the maps. The program is used both to compute the checksum total of every key/value pair in the input directory and to check the correctness of the sorted output.
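This idea can be sketched with the standard Hadoop map/reduce API. The sketch below is only an illustration of the technique, not Yahoo!'s actual TeraSum code: the class names are invented, the Text/Text input types are assumed, and BigInteger stands in for the 128-bit accumulator used in the real benchmark.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.util.zip.CRC32;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ChecksumSumSketch {

  // Each map task sums CRC32(key, value) over every record it reads and
  // emits one partial sum when it finishes.
  public static class SumMapper extends Mapper<Text, Text, NullWritable, Text> {
    private BigInteger partial = BigInteger.ZERO;
    private final CRC32 crc = new CRC32();

    @Override
    protected void map(Text key, Text value, Context context) {
      crc.reset();
      crc.update(key.getBytes(), 0, key.getLength());
      crc.update(value.getBytes(), 0, value.getLength());
      partial = partial.add(BigInteger.valueOf(crc.getValue()));
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // A single NullWritable key routes every partial sum to one reducer.
      context.write(NullWritable.get(), new Text(partial.toString(16)));
    }
  }

  // The single reduce task adds the per-map partial sums into the total.
  public static class SumReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      BigInteger total = BigInteger.ZERO;
      for (Text v : values) {
        total = total.add(new BigInteger(v.toString(), 16));
      }
      context.write(NullWritable.get(), new Text(total.toString(16)));
    }
  }
}
```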

Figure 3-10 The number of tasks in each phase

The benchmark was run on Yahoo!'s Hammer cluster; the details of the cluster are as follows:

Nearly 3,800 nodes (in a cluster this large, some nodes are always down);

Two dual-core 2.5 GHz Xeon processors per node;

4 SATA hard drives per node;

8 GB of memory per node (upgraded to 16 GB before the petabyte sort);

1 Gb/s Ethernet per node;

Each rack has 40 nodes;

An 8 Gb/s Ethernet uplink from each rack to the core switch;

The operating system was Red Hat Enterprise Linux Server release 5.1 (kernel 2.6.18);

The JDK was the Sun Java JDK (1.6.0_05-b13 and 1.6.0_13-b03), in 32-bit and 64-bit versions.

For the larger sorts, the NameNode and JobTracker were run in 64-bit JVMs. The Hadoop platform used for the sort tests was also modified in several ways, including the following:

The reducer side of Hadoop's shuffle phase was reimplemented; after the redesign the shuffle performs better, a bottleneck is removed, and the code is easier to maintain and understand;

The new shuffle fetches the results of several maps from a node in a single request rather than one result at a time, which avoids unnecessary connection and transfer overhead;

The shuffle connection timeout is now configurable and can be shortened for small-scale sorts; in some cases the shuffle stalls until the timeout expires, which adds latency to the task;

TCP no-delay was enabled and the ping frequency between the TaskTracker and its tasks was increased, to reduce the delay in detecting problems;

Code was added to verify the correctness of the data transferred during the shuffle, to prevent reduce-task failures;

LZO compression is used on the map output; LZO compresses the data to roughly 45% of its original size;

In the shuffle phase, map results are now merged in memory before being handed to the reduce (a memory-to-memory merge), which reduces the work the reduce has to do when it runs;

The sampling process was implemented with multiple threads, and a simpler partitioner was written that assumes the key values are evenly distributed (see the sketch after this list);

On the smaller clusters, the heartbeat interval between the TaskTrackers and the JobTracker was configured to be shorter to reduce latency (the default is 10 seconds per 1,000 nodes; it was configured to 2 seconds per 1,000 nodes);

By default, the JobTracker assigns tasks on a first-come, first-served basis, and this greedy assignment does not place tasks on TaskTrackers with good data locality. Viewed globally, assigning map tasks well gives the system a better distribution, but implementing a global scheduling policy for all Hadoop programs is very difficult, so a global scheduling policy was implemented only for TeraSort;

Hadoop 0.20 added job setup and cleanup tasks, but they are not required by the sort benchmark and were disabled to reduce the latency of starting and finishing the job;

Some hard-coded wait loops in the framework, which are irrelevant for large jobs, were removed because they increase task latency;

The log level of a task can now be configured; lowering it from INFO to WARN reduces the amount of logging and improves performance, but it makes debugging and analysis harder;

The task-assignment code was being optimized, but this work was not finished; at present it can spend a great deal of time issuing RPC requests to the NameNode for the input files.
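The simpler partitioner mentioned in the list above can be sketched as follows. This is only an illustration under the assumption that the keys are uniformly distributed random bytes, not the exact Yahoo! implementation: the leading bytes of the key are mapped linearly onto the reducer range, giving partitions that are both roughly equal in size and already in key order.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a partitioner for evenly distributed keys (class name invented).
public class UniformKeyPartitioner extends Partitioner<Text, Text> {

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Read up to the first 4 bytes of the key as an unsigned 32-bit prefix;
    // missing bytes count as zero.
    long prefix = 0;
    byte[] bytes = key.getBytes();
    for (int i = 0; i < 4; i++) {
      int b = i < key.getLength() ? (bytes[i] & 0xff) : 0;
      prefix = (prefix << 8) | b;
    }
    // Scale the prefix into [0, numPartitions); with uniform keys every
    // reducer receives about the same amount of data, in sorted order.
    return (int) ((prefix * numPartitions) >>> 32);
  }
}
```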

Compared with the earlier tests, Hadoop has improved significantly and can do more work in less time. It is worth noting that large clusters and distributed applications must move large amounts of data, which causes large variations in execution time; but as Hadoop improves it handles hardware failures better, and this variation becomes insignificant. The time required to sort data sets of different sizes is shown in Table 3-2.

Because the smaller data sets need lower latency and a faster network, only part of the cluster's nodes were used for those runs. The number of output replicas for the smaller runs was set to 1, because the whole job is short and runs on a smaller cluster, where a node failure is unlikely. At the larger scales node failures are inevitable, so the replica count was set to 2. Since HDFS places the replicas on different nodes, no data is lost when a failed node is replaced.
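As a hedged sketch of how such a choice might be expressed in a job driver (this is not code from the benchmark, and the class and method names are invented), the standard HDFS property dfs.replication can be set in the job configuration so that the files written by the reduce tasks get the desired number of copies:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortJobTuning {
  // Builds a job whose output replication matches the scale of the run:
  // 1 copy for short, small runs; 2 copies when node failures are expected.
  public static Job buildJob(boolean petabyteScale) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", petabyteScale ? 2 : 1);
    return Job.getInstance(conf, "sort-benchmark");
  }
}
```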

Yahoo!'s researchers tracked how the number of tasks reported by the JobTracker changed over time after job submission; Figure 3-11, Figure 3-12, Figure 3-13, and Figure 3-14 show the number of tasks at each point in time. Maps have only one phase, while reduces have three: shuffle, merge, and reduce. The shuffle transfers data from the maps; the merge was not needed in these tests; and the reduce phase aggregates the results and writes them to HDFS. Comparing these graphs with Figure 3-6 shows that tasks are now launched much faster. In Figure 3-6 one task was launched per heartbeat, so it took 40 seconds to launch all of the tasks; now Hadoop can fill a whole TaskTracker in a single heartbeat. Clearly, reducing the overhead of launching tasks matters.

When running at large scale, data transfer also has a large effect on task performance. In the petabyte sort, each map processed 15 GB of data instead of the default 128 MB, and each reduce handled 50 GB. Processing at 1.5 GB per map would have taken 40 hours to complete, so increasing the size of each block is important for raising throughput.
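A back-of-the-envelope calculation makes the point (illustrative arithmetic only; apart from the split sizes, the numbers are not from the article): at 1 PB the split size determines how many map tasks have to be scheduled.

```java
public class MapCountEstimate {
  public static void main(String[] args) {
    final long PB = 1L << 50;            // 1 PB in bytes
    long[] splitSizes = {
        128L << 20,      // 128 MB, the default block size
        (3L << 30) / 2,  // 1.5 GB
        15L << 30        // 15 GB, used for the petabyte sort
    };
    for (long split : splitSizes) {
      long maps = (PB + split - 1) / split;   // ceil(1 PB / split size)
      System.out.printf("split of %,d bytes -> about %,d map tasks%n",
          split, maps);
    }
  }
}
```

With 128 MB splits the job would need more than eight million map tasks, versus roughly 70,000 at 15 GB per map, which is why the block size had to grow.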

Figure 3-11 Number of tasks over time for 500 GB of data
Figure 3-12 Number of tasks over time
Figure 3-13 Number of tasks over time for 100 TB of data
Figure 3-14 Number of tasks over time for 1 PB of data