Spark breaks the large-scale sorting record set by MapReduce


Over the past few years, the adoption of Apache Spark has grown at an astonishing pace. It is usually deployed as a successor to MapReduce and supports clusters of thousands of nodes. Apache Spark processes in-memory data more efficiently than MapReduce, but when the data exceeds memory capacity, we have also heard from organizations that ran into trouble with Spark. Therefore, together with the Spark community, we have invested heavily in improving Spark's stability, scalability, and performance. Since Spark runs well on gigabytes or terabytes of data, it should do the same on petabytes.

To evaluate this, we recently completed a Sort Benchmark (Daytona GraySort category) run on AWS, an industry benchmark that measures how fast a system can sort 100 TB of data (1 trillion records). Before that, the world record holder of this benchmark was Yahoo, which completed the computation in 72 minutes on a Hadoop MapReduce cluster of 2100 nodes. Our results show that Spark finished sorting in 23 minutes using 206 EC2 nodes. In other words, Spark sorted the same data three times faster using one-tenth of the computing resources!

In addition, although there is no official petabyte sorting competition, we pushed Spark further to sort 1 PB of data (10 trillion records). The result: using 190 nodes, the workload completed in less than 4 hours, far surpassing Yahoo's previous result of 16 hours on 3800 hosts. As far as we know, this is also the first PB-scale sort ever completed in a public cloud environment.


                Hadoop World Record     Spark 100 TB        Spark 1 PB
Data Size       102.5 TB                100 TB              1000 TB
Elapsed Time    72 mins                 23 mins             234 mins
# Nodes         2100                    206                 190
# Cores         50400                   6592                6080
# Reducers      10,000                  29,000              250,000
Rate            1.42 TB/min             4.27 TB/min         4.27 TB/min
Rate/node       0.67 GB/min             20.7 GB/min         22.5 GB/min
Daytona Rules   Yes                     Yes                 No
Environment     Dedicated data center   EC2 (i2.8xlarge)    EC2 (i2.8xlarge)
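As a quick sanity check, the per-node rates in the table follow directly from the aggregate rates and node counts. The small helper below is purely illustrative arithmetic, not part of any benchmark tooling:

```python
# Quick arithmetic check: the per-node rates in the table above are just the
# aggregate sorting rate divided by the number of nodes.

def per_node_gb_per_min(aggregate_tb_per_min, nodes):
    """Convert an aggregate rate (TB/min) into a per-node rate (GB/min)."""
    return aggregate_tb_per_min * 1000 / nodes

print(per_node_gb_per_min(1.42, 2100))  # Hadoop record: ~0.68 (table rounds to 0.67)
print(per_node_gb_per_min(4.27, 206))   # Spark 100 TB:  ~20.7 GB/min per node
print(per_node_gb_per_min(4.27, 190))   # Spark 1 PB:    ~22.5 GB/min per node
```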

Why is sorting selected?

At the core of sorting is the shuffle operation, which moves data across all hosts in the cluster. Shuffle underpins virtually every distributed data processing workload. For example, in a SQL query joining two data sources, shuffle moves the tuples that need to be joined onto the same host; collaborative filtering algorithms such as ALS likewise rely on shuffle to send user and product ratings and weights across the network.
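The data movement behind a shuffle-based join can be sketched in a few lines. This is an illustrative toy, not Spark's actual API: tuples from both inputs are hash-partitioned on the join key so matching rows land in the same partition (i.e., on the same host), where they can then be joined locally:

```python
# Toy sketch of a shuffle-based join: hash-partition both inputs on the join
# key so that rows with the same key end up in the same partition, then join
# each partition locally. Names and data are illustrative only.

def hash_partition(records, key_fn, num_partitions):
    """Assign each record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(key_fn(record)) % num_partitions].append(record)
    return partitions

users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(2, "book"), (1, "pen"), (2, "lamp")]

user_parts = hash_partition(users, lambda r: r[0], 4)
order_parts = hash_partition(orders, lambda r: r[0], 4)

# After the "shuffle", each partition pair can be joined without any
# further cross-host communication.
joined = []
for up, op in zip(user_parts, order_parts):
    names = {uid: name for uid, name in up}
    joined.extend((uid, names[uid], item) for uid, item in op if uid in names)

print(sorted(joined))
# [(1, 'alice', 'pen'), (2, 'bob', 'book'), (2, 'bob', 'lamp')]
```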

Most data pipelines start with a large amount of raw data, but the data volume shrinks as irrelevant data is filtered out or intermediate results are represented more compactly. A query over 100 TB of raw data may shuffle only a tiny fraction of those 100 TB across the network. This pattern is also reflected in the name MapReduce itself.

Sorting, however, is extremely challenging because the data volume does not decrease along the pipeline: sorting 100 TB of raw data means 100 TB of data must be shuffled across the network. Moreover, the Daytona rules require both the input and output data to be replicated for fault tolerance. In practice, sorting 100 TB of data can therefore generate 500 TB of disk I/O and 200 TB of network I/O.

For these reasons, when we looked for ways to measure and improve Spark, sorting, one of the most demanding workloads, was the natural choice for comparison.

The technical work behind these results

We have invested a lot of effort in improving Spark on ultra-large-scale workloads. In detail, three pieces of work are highly relevant to this benchmark:

First and foremost, in Spark 1.1 we introduced a brand-new shuffle implementation: sort-based shuffle (SPARK-2045). Previously, Spark's shuffle was hash-based, which required keeping P (the number of reduce partitions) buffers in memory simultaneously. In sort-based shuffle, the system uses only a single buffer at any time. This significantly reduces memory overhead and makes it possible to support workloads with hundreds of thousands of tasks in a single stage (we used 250,000 tasks in the PB sort).
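The idea behind sort-based shuffle can be illustrated with a toy sketch (this is not Spark's actual code): instead of one open buffer per reduce partition, each record is tagged with its partition id and sorted by it through a single buffer, producing one output file plus an index of partition offsets that reducers use to fetch their slice:

```python
# Toy illustration of sort-based shuffle: a single sorted buffer replaces
# P per-partition buffers. The map side writes one contiguous "file" plus
# an index of where each reduce partition's slice begins.

def sort_based_shuffle_write(records, key_fn, num_partitions):
    """Write all records into one partition-sorted 'file'; return (file, index)."""
    buffer = [(hash(key_fn(r)) % num_partitions, r) for r in records]
    buffer.sort(key=lambda pr: pr[0])  # one sort, one buffer

    data_file = [r for _, r in buffer]
    # index[p] .. index[p+1] delimit partition p's slice of the file.
    index = [0] * (num_partitions + 1)
    for pid, _ in buffer:
        index[pid + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return data_file, index

def read_partition(data_file, index, pid):
    """A reducer fetches only its contiguous slice via the offset index."""
    return data_file[index[pid]:index[pid + 1]]

records = ["apple", "banana", "cherry", "avocado", "blueberry"]
data_file, index = sort_based_shuffle_write(records, lambda w: w[0], 3)
# Every record lands in exactly one partition slice.
assert sum(len(read_partition(data_file, index, p)) for p in range(3)) == len(records)
```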

Secondly, we revamped Spark's network module to use Netty's Epoll native transport via JNI (SPARK-2468). The new module also maintains its own memory pool, bypassing the JVM's allocator and thus reducing the impact of garbage collection.

Last but not least, we built an external shuffle service (SPARK-3796) that is completely decoupled from Spark's executors. Built on the new network module described above, it keeps serving shuffle files even while Spark's executors are busy with GC.


With these three changes, a single node in our Spark cluster can sustain 3 GB/s of I/O throughput in the map stage and 1.1 GB/s in the reduce stage, squeezing all of the 10 Gbps network bandwidth available between these machines.

More technical details

TimSort: In Spark 1.1, we switched the default sorting algorithm from quicksort to TimSort, a hybrid of merge sort and insertion sort. TimSort outperforms quicksort on most real-world data sets, and does especially well on partially sorted data. TimSort is used in both the map and reduce phases.
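The advantage on partially sorted data is easy to see in CPython, whose built-in sort is itself TimSort: on an already-sorted (or fully reversed) input, TimSort detects one long run and finishes after n - 1 comparisons, whereas random input costs on the order of n log n. The comparison-counting wrapper below is just a demonstration device:

```python
# Counting comparisons shows why TimSort is cheap on nearly-sorted data:
# CPython's sorted() is TimSort, and a fully ascending (or descending) input
# is recognized as a single run after exactly n - 1 comparisons.

import random
from functools import cmp_to_key

def count_comparisons(data):
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(data, key=cmp_to_key(cmp))
    return count

n = 1024
print(count_comparisons(list(range(n))))        # n - 1: one ascending run
print(count_comparisons(list(range(n, 0, -1)))) # n - 1: one descending run
print(count_comparisons(random.sample(range(n), n)))  # roughly n * log2(n)
```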

Exploiting cache locality: In the benchmark, each record is 100 bytes and the sort key is the first 10 bytes. While profiling the sort, we noticed that the cache hit rate was unsatisfactory because every comparison required a random object-pointer dereference. We therefore redesigned the in-memory record layout to represent each record with 16 bytes (two long integers): the first 10 bytes hold the sort key, and the last 4 bytes hold the record's position (getting this right was not easy given byte order and signedness). Each comparison now needs only one cache lookup, and the entries are contiguous, avoiding random memory accesses.
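The layout can be sketched in Python with the `struct` module. Note the differences from Spark's actual implementation: here each entry is packed into 14 bytes (10-byte key plus a 4-byte index) rather than padded to two 8-byte longs, and the big-endian index stands in for the byte-order care the text mentions, so plain bytewise comparison of the packed entries orders the full records:

```python
# Illustrative sketch (not Spark's internal code) of the compact sort-entry
# layout: each 100-byte record is summarized by a fixed-size entry holding
# its 10-byte sort key followed by a 4-byte record index. Sorting these
# contiguous entries bytewise orders the records without chasing a pointer
# per comparison.

import struct

def build_sort_entries(records):
    """records: 100-byte values whose first 10 bytes are the sort key."""
    # ">10sI" = 10 key bytes, then a big-endian unsigned index; big-endian
    # ensures bytewise comparison sorts by key, then by index (stable).
    return sorted(struct.pack(">10sI", rec[:10], i)
                  for i, rec in enumerate(records))

def read_in_order(records, entries):
    """Recover the full records in sorted order via the packed index."""
    return [records[struct.unpack(">10sI", e)[1]] for e in entries]

records = [bytes([c]) * 100 for c in (3, 1, 2)]  # toy 100-byte records
entries = build_sort_entries(records)
assert read_in_order(records, entries) == sorted(records)
```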

Combining TimSort with the new cache-friendly layout reduced the CPU time spent sorting by a factor of five.

Fault tolerance at scale: many problems only surface at large scale. During this test, we saw nodes lost to network connectivity problems, the Linux kernel spinning, and nodes stalling due to memory fragmentation. Fortunately, Spark's fault tolerance mechanism is very good, and it recovered smoothly.

The power of AWS: as described above, we used 206 i2.8xlarge instances to run this I/O-intensive test. With their SSDs, these instances deliver very high I/O throughput. We placed the instances in a VPC placement group with single-root I/O virtualization (SR-IOV) enhanced networking for high performance (10 Gbps), low latency, and low jitter.

Can Spark only shine in memory?

This misunderstanding has always surrounded Spark, especially among newcomers to the community. Yes, Spark is famous for its high-performance in-memory computing, but Spark's original design intent and philosophy is a general big data processing platform, whether the data is in memory or on disk. When data cannot fit entirely in memory, essentially all Spark operators perform additional external processing. Broadly speaking, Spark's operators are a superset of MapReduce's.
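The general pattern behind such external processing is spill-and-merge. The sketch below shows the technique in miniature (with an in-memory list standing in for a spill file); it is the classic external merge sort, not Spark's actual operator code:

```python
# Sketch of the spill-to-disk pattern: when a dataset exceeds the memory
# budget, sort fixed-size chunks in memory, "spill" each sorted run (here an
# in-memory list stands in for a temp file), then stream-merge all runs.

import heapq

def external_sort(stream, memory_budget):
    """Sort an arbitrarily large stream using at most memory_budget items at once."""
    runs, buffer = [], []
    for item in stream:
        buffer.append(item)
        if len(buffer) == memory_budget:  # buffer full: sort and spill
            runs.append(sorted(buffer))
            buffer = []
    if buffer:
        runs.append(sorted(buffer))
    return list(heapq.merge(*runs))       # k-way streaming merge of the runs

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
assert external_sort(iter(data), memory_budget=3) == sorted(data)
```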

As this test demonstrates, Spark can process datasets many times larger than the cluster's aggregate memory.

Summary

Beating the large-scale data processing record set by a Hadoop MapReduce cluster is not only a proof of our work but also a validation of Spark's promise: Spark delivers better performance and scalability at any data volume. We also hope that Spark saves users both time and cost.
