Over the past few years, adoption of Apache Spark has grown at an astonishing rate. It is widely regarded as the successor to MapReduce and is deployed on clusters of up to thousands of nodes. It is widely accepted that Spark is more efficient than MapReduce for data that fits in memory, but we have also heard that some organizations run into trouble with Spark when the data volume far exceeds memory capacity. Therefore, together with the Spark community, we have invested a great deal of effort in improving Spark's stability, scalability, and performance. Since Spark works well on gigabytes or terabytes of data, it should work equally well at petabyte scale.
To evaluate these efforts, we recently completed the Sort Benchmark (Daytona Gray category) with help from AWS. This industry benchmark measures how fast a system can sort 100 TB of data (1 trillion records). The previous world record was held by Yahoo, which used a 2100-node Hadoop MapReduce cluster to complete the sort in 72 minutes. In our test, Spark completed the sort in 23 minutes on 206 EC2 nodes. In other words, Spark sorted the same data 3 times faster than MapReduce while using one-tenth of the computing resources.
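The "3 times faster on one-tenth of the resources" summary follows directly from the figures quoted above. A quick check in Python, using only numbers that appear in this article:

```python
# Figures quoted above: previous record (Hadoop MapReduce) vs. the Spark run.
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = hadoop_minutes / spark_minutes    # how many times faster
node_ratio = hadoop_nodes / spark_nodes     # how many times fewer machines

print(f"speedup: {speedup:.1f}x, machines: {node_ratio:.1f}x fewer")
```

Both ratios are rounded in the prose: the speedup is slightly above 3 and the node ratio slightly above 10.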
In addition, although no official petabyte sort competition exists, we pushed Spark further to sort 1 PB of data (10 trillion records). Using 190 nodes, the workload completed in less than 4 hours, far better than the earlier result of 16 hours on 3800 hosts reported by Yahoo. As far as we know, this is also the first petabyte-scale sort ever completed in a public cloud environment.
                               Hadoop World Record     Spark 100 TB        Spark 1 PB
Data Size                      102.5 TB                100 TB              1000 TB
Elapsed Time                   72 mins                 23 mins             234 mins
# Nodes                        2100                    206                 190
# Cores                        50400                   6592                6080
# Reducers                     10,000                  29,000              250,000
Rate                           1.42 TB/min             4.27 TB/min         4.27 TB/min
Rate/node                      0.67 GB/min             20.7 GB/min         22.5 GB/min
Sort Benchmark Daytona Rules   Yes                     Yes                 No
Environment                    dedicated data center   EC2 (i2.8xlarge)    EC2 (i2.8xlarge)
Why sorting?
At the core of sorting is the shuffle operation, which moves data across all the hosts in the cluster. Shuffle underpins almost all distributed data processing workloads. For example, in a SQL query that joins two different data sources, shuffle moves the tuples to be joined onto the same host; collaborative filtering algorithms such as ALS also rely on shuffle to send user and product ratings and weights across the network.
Most data pipelines start with a large amount of raw data, but as the pipeline progresses the amount of data shrinks, either because irrelevant data is filtered out or because intermediate data is represented more compactly. In a query over 100 TB of raw data, the data shuffled across the network may be only a small fraction of the 100 TB. This pattern is reflected in the very name MapReduce.
Sorting, however, is challenging because the amount of data does not shrink along the pipeline. If 100 TB of raw data is sorted, 100 TB must also be shuffled across the network. Moreover, the Daytona category of the benchmark requires both the input and the output data to be replicated for fault tolerance. As a result, sorting 100 TB of data may generate 500 TB of disk I/O and 200 TB of network I/O.
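One way to arrive at figures of this order is to count each pass over the data. The breakdown below is an assumption made for illustration, not an accounting given by the benchmark or by the authors: it supposes that the output is stored in two copies, that shuffle files are written to and read back from local disk once, and that each extra output copy crosses the network once.

```python
DATA_TB = 100        # size of the data set being sorted
OUTPUT_COPIES = 2    # assumption: the output is stored twice for fault tolerance

disk_io_tb = (
    DATA_TB                      # map phase reads the input
    + DATA_TB                    # map phase writes shuffle files
    + DATA_TB                    # shuffle files are read back for the reducers
    + DATA_TB * OUTPUT_COPIES    # reduce phase writes the replicated output
)
network_io_tb = (
    DATA_TB                            # shuffle moves every record between hosts
    + DATA_TB * (OUTPUT_COPIES - 1)    # each extra output copy goes to another host
)
print(disk_io_tb, network_io_tb)
```

Under these assumptions the totals match the 500 TB and 200 TB quoted above; a different replication factor or spill pattern would change them.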
For these reasons, when we looked for a way to measure and improve Spark, sorting, one of the most demanding workloads, was the natural choice.
The technical work behind this result
We put a lot of effort into making Spark work well on very large workloads. Specifically, three major pieces of work relate to this benchmark:
First and foremost, in Spark 1.1 we introduced a new shuffle implementation, sort-based shuffle (SPARK-2045). Previously, Spark used a hash-based shuffle implementation, which had to keep P buffers in memory at the same time, where P is the number of reduce partitions. With sort-based shuffle, the system uses only one buffer at any given time. This significantly reduces memory overhead, so a single stage can support hundreds of thousands of tasks (we used 250,000 tasks in the PB sort).
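The difference between the two strategies can be sketched in a few lines of Python. This is a toy model, not Spark's implementation: the function names, the in-memory lists standing in for files, and the modulo partitioner are all assumptions made for illustration. The point is only how many output buffers must be open at once.

```python
from itertools import groupby

def partition_of(key, num_partitions):
    # Hypothetical partitioner: integer keys, modulo the number of reducers.
    return key % num_partitions

def hash_shuffle_write(records, num_partitions):
    """Hash-based: one buffer per reduce partition, all live at the same time."""
    buffers = [[] for _ in range(num_partitions)]    # P buffers held in memory
    for key, value in records:
        buffers[partition_of(key, num_partitions)].append((key, value))
    return buffers

def sort_shuffle_write(records, num_partitions):
    """Sort-based: order records by partition id, then write one sequential
    output plus an index of where each partition ends."""
    pid_of = lambda kv: partition_of(kv[0], num_partitions)
    tagged = sorted(records, key=pid_of)
    output, index = [], [0] * (num_partitions + 1)
    for pid, group in groupby(tagged, key=pid_of):
        output.extend(group)             # a single output is written at a time
        index[pid + 1] = len(output)
    # Partitions with no records inherit the previous offset.
    for pid in range(1, num_partitions + 1):
        index[pid] = max(index[pid], index[pid - 1])
    return output, index

def read_partition(output, index, pid):
    # A reducer fetches its slice of the single output using the index.
    return output[index[pid]:index[pid + 1]]

records = [(k, str(k)) for k in [5, 3, 8, 1, 4, 7, 2, 9]]
output, index = sort_shuffle_write(records, 3)
print(read_partition(output, index, 2))
```

Both writers deliver the same records to each reducer; the sort-based one replaces P concurrently open buffers with one sorted output and a small index.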
Second, we revamped Spark's network module to use Netty's epoll-based native socket transport via JNI (SPARK-2468). The new module also maintains its own memory pool, bypassing the JVM's memory allocator and thereby reducing the impact of garbage collection.
Last but not least, we created an external shuffle service (SPARK-3796) that is fully decoupled from the Spark executor itself. This new service builds on the network module described above and ensures that shuffle files can still be served while an executor is paused in garbage collection.
With these three changes, our Spark cluster sustained 3 GB/s of I/O throughput per node during the map phase and 1.1 GB/s of network throughput per node during the reduce phase, saturating the 10 Gbps network link available between these machines.
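As a quick unit check on the reduce-phase figure, the per-node rate can be converted from bytes to bits and compared with the link speed. The sketch assumes that "GB" means gigabytes, that "Gbps" means gigabits per second, and that both use decimal prefixes.

```python
reduce_gb_per_s = 1.1    # per-node network rate quoted above (gigabytes/s)
link_gbit_per_s = 10     # link speed quoted above (gigabits/s)

used_gbit_per_s = reduce_gb_per_s * 8            # 8 bits per byte
utilization = used_gbit_per_s / link_gbit_per_s  # fraction of the link in use
print(f"{used_gbit_per_s:.1f} Gbit/s, {utilization:.0%} of the link")
```

Once protocol overhead is taken into account, a payload rate this close to the nominal link speed leaves little headroom.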
More technical details
Timsort: In Spark 1.1, we switched the default sorting algorithm from quicksort to Timsort, which is derived from merge sort and insertion sort. On most real-world datasets, Timsort is more efficient than quicksort, and it performs especially well on partially sorted data. We used Timsort in both the map phase and the reduce phase.
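The advantage on partially sorted data is easy to observe. CPython's built-in `sorted` belongs to the Timsort family (an assumption about the interpreter in use; the exact merge policy varies between versions), so counting comparisons on a nearly sorted list versus a shuffled one gives a rough feel for the effect. The `Counted` wrapper is a hypothetical helper written for this sketch and has nothing to do with Spark's own code.

```python
import random

class Counted:
    """Wraps a value and counts how often it is compared."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def comparisons_to_sort(values):
    # Reset the counter, sort wrapped copies, report the comparisons used.
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons

rng = random.Random(0)
n = 10_000
shuffled = list(range(n))
rng.shuffle(shuffled)

# Nearly sorted input: two long ascending runs placed back to back.
nearly_sorted = list(range(n // 2, n)) + list(range(n // 2))

print("shuffled:     ", comparisons_to_sort(shuffled))
print("nearly sorted:", comparisons_to_sort(nearly_sorted))
```

The nearly sorted input needs far fewer comparisons because the algorithm detects the existing runs and merges them instead of sorting from scratch.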
Exploiting cache locality: In the sort benchmark, each record is 100 bytes long, and the sort key is the first 10 bytes. While profiling the sort program, we noticed that the cache hit rate was poor because each comparison required a random object pointer lookup. To fix this, we redesigned the in-memory layout of the records so that each record is represented by a 16-byte entry (two longs). The first 10 bytes hold the sort key, and the last 4 bytes hold the position of the record (in practice it is slightly more complicated than this because of endianness and signedness). As a result, each comparison needs only a cache lookup, and the lookups are mostly sequential, which avoids random memory access.
By using Timsort and the new layout to exploit cache locality, we reduced the CPU time spent on sorting by a factor of 5.
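The 16-byte entry described above can be sketched as follows. This is an illustration of the layout only, under stated assumptions: the two bytes between key and position are zero padding, all fields are packed big-endian and unsigned (which sidesteps the endianness and signedness complications mentioned above), and `pack_entry` and `sort_records` are hypothetical names. Python objects do not give the cache behavior of a flat array of longs, so the sketch shows the encoding, not the speedup.

```python
import os
import struct

RECORD_SIZE, KEY_SIZE = 100, 10

def pack_entry(record, position):
    """Build a 16-byte sort entry: 10-byte key, 2 zero bytes of padding,
    4-byte record position. Big-endian unsigned fields keep the numeric
    order identical to the byte-wise order of the keys."""
    raw = record[:KEY_SIZE] + b"\x00\x00" + struct.pack(">I", position)
    return struct.unpack(">QQ", raw)    # two unsigned 64-bit integers

def sort_records(records):
    entries = [pack_entry(r, i) for i, r in enumerate(records)]
    entries.sort()                      # compares only the compact entries
    # The low 32 bits of the second integer hold the original position.
    return [records[lo & 0xFFFFFFFF] for _, lo in entries]

records = [os.urandom(RECORD_SIZE) for _ in range(1000)]
ordered = sort_records(records)
print(ordered == sorted(records, key=lambda r: r[:KEY_SIZE]))
```

Sorting touches only the compact entries; the full 100-byte records are looked up once, at the end, through the stored positions.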
Fault tolerance at scale: Many problems surface only at large scale. During this test, we saw nodes drop out because of network connectivity problems, the Linux kernel spinning in a loop, and nodes stalling because of memory defragmentation. Fortunately, Spark's fault tolerance worked well, and it recovered from these failures smoothly.
The power of AWS: As mentioned above, we used 206 i2.8xlarge instances to run this I/O-intensive test. These instances deliver very high I/O throughput through SSDs. We put the instances into a placement group in a VPC to enable enhanced networking via single root I/O virtualization (SR-IOV), which provides high performance (10 Gbps), low latency, and low jitter.
Does Spark shine only in memory?
This misconception has long surrounded Spark, especially among newcomers to the community. Spark is indeed well known for its in-memory performance, but it was designed from the start as a general-purpose big data processing platform that works both in memory and on disk. Almost all Spark operators fall back to external (on-disk) processing when the data does not fit in memory. Put simply, Spark's operators are a superset of MapReduce.
As this test shows, Spark can handle datasets many times larger than the cluster's total memory.
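The general idea behind such external processing can be illustrated with a classic external merge sort: sort what fits in memory, spill each sorted run to disk, then stream-merge the runs. The sketch below is a generic textbook version in Python, not Spark's code; `external_sort` and `max_in_memory` are hypothetical names, and it assumes integer input and `max_in_memory >= 1`.

```python
import heapq
import tempfile

def external_sort(values, max_in_memory):
    """Yield the integers in `values` in ascending order while holding at
    most `max_in_memory` of them in memory during the run-building phase."""
    runs, chunk = [], []

    def spill():
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= max_in_memory:
            spill()
    if chunk:
        spill()

    # Stream-merge the sorted runs; only one value per run is resident.
    streams = [(int(line) for line in run) for run in runs]
    yield from heapq.merge(*streams)
    for run in runs:
        run.close()

data = [17, 3, 42, 8, 15, 4, 23, 16, 1, 99]
print(list(external_sort(data, max_in_memory=3)))
```

With `max_in_memory=3`, the ten values are written as four sorted runs and merged lazily, so the input can be much larger than the memory budget.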
Summary
Beating the large-scale data processing record set by a Hadoop MapReduce cluster is not only a testament to our work but also a validation of Spark's promise: at any data volume, Spark offers better performance and scalability. We also hope that Spark brings its users savings in both time and cost.
Original blog post: Spark Breaks Previous Large-Scale Sort Record (translation: Dongyang; editor: Zhonghao)