Over the past few years, adoption of Apache Spark has grown at a remarkable rate, often as a successor to MapReduce, with deployments ranging from a handful of nodes to clusters of thousands. With in-memory data processing, Apache Spark is much more efficient than MapReduce, but when the amount of data far exceeds memory, we have also heard from some organizations about problems they ran into using Spark. So, together with the Spark community, we have invested a lot of effort to improve Spark's stability, scalability, and performance. Since Spark works well on gigabytes or terabytes of data, it should do the same on petabytes.
To evaluate these efforts, we recently ran the Sort Benchmark (Daytona Gray category) on AWS, an industry benchmark that measures how fast a system can sort 100 TB of data (one trillion records). The previous world record for this benchmark was held by Yahoo, which sorted the data in 72 minutes using a 2100-node Hadoop MapReduce cluster. According to our test results, Spark cut the sorting time to 23 minutes using 206 EC2 nodes. This means that with one-tenth of the computing resources, Spark sorted the same data 3 times faster than MapReduce!
In addition, while there is no official petabyte-scale sort benchmark, we pushed Spark further to sort 1 PB of data (10 trillion records). The result: with 190 nodes, the workload completed in less than 4 hours, well ahead of Yahoo's previous 16-hour result using 3800 machines. As far as we know, this is also the first petabyte-scale sort ever run in a public cloud environment.
| | Hadoop World Record | Spark 100 TB | Spark 1 PB |
|---|---|---|---|
| Data Size | 102.5 TB | 100 TB | 1000 TB |
| Elapsed Time | 72 mins | 23 mins | 234 mins |
| # Nodes | 2100 | 206 | 190 |
| # Cores | 50400 | 6592 | 6080 |
| # Reducers | 10,000 | 29,000 | 250,000 |
| Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min |
| Rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min |
| Sort Benchmark Daytona Rules | Yes | Yes | No |
| Environment | dedicated data center | EC2 (i2.8xlarge) | EC2 (i2.8xlarge) |
Why choose sorting?
At the core of sorting is the shuffle operation, which moves data across all hosts in the cluster. Shuffle underpins essentially all distributed data processing workloads. For example, a SQL query joining two data sources uses a shuffle to move tuples that need to be joined onto the same host, and a collaborative filtering algorithm like ALS also relies on shuffle to send user and product ratings and weights across the network.
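To make the role of the shuffle concrete, here is a minimal, self-contained sketch (the dataset and names are entirely illustrative and not part of the benchmark) showing two operations whose distributed execution depends on a shuffle: a join that must co-locate tuples sharing a key, and a grouping of ratings by user, similar in spirit to what ALS does with ratings and weights.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations in pre-1.3 Spark

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))         // (userId, name)
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 12.00))) // (userId, amount)

    // join: records with the same userId must end up on the same host -> shuffle
    val joined = users.join(orders)

    // an ALS-style step also shuffles ratings so each user's ratings are grouped together
    val ratings = sc.parallelize(Seq((1, 5.0), (2, 3.0), (1, 4.0)))    // (userId, rating)
    val byUser  = ratings.groupByKey()

    println(joined.count() + " joined rows, " + byUser.count() + " users")
    sc.stop()
  }
}
```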
Most data pipelines start with a large amount of raw data, but as the pipeline progresses, the volume inevitably shrinks as irrelevant data is filtered out or intermediate results are condensed. A query over 100 TB of raw data may shuffle only a small fraction of that 100 TB across the network; this pattern is even reflected in the name MapReduce.
Sorting, however, is challenging precisely because the amount of data flowing through the pipeline does not decrease: sorting 100 TB of input means 100 TB of shuffle data must cross the network. Moreover, under the Daytona rules, both the input and the output must be replicated for fault tolerance. In practice, sorting 100 TB of data may generate about 500 TB of disk I/O and 200 TB of network I/O.
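As a rough sanity check on those figures, the following back-of-the-envelope breakdown (our own assumption, not an official accounting from the benchmark) shows one plausible way a Daytona-rules 100 TB sort adds up to roughly 500 TB of disk I/O and 200 TB of network I/O:

```scala
// Hypothetical I/O accounting for a Daytona-rules 100 TB sort, assuming the input
// is read once, shuffle data is written and read once, and the output is written
// with 2-way replication for fault tolerance (all figures approximate, in TB).
val readInput    = 100 // read the source data
val writeShuffle = 100 // map side writes sorted shuffle files to local disk
val readShuffle  = 100 // reduce side reads the shuffle files back
val writeOutput  = 200 // output written twice (replicated)
val diskIO    = readInput + writeShuffle + readShuffle + writeOutput        // = 500 TB
val networkIO = 100 /* shuffle transfer */ + 100 /* output replication */   // = 200 TB
```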
For these reasons, when we looked for ways to measure and improve Spark, sorting, one of the most demanding workloads, was the perfect yardstick.
The technical work behind these results
We have put a lot of effort into making Spark handle ultra-large workloads. In particular, three major pieces of work relate to this benchmark:
First and foremost, in Spark 1.1 we introduced a new shuffle implementation, the sort-based shuffle (SPARK-2045). Before that, Spark used a hash-based shuffle, which needs to keep P buffers in memory simultaneously, where P is the number of reduce partitions. With the sort-based shuffle, the system uses only one buffer at a time. This significantly reduces memory overhead and allows hundreds of thousands of tasks in a single shuffle (we used 250,000 reduce tasks in the PB sort, as shown in the table above).
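For readers who want to try it, a minimal sketch of opting into the sort-based shuffle in that era is shown below; in Spark 1.1 it was controlled by the spark.shuffle.manager property (it later became the default, so the exact default depends on your version).

```scala
import org.apache.spark.SparkConf

// A sketch of enabling the sort-based shuffle on Spark 1.1-era clusters.
val conf = new SparkConf()
  .setAppName("sort-shuffle-example")
  .set("spark.shuffle.manager", "sort") // use the sort-based shuffle instead of hash-based
```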
Second, we revamped Spark's network module to use a Netty-based transport with native epoll via JNI (SPARK-2468). The new module also manages its own pool of off-heap memory that bypasses the JVM's allocator, reducing the impact of garbage collection.
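As an illustration of how this could be switched on, here is a sketch using Spark 1.2-era configuration properties; treat the property names as assumptions based on that release rather than a definitive reference, since they may differ across versions.

```scala
import org.apache.spark.SparkConf

// A sketch of steering the shuffle onto the Netty-based transport with native
// epoll and direct (off-heap) buffers. Property names reflect the Spark 1.2-era
// configuration and may vary by version.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty") // Netty instead of the older NIO service
  .set("spark.shuffle.io.mode", "epoll")              // native epoll transport on Linux
  .set("spark.shuffle.io.preferDirectBufs", "true")   // pooled off-heap buffers, less GC pressure
```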
Last but not least, we created an external shuffle service (SPARK-3796) that is completely decoupled from Spark's executors. Built on the network module described above, it can keep serving shuffle files even while the executors themselves are busy with GC pauses.
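A minimal sketch of enabling it from the application side is shown below; note that the shuffle service process itself must also be running on each node (for example, inside the standalone Worker or the YARN NodeManager).

```scala
import org.apache.spark.SparkConf

// Ask executors to hand shuffle files over to the external shuffle service,
// so they can be served even while the executor JVM is paused for GC.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
```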
With these three changes, our Spark cluster sustained 3 GB/s of I/O throughput per node during the map phase and 1.1 GB/s per node during the reduce phase, saturating the 10 Gbps network bandwidth available between these machines.
More technical details
TimSort: in the Spark 1.1 release, we switched the default sorting algorithm from quicksort to TimSort, a hybrid of merge sort and insertion sort. On most real-world datasets, TimSort is more efficient than quicksort, and it is especially good at handling partially sorted data. We use TimSort in both the map phase and the reduce phase.
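As a purely illustrative aside (this is not Spark's internal sort code), the JDK's object-array sort is itself a TimSort, so a tiny example shows the kind of partially ordered input it handles with few comparisons:

```scala
// Illustrative only: java.util.Arrays.sort on object arrays is a TimSort,
// which exploits the two pre-existing ascending runs in this input.
val partiallySorted: Array[Integer] = Array[Integer](1, 3, 5, 7, 9, 2, 4, 6, 8, 10)
java.util.Arrays.sort(partiallySorted.asInstanceOf[Array[AnyRef]]) // natural ordering, TimSort
println(partiallySorted.mkString(", ")) // 1, 2, 3, ..., 10
```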
Exploiting cache locality: in the sort benchmark, each record is 100 bytes, and the sort key is the first 10 bytes. While profiling the sort program, we noticed that the cache hit rate was poor, because each comparison required a random object-pointer lookup. We therefore redesigned the in-memory layout to represent each record as a 16-byte entry (two longs in the JVM): the first 10 bytes hold the sort key and the last 4 bytes hold the record's position (this was tricky to get right because of endianness and signedness). This way, each comparison only touches a compact, contiguous array, avoiding random memory lookups.
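The sketch below is a simplified illustration of that idea, not the actual Spark implementation: it packs a 10-byte key prefix and a 4-byte record position into two longs, so records can be compared without dereferencing any object pointers.

```scala
// Simplified 16-byte layout: two longs per record, the first 10 bytes holding the
// sort-key prefix and the last 4 bytes the record's position in the data buffer.
final case class PackedRecord(word0: Long, word1: Long)

def pack(key: Array[Byte], position: Int): PackedRecord = {
  require(key.length >= 10)
  var w0 = 0L
  var i = 0
  while (i < 8) { w0 = (w0 << 8) | (key(i) & 0xffL); i += 1 }   // key bytes 0..7
  var w1 = ((key(8) & 0xffL) << 56) | ((key(9) & 0xffL) << 48)  // key bytes 8..9
  w1 |= (position & 0xffffffffL)                                 // low 32 bits: record position
  PackedRecord(w0, w1)
}

// Unsigned comparison of the key bytes only; getting endianness and signedness
// right is exactly the subtle part mentioned above.
def compare(a: PackedRecord, b: PackedRecord): Int = {
  val c0 = java.lang.Long.compareUnsigned(a.word0, b.word0)
  if (c0 != 0) c0
  else java.lang.Long.compareUnsigned(a.word1 >>> 48, b.word1 >>> 48)
}
```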
With TimSort and the new cache-friendly layout combined, the CPU time spent on sorting dropped by a factor of five.
Fault tolerance at large scale: at this scale, many problems surface. During this test we saw nodes lost due to network connectivity issues, the Linux kernel spinning in a loop, and nodes stalling because of memory defragmentation. Fortunately, Spark's fault-tolerance mechanisms held up very well and the job recovered smoothly.
The power of AWS: as described above, we used 206 i2.8xlarge instances to run this I/O-intensive test. Backed by SSDs, these instances deliver very high I/O throughput. We placed the instances in a VPC placement group with SR-IOV enhanced networking for high throughput (10 Gbps), low latency, and low jitter.
Can Spark only shine in memory?
This misconception has long surrounded Spark, especially among newcomers to the community. Yes, Spark is known for high-performance in-memory computing, but Spark was designed from the start as a general big data processing platform, whether the data lives in memory or on disk. Essentially all Spark operators perform extra work, such as spilling to disk, when the data cannot fit entirely in memory. In plain terms, Spark's operators are a superset of MapReduce.
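As a small illustration of that design (the paths and dataset here are hypothetical), caching can be told to fall back to disk, and a wide operation such as a sort spills to disk when a partition does not fit in memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._       // pair-RDD operations in pre-1.3 Spark
import org.apache.spark.storage.StorageLevel

object LargerThanMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("larger-than-memory"))

    val records = sc.textFile("hdfs:///data/huge-input")     // hypothetical input path
      .map(line => (line.take(10), line))                    // (sort key, record)
      .persist(StorageLevel.MEMORY_AND_DISK)                 // blocks spill to disk if needed

    records.sortByKey().saveAsTextFile("hdfs:///data/sorted") // external sort when necessary
    sc.stop()
  }
}
```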
As this test shows, Spark is capable of processing datasets many times larger than the cluster's aggregate memory.
Summary
Beating the large-scale data processing record set by a Hadoop MapReduce cluster is not only a testament to our work, but also a validation of Spark's promise: Spark offers better performance and scalability at any data size. At the same time, we hope Spark brings users savings in both time and cost.