According to the latest results from the Sort Benchmark, two systems tied for first place in the 2014 Daytona GraySort contest: Databricks' Spark and TritonSort from the University of California, San Diego. TritonSort is a multi-year academic project; it used 186 EC2 i2.8xlarge nodes to sort 100 TB of data in 1,378 seconds. Spark is a general-purpose, large-scale iterative computing tool used in production; it used 207 EC2 i2.8xlarge nodes to sort 100 TB of data in 1,406 seconds, as we described in detail in a previous article.
To better understand the contest and some of the current hot topics in the Spark community, the author interviewed Reynold Xin (@hashjoin) of Databricks. (PS: thanks to @Hui-hu for technical support.)
The original interview follows.
CSDN: What are the rules of this competition, and what is worth noting about them?
Reynold Xin: The competition was first proposed in the 1980s by Jim Gray (the Turing Award winner who made lasting contributions to the database field) to measure progress in computer software and hardware performance. It has several categories, the most challenging of which measures how quickly a system can sort a given amount of data.
Jim Gray's earliest rules required contestants to sort 100 MB of data; by 2001 this had grown to 1 TB. After Jim Gray was lost at sea in 2007, a committee took over the competition. In 2009, in his memory, the most challenging category was renamed Daytona GraySort and the data volume was raised to 100 TB. This category also has a number of strict rules: all sorted output must be replicated on different nodes so that the stored data can tolerate node failures, and the sorting system must support records of arbitrary length and data with an extremely skewed distribution.
The committee is rigorous and spends a month reviewing each system and its technical report. The detailed rules are on the official contest page: http://sortbenchmark.org/FAQ-2014.html
Competing systems typically come from large companies (Microsoft, Yahoo, and, in earlier years, Tandem and DEC) or academic institutions (UC Berkeley and UCSD, the University of California, San Diego). To improve performance, a number of competitors build special hardware and software systems specifically for this competition.
CSDN: What result did Spark achieve to win first place in the competition? How does it compare with the other contestants?
Reynold Xin: Our Spark-based system used 207 Amazon EC2 virtual machines to sort 100 TB of data in 23 minutes. Last year's champion, Hadoop, used 2,100 of Yahoo's in-house machines and took 72 minutes. By comparison, we used less than one-tenth of the machines and sorted three times as fast as the Hadoop record. Notably, this is the first time in the competition's history that a system running on a public cloud has won.
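The comparison above can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures quoted in the interview; the per-node figure additionally assumes that both runs sorted the same amount of data, which the interview does not state.

```python
# Figures as quoted in the interview (minutes are rounded).
spark_nodes, spark_minutes = 207, 23
hadoop_nodes, hadoop_minutes = 2100, 72

machine_ratio = spark_nodes / hadoop_nodes   # fraction of the Hadoop machine count
speedup = hadoop_minutes / spark_minutes     # how many times faster the sort finished

# ASSUMPTION: both runs sorted the same data volume, so per-node
# throughput scales as speedup divided by the machine ratio.
per_node_gain = speedup / machine_ratio

print(f"machines: {machine_ratio:.1%} of the Hadoop cluster")   # 9.9%
print(f"speed: {speedup:.1f}x faster")                          # 3.1x
print(f"per-node throughput: about {per_node_gain:.0f}x")       # about 32x
```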
The committee has told us that many systems enter every year, but because it announces only the winners, we do not know how many other contestants there were.
This year two systems tied: Databricks' Spark and UCSD's Themis both took about 23 minutes. Themis is a multi-year academic project that focuses on efficient data shuffling and sorting, and it sacrifices many features that a general-purpose system needs, such as fault tolerance. For Spark, a general-purpose system, tying with UCSD's Themis in a sorting contest is no small feat. One interesting detail: Professor George Porter, who leads the Themis team, also holds a Ph.D. from Berkeley, so in the end two Berkeley alumni tied.
CSDN: Which features allowed Spark to achieve such an excellent result? Could you analyze it from a technical point of view?
Reynold Xin: The result comes down mainly to three things: our early investment in Spark engineering, Spark's flexibility, and our team's experience in optimizing large-scale systems.
Since founding Databricks, we have increased our investment in Spark's engineering, and much of that effort has gone into improving shuffle performance. In sorting, the most important step is the shuffle, and three recent pieces of work did the most to improve it:
The first is sort-based shuffle. This feature greatly reduces the memory that shuffle occupies, so more memory is left for sorting. The second is the new Netty-based network module, which replaces the original NIO network module. The new module improves network transfer performance and manages its own memory outside the garbage collector (GC), which reduces GC frequency. The third is an external shuffle service that runs independently of the Spark executor. With it, other nodes can still fetch shuffle data through the service while an executor is in GC, so network transfer is not affected by executor GC.
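To make the first item more concrete, here is a minimal Python sketch of the idea behind a sort-based shuffle: each map task sorts its records by target partition and writes one contiguous output plus an index of offsets, instead of keeping one open buffer per reduce partition. This is not Spark's code. The names `partition_of`, `write_map_output`, and `read_partition` are hypothetical, the partitioner is a toy, and ordering by key inside each partition is a simplification chosen here for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


def partition_of(key: str, num_partitions: int) -> int:
    # Toy deterministic partitioner (hypothetical): byte sum modulo partition count.
    return sum(key.encode()) % num_partitions


def write_map_output(records: List[Record], num_partitions: int):
    """Sort by (partition, key); return the flat output and an offset index."""
    ordered = sorted(records, key=lambda r: (partition_of(r[0], num_partitions), r[0]))
    # index[p] .. index[p + 1] is the slice that belongs to partition p.
    index = [0] * (num_partitions + 1)
    for key, _ in ordered:
        index[partition_of(key, num_partitions) + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return ordered, index


def read_partition(data: List[Record], index: List[int], p: int) -> List[Record]:
    # A reducer needs only the index to locate its contiguous range.
    return data[index[p]:index[p + 1]]


records = [("pear", "1"), ("apple", "2"), ("fig", "3"), ("kiwi", "4")]
data, index = write_map_output(records, num_partitions=2)
for p in range(2):
    print(p, read_partition(data, index, p))
```

Because the output is a single sorted run, the map side holds one buffer rather than one per reducer, which is the memory saving the answer refers to.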
In the past, much system software has been unable to push hardware to its limits, sometimes using less than 10% of its capacity. In this contest our system saturated 3 GB/s of disk bandwidth per node during the map phase, the limit of the eight SSDs on these virtual machines, and reached 1.1 GB/s of network throughput during the reduce phase, close to the physical limit.
It took us less than three weeks to prepare for the contest. That is closely tied to the flexibility of Spark's architecture, which let us quickly implement new algorithms and optimizations.
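A quick back-of-the-envelope check of these hardware figures, reading both as per-node values in gigabytes per second. The 10 Gbit/s link speed is an assumption added for this sketch; the interview says only that the network was close to its physical limit.

```python
disk_gb_s = 3.0        # disk throughput during the map phase (from the interview)
ssds_per_node = 8      # SSDs per virtual machine (from the interview)
network_gb_s = 1.1     # network throughput during the reduce phase (from the interview)
link_gbit_s = 10.0     # ASSUMPTION: a 10 Gbit/s network link per node

per_ssd_mb_s = disk_gb_s * 1000 / ssds_per_node   # average load on each SSD
network_gbit_s = network_gb_s * 8                 # bytes/s converted to bits/s
link_utilization = network_gbit_s / link_gbit_s

print(f"per SSD: {per_ssd_mb_s:.0f} MB/s")                                     # 375 MB/s
print(f"network: {network_gbit_s:.1f} Gbit/s, {link_utilization:.0%} of link")  # 8.8 Gbit/s, 88%
```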
CSDN: On SQL support. SQL on Spark is a long-running topic. Shark was recently discontinued and the Spark SQL project was started; can you explain why? Also, what is the plan for Spark SQL? How complete is its SQL support today? When will it meet the SQL-92 standard or later?
Reynold Xin: Shark depended too heavily on Hive, and Hive's own design is poor, with a lot of legacy code, which made it very slow to add new features to Shark. In the middle of last year, Michael Armbrust (the main designer of Spark SQL) was at Google designing the next-generation query optimizer for F1. He had a new design idea at the time (Catalyst). After talking with him, we felt that the new architecture built on 30 years of academic and industrial research, combined with his own fresh interpretation, and that it was more flexible than traditional architectures and had major architectural advantages. It took us several months to finally convince Michael to join Databricks and start developing Spark SQL.
Spark SQL is now probably the largest open source big data SQL project. Although it was open-sourced less than half a year ago, it already has nearly 100 code contributors. Like Spark itself, the Spark SQL architecture is flexible enough to let the open source community iterate quickly and contribute new features. Many SQL-92 features are of interest to contributors in the community and should be implemented soon.
CSDN: On computation. When Spark runs, an application's intermediate results pass through disk, which is bound to hurt performance. Haoyuan Li's Tachyon can be decoupled from Spark, works well with the HDFS file system, and greatly improves performance without changing how users work. It is now backed by companies such as Intel and EMC and is developing well in the Spark ecosystem. What is Databricks' plan here? To offer more native support, or to improve Spark itself?
Reynold Xin: In most cases Spark passes intermediate results directly from the upstream operator to the downstream operator without going through disk. The intermediate results of a shuffle are saved to disk, but with our shuffle optimizations the disk itself is not a bottleneck. The contest also confirmed that the real bottleneck in shuffle is the network, not the disk.
Tachyon confirms a general trend: storage systems should make better use of memory. I predict that more and more storage systems will be designed with this in mind. The Spark project's principle is to make good use of the underlying storage system, so we will support this as well.
It is worth noting that putting shuffle data into Tachyon or the HDFS cache (a new HDFS feature) is not a good optimization. Each block of shuffle data is very small, and there is a great deal of metadata. Writing shuffle data directly to a distributed storage system such as Tachyon or HDFS would most likely overwhelm that system's metadata store and end up degrading performance.
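The scale of the metadata problem can be estimated with a small sketch. In a shuffle, every map task produces one block for every reduce task, so the block count is the product of the two. The task counts below are hypothetical, chosen only for illustration; they are not the contest configuration.

```python
def shuffle_block_stats(total_bytes: int, map_tasks: int, reduce_tasks: int):
    """Return (number of shuffle blocks, average block size in bytes)."""
    blocks = map_tasks * reduce_tasks   # one block per (map task, reduce task) pair
    return blocks, total_bytes / blocks


TB = 10**12
# Hypothetical task counts for a 100 TB shuffle.
blocks, avg_bytes = shuffle_block_stats(100 * TB, map_tasks=20_000, reduce_tasks=20_000)

print(f"{blocks:,} blocks")                        # 400,000,000 blocks
print(f"{avg_bytes / 1000:.0f} KB per block")      # 250 KB per block
```

Under these assumed numbers, a distributed file system would have to track hundreds of millions of tiny entries for a single job, which illustrates why its metadata store could become the bottleneck.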
CSDN: On algorithms. The core of big data lies in data modeling and data mining, so for people who work on algorithms, support for R and similar languages is essential. As far as I know, the current Spark 1.1 release does not include SparkR. What is the roadmap?
Reynold Xin: SparkR is an important step in bringing the Spark ecosystem to traditional data scientists. A few months ago Databricks and Alteryx announced a partnership to develop SparkR. The project is not part of Spark itself mainly because of licensing. R's license conflicts with Apache 2.0, so in the short term SparkR will probably remain an independent project.
CSDN: On data warehouse interoperability. We have talked about computing on data, so where does the computed data go? Which data warehouses do you commonly see users using in your work? Cassandra, or something else? Which data warehouses and NoSQL stores is Spark most optimistic about? Are there plans to integrate with data warehouses and provide more native support, and what is the trend here?
Reynold Xin: As with storage systems, Spark should not restrict which databases users use. Spark's design lets it easily support different storage formats and storage systems. We would like to have native support for several of the most popular databases, such as Cassandra.
In Spark 1.2 we will open up a new storage interface (API) that lets external storage systems and databases plug easily into Spark SQL's SchemaRDD. The query optimizer can even push some filters down to a database that implements this interface, making the most of the database's own filtering to reduce network transfer.
Some of our built-in storage formats and systems (e.g., JSON, Avro) have already moved to this new interface. Spark 1.2 has not been released yet, but many community members have started implementing it for different databases. I expect that most databases will eventually integrate with Spark SQL through this interface, making Spark SQL a unified query layer that can even use data from several different databases in a single query.
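The filter pushdown mentioned in this answer can be illustrated with a small, self-contained sketch. This is not the Spark 1.2 interface, which is a Scala API. Every name here (`GreaterThan`, `InMemorySource`, `scan`, `query`) is hypothetical, and an in-memory list stands in for an external database. The point is only to show why letting the source apply the filter reduces the number of rows sent to the query engine.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class GreaterThan:
    """A simple filter the engine can hand to a data source (hypothetical)."""
    column: str
    value: Any

    def matches(self, row: Row) -> bool:
        return row[self.column] > self.value


class InMemorySource:
    """Stand-in for an external database; counts the rows it ships out."""

    def __init__(self, rows: List[Row]):
        self.rows = list(rows)
        self.rows_sent = 0

    def scan(self, columns: List[str], filters: List[GreaterThan]) -> Iterator[Row]:
        for row in self.rows:
            if all(f.matches(row) for f in filters):
                self.rows_sent += 1
                yield {c: row[c] for c in columns}


def query(source: InMemorySource, columns, filters, pushdown: bool) -> List[Row]:
    if pushdown:
        # The source filters locally and ships only matching rows.
        return list(source.scan(columns, filters))
    # Without pushdown, fetch every row and filter on the engine side.
    needed = sorted(set(columns) | {f.column for f in filters})
    fetched = source.scan(needed, [])
    return [{c: r[c] for c in columns}
            for r in fetched if all(f.matches(r) for f in filters)]


rows = [{"id": i, "age": 20 + i} for i in range(10)]
pushed, unpushed = InMemorySource(rows), InMemorySource(rows)
flt = [GreaterThan("age", 26)]
with_pushdown = query(pushed, ["id"], flt, pushdown=True)
without_pushdown = query(unpushed, ["id"], flt, pushdown=False)
print(pushed.rows_sent, unpushed.rows_sent)   # 3 10
```

Both calls return the same three rows, but the source that receives the filter ships 3 rows instead of 10.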