Transfer from infoq!
According to the O ' Reilly 2016 Data Science Payroll survey, SQL is the most widely used language in the field of data science. Most projects require some SQL operations, and even some require only SQL.
This article covers 6 open source leaders: Hive, Impala, Spark SQL, Drill, Hawq, and presto, plus calcite, Kylin, Phoenix, Tajo, and Trafodion. and 2 commercially available options Oracle Big Data SQL and IBM Big SQL,IBM have not renamed the latter to "Watson SQL".
(Some readers asked: Druid?) My answer is: After checking, I agree that Druid belongs to this category . )
using the word SQL engine is a bit arbitrary. For example, hive is not an engine, its framework uses MapReduce, TeZ, or spark engine to execute queries, and it does not run SQL, but HIVEQL, a SQL-like language, very close to SQL. "Sql-in-hadoop" also does not apply, although hive and Impala use Hadoop primarily, but Spark, Drill, HAWQ, and Presto can also be used with a variety of other data storage systems.
Unlike relational databases, the SQL engine is independent of the data storage System. In contrast, relational databases bind the query engine and storage to a single, tightly coupled system, which allows certain types of optimizations. Splitting them, on the other hand, provides greater flexibility, despite potential performance losses.
Figure 1 below shows the prevalence of the main SQL engine, which is provided by Db-engines, the Austrian consulting firm, solid IT maintenance. Db-engines calculates a popular score for more than 200 database systems per month. Scores reflect search engine queries, online discussion mentions, job offers, professional qualification mentions, and tweets.
Source: db-engines,2017 Year January http://db-engines.com/en/ranking
Although Impala, Spark SQL, Drill, Hawq, and Presto have been beating hive on performance, concurrency, and throughput, hive is still the most popular (at least according to the Db-engines standard). There are 3 reasons:
- Hive is the default SQL option for Hadoop and is supported for each version. While others require specific suppliers and suitable users;
- Hive is already reducing the performance gap with other engines. Most of the replacements for Hive were introduced in 2012, and analysts waited for hive queries to be completed until they committed suicide. However, when Impala, Spark, Drill and other big strides, hive just followed, slowly improve. Now, although hive is not the quickest choice, it is much better than it was five years ago;
- While the speed of the frontier is cool, most organizations know that the world is not an end. Even a young marketing manager will have to wait 10 seconds to find out whether the Duxbury restaurant's chicken wings sold more than the beef burger in Tuesday.
As you can see in Figure 2 below, users are more interested in top-notch SQL engines than with leading commercial data warehousing applications.
Source: db-engines,2017 Year January http://db-engines.com/en/ranking
For open source projects, the best health metric is the size of its active developer community. As shown in Figure 3 below,Hive and Presto have the largest contributor base . (Spark SQL data is not there)
Source: Open Hub https://www.openhub.net/
In 2016, Cloudera, Hortonworks, Kognitio and Teradata were caught up in the benchmark battle that Tony Baer summed up, and it was shocking that the vendor-favored SQL engine defeated other options in every study, This poses a question: does benchmarking make sense?
Atscale two times a year benchmark testing is not unfounded. As a bi startup, Atscale sells software that connects the BI front-end and SQL backend. The company's software is engine-neutral, it tries to be as compatible as possible, and its extensive experience in the BI field gives these tests practical significance.
Atscale Recent key findings include Hive, Impala, Spark SQL, and Presto:
- 4 engines have successfully run Atscale's BI benchmark query;
- Each engine has its own performance benefits, depending on the amount of data, query complexity, and number of concurrent users:
- Impala and Spark SQL beat other people on small data volumes of queries;
- Impala and Spark SQL defeated others on a complex join of large data volumes;
- Impala and Presto perform better on concurrent testing.
- Compared to benchmark tests 6 months ago, all engines have a 2-4-fold performance boost.
Alex Woodie reported the test results, and Andrew Oliver analyzed them.
Let's take a closer look at these projects.
Apache Hive
in 2016, Hive had a contributor of more than 100 people . The team released Hive 2.0 in February and released Hive 2.1 in June. Improvements to Hive 2.0 include several improvements to Hive-on-spark, as well as performance, availability, support, and stability enhancements. Hive 2.1 includes the hive Llap ("Live Long and Process"), which combines a persistent query server with an optimized memory cache for high performance. The team claimed to have increased 25 times times.
Apache Impala
In 2012, Cloudera launched Impala, an open-source MPP SQL engine, as a high-performance alternative to hive. Impala uses HDFs and HBase, and uses hive metadata. However, it bypassed the run query using MapReduce.
。。。
In September, Cloudera released a study that compared the redshift Columnstore database for Impala and Amazon Web Services. The report is interesting to read, although the subject always needs to be aware of supplier benchmarks.
Spark SQL
。。。
Apache Drill
In 2012, a team led by MAPR, one of the leaders of the Hadoop distributor, proposed to build an open source version of Google Dremel, an interactive distributed hotspot analysis System.
In 2015, two key drill contributors left MapR and launched the Dremio, which has not yet been released.
Apache HAWQ
。。。
Presto
Facebook Engineers launched the Presto project in 2012 as a replacement for Hive's fast interaction. when it was launched in 2013, it successfully supported more than 1000 Facebook users and more than 30,000 petabytes of data per day. Facebook open source Presto in 2013.
Presto supports ANSI SQL queries for multiple data sources, including Hive, Cassandra, relational databases, and proprietary file systems (such as S3 for Amazon Web Service). Presto queries can federate multiple data sources. Users can submit queries through C, Java, node. js, PHP, Python, R, and Ruby.
。。。
Other projects
Pass
SQL data Analysis Overview--hive, Impala, Spark SQL, Drill, HAWQ, and Presto+druid