According to the O ' Reilly 2016 Data Science Payroll survey, SQL is the most widely used language in the field of data science. Most projects require some SQL operations, and even some require only SQL. This article will take you to learn about these mainstream open source SQL engines! Background introduction
This article covers 6 open source leaders: Hive, Impala, Spark SQL, Drill, Hawq, and presto, plus calcite, Kylin, Phoenix, Tajo, and Trafodion. and 2 commercially available options Oracle Big Data SQL and IBM Big SQL,IBM have not renamed the latter to "Watson SQL".
(Some readers asked: Druid?) My answer is: After checking, I agree that Druid belongs to this category. ）
Using the word SQL engine is a bit arbitrary. For example, hive is not an engine, its framework uses MapReduce, TeZ, or spark engine to execute queries, and it does not run SQL, but HIVEQL, a SQL-like language, very close to SQL. "Sql-in-hadoop" also does not apply, although hive and Impala use Hadoop primarily, but Spark, Drill, HAWQ, and Presto can also be used with a variety of other data storage systems.
Unlike relational databases, the SQL engine is independent of the data storage System. In contrast, relational databases bind the query engine and storage to a single, tightly coupled system, which allows certain types of optimizations. Splitting them, on the other hand, provides greater flexibility, despite potential performance losses.
Figure 1 below shows the prevalence of the main SQL engine, which is provided by Db-engines, the Austrian consulting firm, solid IT maintenance. Db-engines calculates a popular score for more than 200 database systems per month. Scores reflect search engine queries, online discussion mentions, job offers, professional qualification mentions, and tweets.
Source: db-engines,2017 Year January http://db-engines.com/en/ranking
Although Impala, Spark SQL, Drill, Hawq, and Presto have been beating hive on performance, concurrency, and throughput, hive is still the most popular (at least according to the Db-engines standard). There are 3 reasons:
Hive is the default SQL option for Hadoop and is supported for each version. While others require specific suppliers and suitable users;
Hive is already reducing the performance gap with other engines. Most of the replacements for Hive were introduced in 2012, and analysts waited for hive queries to be completed until they committed suicide. However, when Impala, Spark, Drill and other big strides, hive just followed, slowly improve. Now, although hive is not the quickest choice, it is much better than it was five years ago;
While the speed of the frontier is cool, most organizations know that the world is not an end. Even a young marketing manager will have to wait 10 seconds to find out whether the Duxbury restaurant's chicken wings sold more than the beef burger in Tuesday.
As you can see in Figure 2 below, users are more interested in top-notch SQL engines than with leading commercial data warehousing applications.
Source: db-engines,2017 Year January http://db-engines.com/en/ranking
For open source projects, the best health metric is the size of its active developer community. As shown in Figure 3 below, Hive and Presto have the largest contributor base. (Spark SQL data is not there)
Source: Open Hub https://www.openhub.net/
In 2016, Cloudera, Hortonworks, Kognitio and Teradata were caught up in the benchmark battle that Tony Baer summed up, and it was shocking that the vendor-favored SQL engine defeated other options in every study, This poses a question: does benchmarking make sense?
Atscale two times a year benchmark testing is not unfounded. As a bi startup, Atscale sells software that connects the BI front-end and SQL backend. The company's software is engine-neutral, it tries to be as compatible as possible, and its extensive experience in the BI field gives these tests practical significance.
Atscale Recent key findings include Hive, Impala, Spark SQL, and Presto:
4 engines have successfully run Atscale's BI benchmark query;
Each engine has its own performance benefits, depending on the amount of data, query complexity, and number of concurrent users:
Impala and Spark SQL beat other people on small data volumes of queries;
Impala and Spark SQL defeated others on a complex join of large data volumes;
Impala and Presto perform better on concurrent testing.
Compared to benchmark tests 6 months ago, all engines have a 2-4-fold performance boost.
Alex Woodie reported the test results, and Andrew Oliver analyzed them.
Let's take a closer look at these projects.
Apache Hive is the first SQL framework in the Hadoop ecosystem. Facebook engineers introduced hive in 2007 and donated the code to the Apache Software Foundation in 2008. In September 2010, Hive graduated to become Apache's top project. Each major contributor in the Hadoop Ecosystem publishes and supports hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon WEB Services provides a modified version of hive as a cloud service in elastic MapReduce (EMR).
Earlier releases of Hive ran queries using MapReduce. Complex queries need to pass data multiple times, which can degrade performance. Therefore, hive is not suitable for interactive analysis. Stinger, led by Hortonworks, has significantly improved hive performance, especially through the use of Apache Tez, an application framework for streamlining MapReduce code. Tez and Orcfile, a new storage format, have significantly increased the speed of hive queries.
The Cloudera lab led a parallel project to redesign the back end of Hive to run on Apache Spark. After a long period of testing, Cloudera released the official version of Hive-on-spark in early 2016.
In 2016, Hive had a contributor of more than 100 people. The team released Hive 2.0 in February and released Hive 2.1 in June. Improvements to Hive 2.0 include several improvements to Hive-on-spark, as well as performance, availability, support, and stability enhancements. Hive 2.1 includes the Hive Llap ("Live Long and Process"), which combines a persistent query server with an optimized memory cache for high performance. The team claimed to have increased 25 times times.
In September, the Hivemall project entered the Apache incubator, as I pointed out in the second part of my machine learning annual summary. Originally developed and donated by Treasure Data to the Apache Software Foundation, Hivemall is an extensible machine learning library, implemented through a series of hive UDFs designed to run MapReduce on Hive, Pig, and Spark SQL. The team plans to release the first version in the first quarter of 2017.
In 2012, Cloudera launched Impala, an open-source MPP SQL engine, as a high-performance alternative to hive. Impala uses HDFs and HBase, and uses hive metadata. However, it bypassed the run query using MapReduce.
Cloudera's chief strategy officer, Mike Olson, said at the end of 2013 that Hive's architecture was fundamentally flawed. In his view, developers can only use a completely new way to achieve high-performance SQL, such as Impala. In the January, May and September of 2014, Cloudera released a series of benchmark tests. In these tests, Impala demonstrated its gradual improvement in query operation and was significantly better than Tez-based hive, Spark SQL, and Presto. In addition to running fast, Impala also has outstanding performance in distribution, throughput, and scalability. In 2015, Cloudera donated Impala to the Apache Software Foundation and entered the Apache incubation program. Cloudera, MapR, Oracle, and Amazon Web Services distribute Impala,cloudera, MapR, and Oracle for commercial build and installation support.
Impala has developed steadily in the Apache incubator in 2016. The team cleaned up the code, migrated it to the Apache infrastructure, and released its first Apache version of 2.7.0 in October. The new version includes performance gains and scalability improvements, as well as some other minor enhancements.
In September, Cloudera released a study that compared the redshift Columnstore database for Impala and Amazon Web Services. The report is interesting to read, although the subject always needs to be aware of supplier benchmarks.
Spark SQL is the component that spark uses for structured data processing. The Apache Spark team released Spark SQL in 2014 and absorbed an early Hive-on-spark project called Shark. It quickly became the most widely used spark module.
Spark SQL users can run SQL queries, read data from hive, or use it to create spark datasets and DataFrame (a DataSet is a distributed collection of data, DataFrame is a uniformly named DataSet column). The Spark SQL interface provides spark with information about the data structure and execution of the operation, and the Catalyst optimizer of Spark uses this information to construct an efficient query.
In 2015, Spark's machine learning developers introduced the ML API, a package that uses spark DataFrame instead of the low-level spark RDD API. This approach proved to be attractive and fruitful; in 2016, with the release of 2.0, the Spark team changed the RDD-based API to maintenance mode. The DataFrame API is now the primary interface for Spark machine learning.
In addition, in 2016, the team also released structured streaming in the alpha version of Spark 2.1.0. Structured streaming is a flow-processing engine built on spark SQL. Users can query streaming data sources in the same way as static sources, and can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates the results when streaming data arrives. Structured streams provide a one-time fault-tolerant guarantee through checkpoints and pre-write logs.
In 2012, a team led by MAPR, one of the leaders of the Hadoop distributor, proposed to build an open source version of Google Dremel, an interactive distributed hotspot analysis System. They named it Apache Drill. Drill was snubbed in the Apache incubator for more than two years and eventually graduated at the end of 2014. The team released the 1.0 in 2015.
MapR distributes and supports Apache Drill.
In 2016, more than 50 people made a contribution to drill. The team released 5 small editions in 2016, with key enhancements including:
In 2015, two key drill contributors left MAPR and launched the Dremio, which has not yet been released.
Pivotal Software launched a commercially licensed high-performance SQL engine HAWQ in 2012, and has made little success in trying to market. After changing the strategy, Pivotal donated the project to Apache in June 2015 and entered the Apache incubator program in September 2015.
15 months later, HAWQ still stays in the incubator. In December 2016, the team released Hawq 188.8.131.52, adding some bug fixes. I guess it will graduate in 2017.
One of the favorite little points for HAWQ is that it supports Apache Madlib, a SQL machine learning project in the same incubator. The combination of HAWQ and Madlib should be a good consolation to the people who bought the Greenplum and wanted to know what had happened.
Facebook engineers launched the Presto project in 2012 as a replacement for Hive's fast interaction. When it was launched in 2013, it successfully supported more than 1000 Facebook users and more than 30,000 petabytes of data per day. Facebook open source Presto in 2013.
Presto supports ANSI SQL queries for multiple data sources, including Hive, Cassandra, relational databases, and proprietary file systems (such as S3 for Amazon Web Service). Presto queries can federate multiple data sources. Users can submit queries through C, Java, node. js, PHP, Python, R, and Ruby.
Airpal is a Web-based query tool developed by Airbnb that allows users to submit queries to Presto via a browser. The Qubole bit Presto provides management services. AWS provides Presto services on EMR.
In June 2015, Teradata announced plans to develop and support the project. Based on the announced three-phase plan, Teredata proposes to integrate Presto into the Hadoop ecosystem, to operate in yarn, and to enhance connectivity through ODBC and JDBC. Teredata offers its own Presto release, with a data sheet attached. In June 2016, Teradata announced the results of information Builders, Looker, Qlik, Tableau and Zoomdata, as well as ongoing microstrategy and Microsoft Power BI.
Presto is a very active project that has a huge and vibrant community of contributors. The team was released faster than Miki Sudo eating hot dogs-I counted 42 releases in 2016. Teradata doesn't want to summarize anything new, and I'm not going to sift through 42 release notes, so let's just say it's better.
Other Apache projects
There are also 5 additional SQL hybrid projects for the Apache ecosystem.
Apache Calcite is an open-source database building framework. It includes:
SQL parser, authenticator, and JDBC driver
Query optimization tools, including relational algebra APIs, rule-based scheduler, and cost-based query optimizer
Apache Hive uses calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.
The calcite team launched 5 releases in 2016, including bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.
Apache Kylin is an OLAP engine with a SQL interface. Developed by ebay and donated to Apache,kylin in 2015 to become a top-notch project.
Founded in 2016, Kyligence offers business support and a data warehousing product called KAP, although it is not listed on Crunchbase, with sources saying it has a strong background and a large office in Shanghai.
Apache Phoenix is a SQL framework running on HBase, bypassing MapReduce. Salesforce developed the software and donated it to Apache in 2013. May 2014 Project graduation becomes the top project. The project is included in the Hortonworks Hortonworks data platform. Since the leading SQL engine has been adapted to HBase, I don't know why we still need Phoenix.
Apache Tajo is a fast SQL Data Warehouse framework launched by Gruter in 2011, a big data infrastructure company, and donated to Apache in 2013. 2014 Tajo graduated as top-notch project. In addition to South Korea, the main market for Gruter, the project rarely attracts the interest of prospective users and contributors. Apart from Gartner's Nick Heudecker, the project is not on anyone's workbench.
Apache Trafodion is another Sql-on-hbase project, conceived by the HP Lab, that tells you almost everything you need to know. June 2014 HP released Trafodion, one months later, Apache Phoenix graduate production. 6 months later, HP executives said it had limited commercial potential relative to another sql-on-hbase engine, so they donated the project to Apache, which entered the incubator in May 2015.
If hatching ends, trafodion promises to become a transactional database. Unfortunately, there are plenty of options in this area, and the only competitive advantage of the development team seems to be "it's open source, so it's cheap".
6 major open Source SQL engine Summary, who is far ahead?