The operating language of data analysis is SQL, so many tools have been developed to bring SQL to Hadoop. Some of these tools are simply packaged on top of MapReduce, others implement a complete data warehouse on top of HDFS, and still others fall somewhere in between. There are many such tools; Matthew Rathbone, a software development engineer at Shoutlet, recently published an article that lists the common ones and analyzes the use cases and future of each.
Apache Hive
Hive is the original SQL-on-Hadoop solution. It is an open-source Java project that translates SQL into a series of MapReduce tasks that run on standard Hadoop TaskTrackers. Hive uses a metastore (itself a database) to store table schemas, partitions, and locations, aiming to provide MySQL-like functionality. It supports most of the MySQL syntax and uses a similar database/table/view convention to organize datasets. Hive provides the following features:
- HiveQL, a SQL-like query interface
- A command-line client
- Metadata sharing through a central service
- A JDBC driver
- Multi-language Apache Thrift drivers
- A Java API for creating custom functions and transformations

When to use it?
Hive is installed on almost every Hadoop cluster. Hive environments are easy to set up and do not require much infrastructure. Given the low cost of using it, there is little reason to avoid it.
However, note that Hive's query performance is usually low, because it translates SQL into MapReduce tasks, which are slow to run.
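For illustration, a minimal HiveQL sketch (the table, columns, and path are hypothetical): it defines an external table over files already sitting in HDFS, and the aggregation below is the kind of query Hive compiles into MapReduce jobs.

    -- Hypothetical external table over log files already stored in HDFS
    CREATE EXTERNAL TABLE page_views (
      user_id BIGINT,
      url     STRING,
      ts      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    -- This aggregation is compiled into one or more MapReduce jobs
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;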
The future of Hive
Hortonworks is currently promoting the development of Apache Tez as a new Hive backend, to address the slow response times caused by relying on MapReduce.
Cloudera Impala
Impala is an open-source "interactive" SQL query engine for Hadoop. It is built by Cloudera, one of the largest Hadoop vendors in the market today. Like Hive, Impala provides a way to write SQL queries against your existing Hadoop data. Unlike Hive, it does not use MapReduce to execute queries; instead it uses its own set of execution daemons, which must be installed alongside the Hadoop DataNodes. Impala provides the following features:
- ANSI-92 SQL syntax support
- HiveQL support
- A command-line client
- An ODBC driver
- Interoperation with the Hive Metastore for cross-platform schema sharing
- A C++ API for creating functions and transformations

When to use it?
Impala is designed to complement Apache Hive, so if you need faster data access than Hive provides, it may be a good choice, especially if you run a Cloudera, MapR, or Amazon Hadoop cluster. However, to get the most out of Impala you need to store your data in a particular file format (Parquet), and that conversion can be painful. In addition, you need to install the Impala daemons on the cluster, which means they take up a portion of the TaskTrackers' resources. Impala does not currently support YARN.
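As a rough sketch of the Parquet conversion step mentioned above (table names are hypothetical), Impala can materialize an existing table into a Parquet-backed one and then query it directly:

    -- Hypothetical: copy an existing table into Parquet for faster scans
    CREATE TABLE page_views_parquet
    STORED AS PARQUET
    AS SELECT * FROM page_views;

    -- Queries against the Parquet table run on Impala's daemons, not MapReduce
    SELECT url, COUNT(*) AS views
    FROM page_views_parquet
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;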
The future of Impala
Cloudera has started working on integrating Impala with YARN, which should make Impala development on next-generation Hadoop clusters less painful.
Presto
Presto is an open-source "interactive" SQL query engine written in Java. It is built by Facebook, the original creator of Hive. Presto's approach is similar to Impala's: it provides an interactive experience while still working against existing datasets stored in Hadoop. Like Impala, it also needs to be installed on many "nodes". Presto provides the following features:
- ANSI-SQL syntax support (possibly ANSI-92)
- A JDBC driver
- A collection of "connectors" for reading data from existing data sources; connectors include HDFS, Hive, and Cassandra
- Interaction with the Hive Metastore for schema sharing

When to use it?
Presto's goals are the same as Cloudera Impala's. Unlike Impala, however, it is not backed by a major vendor, so unfortunately you get no enterprise support when you use Presto. But some well-known, respected technology companies are already using it in production, so presumably there is some community support. Like Impala, its performance depends on a particular data storage format (RCFile). To be honest, before deploying Presto you should think carefully about whether you have the capacity to support and debug it yourself; if you are comfortable with that and trust that Facebook will not abandon the open-source version of Presto, go ahead and use it.
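To illustrate the connector model (the catalog, schema, and table names here are hypothetical), Presto addresses tables as catalog.schema.table, so a single query can span data sources:

    -- Hypothetical: join a Hive table with a Cassandra table in one query
    SELECT h.user_id, c.country
    FROM hive.web.page_views h
    JOIN cassandra.profiles.users c
      ON h.user_id = c.user_id
    LIMIT 100;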
Shark
Shark is an open-source SQL query engine written in Scala at UC Berkeley. Like Impala and Presto, it is designed to complement Hive, executing queries on its own set of worker nodes instead of using MapReduce. Unlike Impala and Presto, Shark is built on top of the existing Apache Spark data processing engine. Spark is very popular right now and its community is growing. Spark can be viewed as a faster alternative to MapReduce. Shark provides the following features:
- A SQL-like query language supporting most of HiveQL
- A command-line client (essentially the Hive client)
- Interaction with the Hive Metastore for schema sharing
- Support for existing Hive extensions such as UDFs and SerDes

When to use it?
Shark is interesting because it aims to support the full Hive feature set while also working to improve performance. Many organizations use Spark now, but it is unclear how many use Shark. I don't think Shark will catch up with Presto and Impala, but if you are already planning to use Spark, then try Shark as well, especially since Spark is being supported by a growing number of major vendors.
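A small sketch of how Shark leverages Spark (table names hypothetical): Shark accepts ordinary HiveQL, and a table created with a _cached suffix is reportedly kept in Spark's cluster memory for fast repeated queries.

    -- Hypothetical: the _cached suffix asks Shark to keep this table in memory
    CREATE TABLE page_views_cached AS
    SELECT * FROM page_views;

    -- Subsequent queries hit the in-memory copy instead of disk
    SELECT url, COUNT(*) AS views
    FROM page_views_cached
    GROUP BY url;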
Apache Drill
Apache Drill is an open-source "interactive" SQL query engine for Hadoop. Drill is now driven by MapR, even though MapR also supports Impala. Apache Drill's goals are similar to those of Impala and Presto: fast interactive querying of large datasets, requiring worker nodes (drillbits) to be installed. The difference is that Drill is designed to support multiple back-end stores (HDFS, HBase, MongoDB), with a particular focus on complex, nested datasets (such as JSON). Unfortunately, Drill is still only in the alpha phase, so it is not yet widely used. Drill provides the following features:
- ANSI SQL compatibility
- Interaction with several back-end stores and metadata stores (Hive, HBase, MongoDB)
- A UDF extension framework and storage plug-ins

When to use it?
It's best not to. The project is still in the alpha phase, so do not use it in a production environment.
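That said, for a sense of the nested-data focus described above (the file path and field names are hypothetical), Drill can query raw JSON files in place and reach into nested fields:

    -- Hypothetical: query a JSON file directly, drilling into nested fields
    SELECT t.person.name AS name, t.person.address.city AS city
    FROM dfs.`/data/people.json` t
    WHERE t.person.age > 30;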
HAWQ
HAWQ is a closed-source product from EMC Pivotal, provided as part of the company's proprietary Hadoop distribution, "Pivotal HD". Pivotal claims that HAWQ is "the world's fastest SQL engine on Hadoop" and that it has been in development for ten years, but that claim is hard to verify. It is hard to know exactly what features HAWQ actually offers, but the following can be gathered:
- Full SQL syntax support
- Interoperability with Hive and HBase via the Pivotal Xtension Framework (PXF)
- Interoperability with Pivotal GemFire XD (an in-memory real-time database)

When to use it?
If you use the Hadoop distribution provided by Pivotal, use it; otherwise, don't.
BigSQL
Big Blue (IBM) has its own Hadoop distribution, called Big Insights, and BigSQL is provided as part of it. BigSQL queries data stored in HDFS using both MapReduce and other methods that can provide low-latency results (details unknown). Judging from BigSQL's documentation, it probably provides the following features:
- JDBC and ODBC drivers
- Extensive SQL support
- Possibly a command-line client

When to use it?
If you are an IBM customer, use it; otherwise, don't.
Apache Phoenix
Apache Phoenix is an open-source SQL engine for Apache HBase. Its goal is to provide low-latency queries over data stored in HBase via an embedded JDBC driver. Unlike the other engines described above, Phoenix provides both read and write operations on HBase data. Its features include:
- A JDBC driver
- A command-line client
- The ability to bulk-load data
- The ability to create new tables or map to existing HBase data

When to use it?
If you use HBase, use it. Although Hive can read data from HBase, Phoenix also provides write support. It is unclear whether it is suitable for production transactional workloads, but as an analysis tool its capabilities are undoubtedly strong enough.
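A minimal sketch of Phoenix's read/write support (table and column names are hypothetical): Phoenix uses UPSERT rather than INSERT, and each statement maps onto HBase operations underneath.

    -- Hypothetical Phoenix table; the PRIMARY KEY becomes the HBase row key
    CREATE TABLE page_views (
      user_id BIGINT NOT NULL,
      url     VARCHAR,
      views   BIGINT
      CONSTRAINT pk PRIMARY KEY (user_id)
    );

    -- Phoenix writes use UPSERT (insert-or-update) semantics
    UPSERT INTO page_views (user_id, url, views) VALUES (42, '/home', 1);

    SELECT url, SUM(views) FROM page_views GROUP BY url;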
Apache Tajo
The Apache Tajo project's goal is to build an advanced data warehousing system on top of HDFS. Tajo bills itself as a "big data warehouse", but it seems similar to the low-latency query engines described above. While it supports external tables and Hive datasets (via HCatalog), its focus is on data management, offering low-latency data access and tools for more traditional ETL. Like the others, it requires deploying Tajo-specific worker processes on the data nodes. Tajo's features include:
- ANSI SQL compliance
- A JDBC driver
- Hive Metastore integration for access to Hive datasets
- A command-line client
- A custom function API

When to use it?
Although some of Tajo's benchmark results look good, benchmarks can be biased and should not be fully trusted. The Tajo community is not yet thriving, and no major Hadoop vendor in North America supports it. But if you are in South Korea, Gruter is the main project sponsor, and if you use their platform you may get good support from them; otherwise, you are better off with an engine like Impala or Presto.
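As a closing sketch of the external-table support mentioned above (the path is hypothetical, and format and option names may vary by Tajo version), Tajo-style DDL registers files already sitting in HDFS:

    -- Hypothetical: expose delimited files in HDFS as a queryable table
    CREATE EXTERNAL TABLE page_views (
      user_id INT8,
      url     TEXT
    ) USING TEXT WITH ('text.delimiter' = '\t')
    LOCATION 'hdfs://namenode:8020/data/page_views';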