Big data query analysis is one of the core problems in cloud computing. Google's papers from the mid-2000s laid the groundwork for the field: GFS, MapReduce, and BigTable are the three cornerstones of cloud computing's underlying technology. GFS and MapReduce directly inspired the birth of the Apache Hadoop project, while BigTable and Amazon Dynamo spawned the new NoSQL database domain, shaking the decades-old dominance of RDBMS-based commercial databases and data warehouses. Facebook's Hive project is a data warehouse infrastructure built on Hadoop that provides a range of tools for storing, querying, and analyzing large-scale data. While the industry was still absorbing, mastering, and imitating GFS, MapReduce, and BigTable, Google continued after 2009 to publish a series of new technologies, including Dremel, Pregel, Percolator, Spanner, and F1. Among them, Dremel prompted the rise of real-time computing systems, Pregel opened a new direction in graph computation, Percolator made distributed incremental index updates the new standard for text retrieval, and Spanner and F1 showed the possibility of a cross-datacenter database. In Google's second wave of technology, building on Hive and Dremel, the emerging big data company Cloudera open-sourced the big data query analysis engine Impala, Hortonworks open-sourced Stinger, and Facebook open-sourced Presto. Similarly inspired by Pregel, UC Berkeley's AMPLab developed graph computation on its Spark framework and, with Spark at its core, open-sourced the big data query analysis engine Shark. Motivated by the selection of a big data query engine for a telecom operator's project, this paper briefly introduces the five main open source big data query and analysis engines, Hive, Impala, Shark, Stinger, and Presto, and then summarizes and forecasts their performance. The evolution of Hive, Impala, Shark, Stinger, and Presto is shown in Figure 1.
Figure 1. An evolutionary atlas of Impala, Shark, Stinger and Presto
Introduction to current mainstream engines
Hadoop, based on the MapReduce model, specializes in batch data processing and is not well matched to interactive queries. Real-time queries typically use an MPP (massively parallel processing) architecture, so users have had to choose between the Hadoop and MPP technologies. In Google's second wave of technology, fast SQL access technologies built on the Hadoop architecture gradually gained attention, and there is now a new trend of combining MPP and Hadoop to provide fast SQL access frameworks. Four open source tools have recently become popular: Impala, Shark, Stinger, and Presto. This also reflects the big data domain's expectation of real-time query support in the Hadoop ecosystem. In general, Impala, Shark, Stinger, and Presto are all SQL-like real-time big data query analysis engines, but their technical focuses differ considerably. They are not meant to replace Hive, which remains very valuable for data warehousing. These four systems, like Hive, are data query tools built on top of Hadoop, each adapted to a different niche, but from the client side they have much in common with Hive, such as data table metadata, the Thrift interface, ODBC/JDBC drivers, SQL syntax, flexible file formats, and storage resource pools. The relationship between Hive and Impala, Shark, Stinger, and Presto in Hadoop is shown in Figure 2. Hive is suitable for long-running batch query analysis, while Impala, Shark, Stinger, and Presto are suitable for real-time interactive SQL queries, giving data analysts big data tools for rapid experimentation and validation of ideas. A common pattern is to use Hive for data transformation, and then use one of the four systems for fast analysis on the result dataset produced by Hive. Below is a brief introduction to Hive, Impala, Shark, Stinger, and Presto from the perspective of their problem domains:
1) Hive: an SQL cloak draped over MapReduce. Hive adds an SQL layer on top of MapReduce to make it friendlier to use, but because Hive speaks SQL, its problem domain is narrower than MapReduce's: many problems cannot be expressed in SQL, such as certain data mining algorithms, recommendation algorithms, and image recognition algorithms, which can still only be solved by writing MapReduce jobs directly.
2) Impala: an open source implementation of Google's Dremel (Apache Drill is similar). Driven by the need for interactive real-time computation, Cloudera launched the Impala system, which is suitable for interactive real-time processing scenarios and requires that the final result data volume be relatively small.
3) Shark/Spark: to improve the computational efficiency of MapReduce, Berkeley's AMPLab developed Spark, a memory-based MapReduce implementation. In addition, Berkeley wrapped an SQL layer around Spark, creating a new Hive-like system called Shark.
4) Stinger Initiative (Tez-optimized Hive): Hortonworks open-sourced a DAG computing framework, Tez, which can be understood as an open source counterpart to Google's Pregel. Like MapReduce, the framework can be used to design DAG applications, but note that Tez can only run on YARN. An important application of Tez is to optimize the typical DAG scenarios of Hive and Pig: it streamlines the DAG by reducing intermediate data read/write I/O, which makes Hive considerably faster.
5) Presto: in November 2013, Facebook open-sourced Presto, a distributed SQL query engine designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto defines a simple data storage abstraction layer so that SQL queries can run over a variety of data storage systems, including HBase, HDFS, Scribe, and so on.
Figure 2. The relationship between Hive and Impala, Shark, Stinger, and Presto in Hadoop
Current mainstream engine architecture
Hive is a Hadoop-based data warehousing tool that maps structured data files onto database tables and provides full SQL query functionality by translating SQL statements into MapReduce jobs; it is well suited to statistical analysis in a data warehouse. As shown in Figure 3, Hadoop and MapReduce are the backbone of the Hive architecture. The Hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, Metastore, and Driver (compiler, optimizer, and executor).
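To make Hive's SQL-to-MapReduce translation concrete, here is a minimal pure-Python sketch (the table, column names, and data are invented for illustration) of how a GROUP BY aggregation conceptually becomes a map phase, a shuffle, and a reduce phase:

```python
from collections import defaultdict

# Hypothetical rows for a table t(region, amount). The SQL being "compiled":
#   SELECT region, SUM(amount) FROM t GROUP BY region
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]

def map_phase(rows):
    # Mapper: emit one (group key, value) pair per input row.
    for region, amount in rows:
        yield region, amount

def shuffle(pairs):
    # Shuffle: group all values by key (done by the MapReduce framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: apply the aggregate function (SUM) to each group.
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)  # {'east': 17, 'west': 6, 'north': 3}
```

Every Hive query of this shape pays the cost of launching such a job and writing its output to HDFS, which is why Hive suits batch analysis rather than interactive queries.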
Figure 3. Hive schema
Impala is a real-time interactive SQL big data query tool developed by Cloudera and inspired by Google's Dremel; it can be seen as a union of the Google Dremel architecture and an MPP (massively parallel processing) architecture. Impala does not reuse slow Hive/MapReduce batch processing; instead it uses a distributed query engine similar to a commercial parallel relational database (consisting of the Query Planner, Query Coordinator, and Query Exec Engine), so it can query data directly from HDFS or HBase with SELECT, JOIN, and statistical functions, greatly reducing latency. As shown in Figure 4, Impala consists mainly of the impalad daemons, the State Store, and the CLI. An impalad runs on the same node as a DataNode and is represented by the impalad process. It receives query requests from clients (the impalad that receives a request invokes the Java front end via JNI to parse the SQL query and generate the query plan tree), distributes the plan fragments through the scheduler to the other impalads holding the corresponding data, reads and writes that data, executes the query in parallel, and streams the results back over the network to the coordinator, which returns them to the client. Each impalad also maintains a connection to the State Store to determine which impalads are healthy and can accept new work. The Impala State Store tracks the health status and location of the impalads in the cluster; it is represented by the statestored process, which creates multiple threads to handle impalad registrations and subscriptions and to keep a heartbeat connection with each impalad. Each impalad caches a copy of the State Store's information, so when the State Store goes offline the impalads can keep working from the cache; however, because the cache can no longer be updated, if some impalads then fail, execution plans may still be assigned to the failed impalads, causing queries to fail.
The CLI provides a command-line tool for user queries; Impala also provides Hue, JDBC, ODBC, and Thrift interfaces.
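The scatter/gather execution described above can be modeled in a few lines of Python. This is only a toy sketch (partition contents are invented): data is pre-partitioned across "impalad" workers, the coordinator fans the same plan fragment out to each worker in parallel, and the partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands for the data stored locally on one impalad's node.
partitions = [[4, 8, 15], [16, 23], [42]]

def run_fragment(partition):
    # Each impalad executes the same plan fragment (here: a local SUM)
    # against only the data stored on its own node.
    return sum(partition)

def coordinator(partitions):
    # The coordinator runs the fragments in parallel and merges the
    # streamed-back partial results into the final answer.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(run_fragment, partitions))
    return sum(partials)

print(coordinator(partitions))  # 108
```

A real coordinator also handles plan generation, data locality, and failure detection via the State Store; none of that is modeled here.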
Figure 4. Impala Architecture
Shark is a data warehouse product open-sourced by UC Berkeley's AMPLab. It is fully compatible with Hive's HQL syntax, but unlike Hive, whose computational framework is MapReduce, Shark uses Spark. In short, Hive is SQL on MapReduce, and Shark is Hive on Spark. As shown in Figure 5, to maximize compatibility with Hive, Shark reuses most of Hive's components, as follows:
1) SQL Parser & Plan Generation: Shark is fully compatible with Hive's HQL syntax and uses the Hive API to implement query parsing and query plan generation; only the final physical plan execution phase uses Spark instead of Hadoop MapReduce;
2) Metastore: Shark uses the same meta information as Hive, so tables created in Hive are seamlessly accessible from Shark;
3) SerDe: Shark's serialization mechanism and data types are exactly the same as Hive's;
4) UDF: Shark can reuse all UDFs in Hive. By configuring Shark parameters, Shark can automatically cache a specific RDD (Resilient Distributed Dataset) in memory, enabling data reuse that speeds up retrieval of that data set. At the same time, Shark can implement specific data analysis and learning algorithms as user-defined functions, so that SQL querying and analytical computation can be combined and RDD reuse maximized;
5) Driver: Shark wraps Hive's CliDriver in a SharkCliDriver, which is the entry point of the shark command;
6) ThriftServer: Shark wraps Hive's ThriftServer (which supports JDBC/ODBC) in a SharkServer, and likewise provides JDBC/ODBC services.
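The in-memory caching mentioned in point 4) is the heart of Shark's speed advantage over Hive. A toy sketch (the table name and rows are invented) of the idea: the first scan of a table pays the cost of a simulated disk read, and every later scan is served from the in-memory copy.

```python
# Simulated "HDFS" storage and a count of slow disk reads performed.
DISK = {"logs": [("GET", 200), ("GET", 404), ("POST", 200)]}
disk_reads = 0
cache = {}

def scan_table(name):
    # Return the table rows, caching them in memory after the first read,
    # roughly like an RDD marked for caching in Shark.
    global disk_reads
    if name not in cache:
        disk_reads += 1          # simulated slow HDFS read
        cache[name] = DISK[name]
    return cache[name]

def count_status(name, status):
    return sum(1 for _, s in scan_table(name) if s == status)

print(count_status("logs", 200))  # 2 (first query, reads from "disk")
print(count_status("logs", 404))  # 1 (second query, served from cache)
print(disk_reads)                 # 1
```

The benefit grows with the number of repeated queries against the same data set, which matches the evaluation results discussed later: Shark shines when memory is sufficient.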
Figure 5. Shark architecture
Spark is a general-purpose parallel computing framework in the mold of Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab. Spark's distributed computation is based on the MapReduce model and retains the benefits of Hadoop MapReduce; unlike MapReduce, however, intermediate job output and results can be kept in memory, eliminating the need to read and write HDFS between stages. Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as data mining and machine learning. Its architecture is shown in Figure 6:
Figure 6. Spark Architecture
In contrast to Hadoop, Spark keeps intermediate data in memory and is more efficient for iterative operations, so Spark suits applications that operate on a particular data set repeatedly: the more often the data must be revisited and the more data is read, the greater the benefit; with small data volumes but computation-intensive workloads, the benefit is smaller. Spark is also more general than Hadoop: it offers many kinds of data set operations (map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, etc.), while Hadoop offers only the two operations map and reduce. Spark can read and write data directly to HDFS and also supports Spark on YARN. Spark can run in the same cluster as MapReduce, sharing storage and compute resources; the data warehouse Shark, implemented by borrowing from Hive, is almost fully Hive-compatible.
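The richer operation set can be illustrated with a toy, in-memory stand-in for Spark's RDD interface. Only a few operations are sketched here and the data is invented; real RDDs are distributed, lazily evaluated, and fault-tolerant.

```python
from collections import defaultdict

class MiniRDD:
    """A tiny local imitation of an RDD, for illustration only."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduceByKey(self, f):
        # Group (key, value) pairs by key, then fold each group with f.
        groups = defaultdict(list)
        for k, v in self.data:
            groups[k].append(v)
        return MiniRDD((k, _fold(f, vs)) for k, vs in groups.items())

    def collect(self):
        return self.data

def _fold(f, values):
    acc = values[0]
    for v in values[1:]:
        acc = f(acc, v)
    return acc

words = MiniRDD(["spark", "hive", "spark", "tez", "spark", "hive"])
counts = (words.filter(lambda w: w != "tez")
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(counts)  # [('spark', 3), ('hive', 2)]
```

In Hadoop, this same pipeline would have to be forced into a single map function and a single reduce function, with the filtering folded awkwardly into the mapper.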
Stinger is a real-time SQL-like instant query system open-sourced by Hortonworks, claimed to be as much as 100 times faster than Hive. Unlike Hive, Stinger uses Tez: Hive is SQL on MapReduce, while Stinger is Hive on Tez. An important role of Tez is to optimize the typical DAG scenarios of Hive and Pig by reducing intermediate data read/write I/O, which makes Hive considerably faster. As shown in Figure 7, Stinger adds an optimization layer to the existing Hive base (the framework is YARN-based); all queries and statistics pass through this optimization layer to reduce unnecessary work and resource overhead. Although Stinger also optimizes and strengthens Hive itself, its overall performance depends on the performance of its underlying framework, Tez. Tez, the DAG computing framework open-sourced by Hortonworks, can be understood as an open source counterpart to Google's Pregel; like MapReduce, it can be used to design DAG applications, but note that Tez can only run on YARN.
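The I/O savings Tez provides can be shown with a toy simulation (the stages and data are invented): on classic MapReduce, a multi-stage query runs as separate jobs, each materializing its output to HDFS and re-reading it; a Tez DAG chains the stages together, passing intermediate data along without the extra round trips.

```python
# Count simulated HDFS traffic between stages.
disk_io = 0

def hdfs_write(data):
    global disk_io
    disk_io += len(data)
    return data

def hdfs_read(data):
    global disk_io
    disk_io += len(data)
    return data

stage1 = lambda rows: [r * 2 for r in rows]       # e.g. a projection
stage2 = lambda rows: [r for r in rows if r > 4]  # e.g. a filter

def run_as_mapreduce(rows):
    # Two MR jobs: job 1 writes its output to HDFS, job 2 reads it back.
    intermediate = hdfs_write(stage1(rows))
    return stage2(hdfs_read(intermediate))

def run_as_tez_dag(rows):
    # One DAG: each vertex's output flows directly into the next vertex.
    return stage2(stage1(rows))

rows = [1, 2, 3, 4]
assert run_as_mapreduce(rows) == run_as_tez_dag(rows) == [6, 8]
print(disk_io)  # 8 simulated reads/writes, all incurred by the MR path
```

Both paths produce the same answer; the DAG path simply skips the intermediate persistence, which is exactly the overhead the Stinger Initiative targets.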
Figure 7. Stinger Architecture
In November 2013 Facebook open-sourced Presto, a distributed SQL query engine designed for high-speed, real-time data analysis. It supports a subset of standard ANSI SQL, including complex queries, aggregations, joins, and window functions. As shown in the simplified architecture of Figure 8, the client sends an SQL query to the Presto coordinator. The coordinator checks the syntax, analyzes the query, and plans its execution. The scheduler wires together the execution pipeline, assigns tasks to the nodes closest to the data, and then monitors execution. The client pulls data from the output stage, which in turn pulls from the stages beneath it. Presto's operational model is fundamentally different from Hive's. Hive translates a query into multi-stage MapReduce tasks that run one after another, each reading its input from disk and writing intermediate results back to disk. Presto does not use MapReduce at all: it uses a custom query execution engine with operators designed to support SQL semantics. Beyond an improved scheduling algorithm, all data processing is performed in memory, with the processing stages connected over the network into a pipeline, avoiding unnecessary disk reads and writes and the extra latency they cause. This pipelined execution model runs multiple processing stages at once, passing data from one stage to the next as soon as it becomes available, which greatly reduces end-to-end response time for many kinds of queries. Presto also defines a simple data storage abstraction layer so that SQL queries can run over different data storage systems; in addition to Hive/HDFS, its storage connectors currently support HBase, Scribe, and custom in-house systems.
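Presto's pipelined execution can be sketched with Python generators (rows and column names are invented): each operator pulls rows from the stage below and pushes them up as soon as they are available, so nothing is materialized to disk between stages.

```python
def scan():
    # Leaf stage: produce rows from a (simulated) storage connector.
    for row in [{"user": "a", "ms": 120},
                {"user": "b", "ms": 15},
                {"user": "a", "ms": 300},
                {"user": "c", "ms": 45}]:
        yield row

def slow_requests(rows, threshold_ms):
    # Filter stage: streams rows through one at a time, never buffering
    # the whole input, like an operator in a Presto pipeline.
    for row in rows:
        if row["ms"] >= threshold_ms:
            yield row

def count_by_user(rows):
    # Final aggregation stage, the only one that must hold state.
    counts = {}
    for row in rows:
        counts[row["user"]] = counts.get(row["user"], 0) + 1
    return counts

result = count_by_user(slow_requests(scan(), 100))
print(result)  # {'a': 2}
```

In Hive, the filter and the aggregation would each be part of a MapReduce stage with its results persisted to disk in between; here the stages overlap in time and hand rows off in memory, which is the essence of Presto's latency advantage.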
Figure 8. Presto architecture
Performance Evaluation Summary
Through the evaluation and analysis of Hive, Impala, Shark, Stinger and Presto, we summarize the following:
1) Columnar storage generally brings a noticeable improvement in query performance, especially when a large table has many columns; for example, Stinger (Hive 0.11 with ORCFile) vs. Hive, and Impala's Parquet vs. text files;
2) Bypassing the MapReduce computation model, thereby eliminating the persistence of intermediate results and the latency of MR task scheduling, yields performance gains; for example, Impala, Shark, and Presto outperform Hive and Stinger, but this advantage shrinks as data volume increases and queries become more complex;
3) MPP database technology helps with join queries; for example, Impala has a clear advantage in two-table and multi-table join queries;
4) Making full use of caching gives obvious performance advantages when memory is sufficient; for example, Shark and Impala show clear performance advantages on small data volumes, but performance degrades severely when memory runs low, and Shark in particular can run into many problems;
5) Data skew can seriously affect the performance of some systems; for example, Hive, Stinger, and Shark are sensitive to data skew and prone to it, while Impala is less affected by it.
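Point 1) above, the advantage of columnar storage, can be illustrated with a toy simulation (the table shape and data are invented): a query that touches one column of a row-oriented file still scans every field of every row, while a columnar layout reads only the needed column. "Bytes read" are simulated by counting scanned values.

```python
# A 4-column table with 1000 rows: (id, name, address, category).
rows = [(i, f"name{i}", f"addr{i}", i % 7) for i in range(1000)]

def sum_last_column_row_store(rows):
    # Row layout: every field of every row comes off "disk".
    scanned = 0
    total = 0
    for row in rows:
        scanned += len(row)
        total += row[3]
    return total, scanned

def sum_last_column_column_store(rows):
    # Column layout: only the one needed column is stored contiguously
    # and read, as in ORCFile or Parquet.
    column = [row[3] for row in rows]
    scanned = len(column)
    return sum(column), scanned

total_r, scanned_r = sum_last_column_row_store(rows)
total_c, scanned_c = sum_last_column_column_store(rows)
assert total_r == total_c
print(scanned_r, scanned_c)  # 4000 1000
```

The saving here is 4x for a 4-column table; for the wide fact tables common in data warehouses, the ratio, and hence the I/O saving, is far larger, which is why ORCFile and Parquet matter so much in the benchmarks above.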
Among the five open source analysis engines, Hive, Impala, Shark, Stinger, and Presto, Impala's overall performance is in most cases the most stable, its response times the best, and its installation and configuration relatively easy; it is followed by Presto, Shark, Stinger, and Hive, in that order. Shark performs best when memory is sufficient and no join operations are involved.
Summary and Prospect
For big data analytics projects, technology is often not the most critical factor; the key is who has the stronger ecosystem, and a momentary technical lead is not enough to ensure a project's ultimate success. For Hive, Impala, Shark, Stinger, and Presto, it is hard to say which product will become the de facto standard, but the one thing we can be sure of and firmly believe is that big data analytics will continue to spread as new technologies evolve, which is always good news for users. For example, readers who follow the evolution of next-generation Hadoop (YARN) will notice that YARN already supports computational paradigms other than MapReduce (such as Shark and Impala), so Hadoop will probably become a large, inclusive platform in the future, offering a variety of data processing technologies, from second-level queries to big data batch processing, to meet a wide range of user needs.
Beyond open source solutions such as Hive, Impala, Shark, Stinger, and Presto, traditional vendors like Oracle and EMC are not sitting idly by waiting for their markets to be eroded by open source software. EMC, for example, has launched the HAWQ system, claiming it is more than 10 times faster than Impala, and Amazon's Redshift also offers better performance than Impala. Although open source software has strong momentum thanks to its cost advantage, traditional database vendors will still try to compete by offering products that are stronger on performance, stability, and maintenance services, while also participating in open source communities, leveraging open source software to enrich their product lines and enhance their competitiveness, and meeting certain customer needs through higher value-added services. After all, these vendors have accumulated deep technology and experience in traditional fields such as parallel databases. On the whole, big data analysis technology will become increasingly mature, cheaper, and easier to use, and users will accordingly find it easier to mine valuable business information from their big data.