Recent advances in SQL on Hadoop: seven technologies in depth

The greatest appeal of big data is the new business value that can be unlocked through analysis and mining, and SQL on Hadoop is a critical direction. CSDN Cloud invited Liang to write this article, giving an in-depth look at seven of the latest technologies in this space. The article is long, but it should reward a careful read. On December 5-6, 2013, the seventh China Big Data Technology Conference (Big Data Technology Conference 2013, BDTC 2013), themed "application-driven architecture and technology", will host in-depth discussions of these hot technologies with friends in the community. If you have any questions, please leave a comment in the comments section.


Big data is now a very hot topic. From an engineering or technical point of view, its core is how to store, analyze, and mine huge volumes of data to solve practical problems. For an engineer or analyst, how to query and analyze TB/PB-scale data is an unavoidable problem in the big data age, and SQL on Hadoop has become an important tool. Why put SQL on Hadoop? Because SQL is easy to use. Why must it be based on Hadoop? Because the Hadoop architecture is robust and scalable. This article analyzes the pros and cons of various SQL on Hadoop products from two perspectives, technical architecture and latest developments: Hive, Tez/Stinger, Impala, Shark/Spark, Phoenix, Hadapt/HadoopDB, and HAWQ/Greenplum.


In Internet companies and in traditional enterprises with big data processing needs, a Hadoop-based data warehouse typically has the following data sources:


Logs collected from Apache/Nginx into HDFS through a log collection and analysis system such as Flume/Scribe/Chukwa, and then queried with Hive (see the sketch after this list).


User and business dimension data (typically stored in Oracle/MySQL) imported periodically into Hive with tools such as Sqoop, so the OLTP data has a copy available for OLAP.


Data imported from other external data warehouse sources through ETL tools.
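
As a minimal sketch of the first source, assuming Flume writes raw tab-separated access logs under a hypothetical HDFS path /data/logs/access, an external Hive table can be laid over the files and queried directly (all table and column names here are made up):

    -- external table over the collected log files; no data is moved
    CREATE EXTERNAL TABLE access_log (
      ip         STRING,
      ts         STRING,
      request    STRING,
      status     INT,
      bytes_sent BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/logs/access';

    -- count requests per HTTP status code
    SELECT status, COUNT(*) AS cnt
    FROM access_log
    GROUP BY status;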


All of the current SQL on Hadoop products fit one or a few specific niches; there is no silver bullet. It is unrealistic to expect a single product that satisfies almost all enterprise-class applications the way Oracle/Teradata did before the big data age, so each SQL on Hadoop product tries to serve the characteristics of a particular class of application. Typical requirements:


Interactive query (ms ~ 3 min)


Data analyst / reporting query (3 min ~ 20 min)


Data mining, modeling, and large-scale ETL (minutes ~ hours ~ days)


Machine learning needs (served through computation models such as MapReduce/MPI/Spark)


Hive


Hive is the most commonly used solution for big data processing and data warehousing in Internet companies; in many companies the Hadoop cluster is not used to run native MapReduce programs at all, but only to run Hive SQL query jobs.


For companies with many data scientists and analysts, there are many query requirements against the same tables, so it is obviously slow and wasteful for everyone to read the data through Hive. It would be much more efficient to put frequently accessed data into a memory-based cluster for users to query. Facebook developed Presto for this need: a system that keeps hot data in memory for SQL queries, with a design very similar in spirit to Impala and Stinger. A simple query in Presto takes only hundreds of milliseconds, and even a very complex query finishes within a few minutes; it runs in memory and does not write to disk. More than 850 Facebook engineers use it to scan more than 320 TB of data every day, satisfying 80% of their ad-hoc query needs.


The main disadvantages of the current Hive:


The data shuffle is a network bottleneck: reduce tasks must wait for the map phase to finish before starting, which does not use network bandwidth efficiently.


One SQL statement is generally parsed into multiple MapReduce jobs, and each Hadoop job writes its output directly to HDFS; the large amount of disk I/O leads to poor performance (see the EXPLAIN sketch after this list).


Tasks have to be started from scratch every time a job is executed, which takes a lot of time and rules out real-time response.


Because SQL is converted into MapReduce jobs, the map, shuffle, and reduce stages that implement it differ functionally. Job shapes such as map->map-reduce or map-reduce->reduce would reduce the number of writes to HDFS and thus improve performance, but the current MapReduce framework does not support tasks shaped like M->MR or MR->R.
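
As a small illustration of the multi-job issue, assuming hypothetical tables orders and users, a join followed by an aggregation typically compiles into more than one MapReduce stage, which EXPLAIN makes visible:

    -- show the MapReduce stages Hive plans for a join plus aggregation
    EXPLAIN
    SELECT u.country, COUNT(*) AS order_cnt
    FROM orders o
    JOIN users u ON o.user_id = u.id
    GROUP BY u.country;
    -- the output lists several Stage-N map-reduce stages; each intermediate
    -- stage writes its result to HDFS before the next one starts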


The main recent improvements to Hive (mostly reflected in Hive 0.11):


1. Merging the multiple MR jobs produced by parsing one Hive SQL statement. Among the MR jobs Hive generates there are many map->map-reduce chains, and this process can be merged into a single MR job. https://issues.apache.org/jira/browse/HIVE-3952


2. Hive query optimizer (the query optimizer is an area Hive needs to keep improving)


For example, join-order optimization: a join between one large table and several small tables on different column conditions used to be resolved into multiple map joins plus an MR job, and can now be merged into a single MR job.


The goal of this direction is that users should not have to give too many hints; Hive should automatically choose the fastest join method based on table size, row count, and so on (a small table can be loaded into memory for a map join, and that map join can be merged with other MR jobs). The idea resembles a cost-based query optimizer: user-written SQL is turned into a more efficient execution plan and join order before it runs.
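
A minimal sketch of both styles, assuming a hypothetical small dimension table dim_city joined to a large fact table page_views; the MAPJOIN hint and the hive.auto.convert.join setting are standard Hive features, while the table and column names are made up:

    -- explicit hint: load dim_city into memory and join on the map side
    SELECT /*+ MAPJOIN(c) */ c.city_name, COUNT(*)
    FROM page_views v
    JOIN dim_city c ON v.city_id = c.city_id
    GROUP BY c.city_name;

    -- or let Hive decide automatically based on table size
    SET hive.auto.convert.join = true;
    SELECT c.city_name, COUNT(*)
    FROM page_views v
    JOIN dim_city c ON v.city_id = c.city_id
    GROUP BY c.city_name;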


3. ORCFile


ORCFile is a columnar storage format with great advantages for analytic workloads.


The original RCFile stores each column as an opaque binary blob without any semantics, so it can only apply generic compression methods such as zlib, LZO, or Snappy. ORCFile knows the type of each column (int, string, and so on), so it can apply lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding. The advantages are twofold: a higher compression ratio, and the ability to filter out unrelated data.


Predicate pushdown: Hive originally read all the data into memory and only then decided which rows matched the query. In ORCFile, data is read into memory stripe by stripe, and the ORCFile RecordReader uses each stripe's metadata (index data, kept resident in memory) to decide whether the stripe can satisfy the query; if it cannot, the stripe is skipped entirely, saving I/O.


From the analysis of ORCFile above, you can see the shadow of BrightHouse: keep per-column indexes, statistics, and dictionaries in memory, use them to evaluate the query's filter conditions, and skip reading anything that cannot match, saving a large amount of I/O.
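
A minimal sketch of creating and querying an ORC table, assuming hypothetical tables page_views_text (a plain-text staging table) and page_views_orc; STORED AS ORC, the orc.compress table property, and hive.optimize.ppd are standard Hive options:

    -- columnar table with lightweight compression
    CREATE TABLE page_views_orc (
      user_id BIGINT,
      url     STRING,
      ts      STRING
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");

    INSERT OVERWRITE TABLE page_views_orc
    SELECT user_id, url, ts FROM page_views_text;

    -- predicate pushdown lets the reader skip stripes whose min/max
    -- statistics show they cannot contain user_id = 42
    SET hive.optimize.ppd = true;
    SELECT url FROM page_views_orc WHERE user_id = 42;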


4. Security and concurrency features of HiveServer2


HiveServer2 supports concurrent client (JDBC/ODBC) access.


Cloudera has also been working on Sentry for security and authorization management across the Hadoop ecosystem. These two features are the main concerns of enterprise-class Hadoop/Hive deployments.


5. HCatalog: a unified metadata management platform for Hadoop


There is no consistency guarantee between the table metadata stored in Hive and the schema of the table data stored on HDFS; keeping them in sync is left to administrators. Currently, changing a column in Hive only modifies the Hive metadata without altering the actual data. For example, if you want to add a column, the Hive command line only changes the Hive metadata and does not change the format of the files stored on HDFS; you also have to change the programs that import data into HDFS so that they produce the new format. Hadoop currently treats tables as 'schema on read', and with HCatalog it becomes possible to move toward the EDW-style 'schema on write'.
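
As a small illustration, assuming a hypothetical table user_events, the ALTER TABLE below only rewrites Hive's metadata; the files already sitting on HDFS are untouched, so old rows simply come back as NULL for the new column:

    -- metadata-only change: no existing HDFS file is rewritten
    ALTER TABLE user_events ADD COLUMNS (device_type STRING);

    -- rows written before the change have no such field on disk,
    -- so the new column reads back as NULL for them
    SELECT device_type, COUNT(*) FROM user_events GROUP BY device_type;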


6. Support for windowing and analytics functions.
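
A minimal sketch of the windowing support added in Hive 0.11, using a hypothetical orders table:

    -- rank each user's orders by time and keep only the most recent one
    SELECT user_id, order_id, amount
    FROM (
      SELECT user_id, order_id, amount,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) AS rn
      FROM orders
    ) t
    WHERE rn = 1;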


Tez/Stinger


Tez is a new DAG computation model built on YARN, designed mainly to optimize Hive. Tez/Stinger is currently driven mainly by Hortonworks, who hope to address near-real-time querying by parsing Hive SQL into a DAG that runs on Tez rather than into MapReduce. The main features of Tez are:


The underlying execution engine is no longer MR but a more general DAG execution engine on top of YARN.


MR abstracts everything into the two operations map and reduce, while Tez provides richer interfaces on top of them: map is decomposed into Input, Processor, Sort, Merge, and Output, and reduce into Input, Shuffle, Sort, Merge, Processor, and Output. This is somewhat similar to Spark: give users a richer set of operational units.


A traditional reduce can only write its output to HDFS, while a Tez reduce Processor can feed its output directly into the next reduce Processor as input.


Hot tables can also be cached in memory.


Tez Service: containers are pre-launched and reused, which cuts the task startup time after each query's execution plan is generated and thus improves responsiveness.


Tez itself is just a library under the YARN framework and requires no separate deployment; you only specify mapreduce.framework.name=yarn-tez (see the sketch below).
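
A minimal sketch of switching a Hive session onto Tez; mapreduce.framework.name=yarn-tez is the setting named above, and hive.execution.engine=tez is the corresponding session-level switch in Stinger-era Hive releases, shown here as an assumption about the deployment:

    -- run the jobs produced by this session on Tez instead of classic MapReduce
    SET mapreduce.framework.name = yarn-tez;
    SET hive.execution.engine = tez;      -- available once the Hive build ships Tez support

    SELECT COUNT(*) FROM page_views_orc;  -- hypothetical table from the ORC sketch above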


Tez/Stinger also has one very important feature: vectorized query execution (available in HDP 2.0 GA).


Hive currently processes data row by row and calls lazy deserialization to parse out each column's Java objects, which clearly hurts efficiency. Vectorized query execution reads and processes batches of rows at once (for basic comparisons or numeric calculations), reducing the number of function calls and increasing CPU utilization and cache hit rates.
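
A minimal sketch of enabling it; hive.vectorized.execution.enabled is the standard switch, and the table is assumed to be the hypothetical ORC table from the earlier sketch, since vectorization operates on ORC data:

    -- process rows in batches instead of one Java object at a time
    SET hive.vectorized.execution.enabled = true;

    SELECT url, COUNT(*)
    FROM page_views_orc          -- hypothetical ORC table
    WHERE user_id > 1000
    GROUP BY url;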


The main directions of future work for Hive on Tez/Stinger: a cost-based optimizer that chooses execution strategies from statistics, for example the most efficient order in which to execute a multi-table join, and that counts the rows/columns of each intermediate table during execution to decide how many MR stages to launch.


Impala


Impala can be viewed as a hybrid of the Google Dremel architecture and an MPP (massively parallel processing) architecture, and is currently driven mainly by Cloudera.


Advantages:


It currently supports two kinds of join: broadcast join and partition join. For large-table joins, memory limits force part of the data to be dumped to disk, which is slower (see the sketch after this list).


Impala pushes data between tasks (whereas MR pulls), i.e. upstream tasks push their results to downstream ones, which spreads network pressure and improves job execution efficiency.


Parquet columnar format, with support for nested data. Through nested data and extended SQL query semantics, joins can be bypassed in some specific scenarios, removing certain performance bottlenecks.


Cloudera Manager 4.6 will include slow-query analysis.


Runtime Code Generation
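
A minimal sketch of the Parquet format and the broadcast join mentioned above, using hypothetical tables sales_parquet and stores; STORED AS PARQUETFILE and the [BROADCAST] hint follow the Impala SQL of this period, and the exact keywords are version-dependent assumptions:

    -- columnar Parquet table for scan-heavy analytics
    CREATE TABLE sales_parquet (
      store_id BIGINT,
      amount   DOUBLE,
      sale_day STRING
    )
    STORED AS PARQUETFILE;   -- later releases also accept STORED AS PARQUET

    -- hint Impala to replicate the small stores table to every node
    SELECT s.store_id, st.store_name, SUM(s.amount) AS total
    FROM sales_parquet s
    JOIN [BROADCAST] stores st ON s.store_id = st.store_id
    GROUP BY s.store_id, st.store_name;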


Disadvantages:


Impala does not sort the output by the GROUP BY columns.


UDFs are not yet supported; Impala 1.2 is expected to support Hive UDFs as well as Impala native UDFs and UDAs.


It has no Serializer/Deserializer (SerDe) mechanism like Hive's, making ETL from unstructured to structured data cumbersome. In essence, Impala is suited to querying data after MR has done the ETL work.


Because Impala is designed for short queries, it does not support fault tolerance: if one of the nodes participating in a query fails, Impala abandons the query.


Security support is still relatively weak: data transferred between impalad processes is not encrypted, and table- or column-level authorization is not supported.


Each PlanFragment is executed as much in parallel as possible, but that is not always easy; for example, a hash join cannot start until one of the tables has been completely scanned.


Despite all these drawbacks, many companies have started to try Impala. Baidu, for example, plugged MySQL into Impala as a storage engine and implemented the corresponding PlanFragment operations; a user's query is parsed in the usual way into PlanFragments and then dispatched directly to the appropriate nodes (HDFS DataNode / HBase RegionServer / MySQL). Some source data or intermediate data is placed in MySQL, and queries involving that data go straight to MySQL.


Shark/Spark


Shark puts as much data as possible into memory and uses memory very aggressively. The advantage is that joins become faster; the disadvantage is high memory consumption, and because it manages memory itself, memory is not released.


Since Shark reuses Hive's codebase, its SQL, SerDes, and UDF support is fully compatible with Hive.


It supports queries of different granularities, from short queries to long-running ones, and therefore has fault-tolerance characteristics.


Performance: for particularly simple select...where queries, Shark's improvement is not obvious (because such queries are not time-consuming in Hive either). For more complex select...join...where...group by queries, Hive generates many MR jobs and reads and writes HDFS many times, so they naturally take longer. Shark performs best when memory is large enough; performance degrades when memory cannot hold all the data, but it is still much better than Hive.


Phoenix


Phoenix is an HBase-based SQL query system open-sourced by Salesforce. The basic idea is to convert a query that would be complicated to express with the HBase client into a series of region scans, combining coprocessors and custom filters to run the query in parallel on multiple region servers and then merge the scan results. Indications are that Phoenix is not meant to be an optimized OLAP system, but rather an OLTP-style system for simple single-table queries, filtering, sorting, and lookups.


Advantages:


HBase stores everything as byte strings by default, but Phoenix supports richer data types (see the Phoenix sketch after this list).


Data is manipulated through JDBC rather than the HBase client API.


Aggregation functions are evaluated on the RegionServer side by filtering on the WHERE condition in a coprocessor. Comparing the architectures, Hive on HBase, Impala on HBase, and Phoenix are similar; the difference is that Hive on HBase and Impala on HBase do not take advantage of coprocessors: they read data into their own process memory through the HBase client API and only then do the filtering and aggregation.


From a query's point of view, HBase columns fall into two categories: the primary key (row key columns) and everything else. The main difference is that row key columns can take advantage of the region server's indexing, filtering, and sorting, while other columns cannot and can only be optimized through a secondary index. Phoenix can create secondary indexes on HBase to optimize conditional queries (currently only on static tables; for a more general HBase secondary-index implementation see https://github.com/Huawei-Hadoop/hindex).


If the condition on a row key column is IN/OR/LIKE, it can be optimized through the region server's skip scan filter.


Dynamic columns support.


With autocommit=false (the default), all operations are first buffered on the client and are sent to HBase in a batch only when you explicitly commit; SQL parsing and optimization are done on the client, which gives it a hint of transaction semantics.
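
A minimal Phoenix sketch using a hypothetical table web_stat; typed columns, UPSERT, and CREATE INDEX are standard Phoenix SQL, issued through its JDBC driver:

    -- typed columns and an explicit primary key (the HBase row key)
    CREATE TABLE IF NOT EXISTS web_stat (
      host       VARCHAR NOT NULL,
      visit_date DATE    NOT NULL,
      hits       BIGINT,
      CONSTRAINT pk PRIMARY KEY (host, visit_date)
    );

    -- Phoenix writes go through UPSERT rather than INSERT
    UPSERT INTO web_stat VALUES ('www.example.com', TO_DATE('2013-11-01'), 150);

    -- secondary index so a filter on a non-row-key column avoids a full scan
    CREATE INDEX web_stat_hits_idx ON web_stat (hits);

    -- row key prefix filter, served by the region servers' scans
    SELECT visit_date, hits FROM web_stat WHERE host = 'www.example.com';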


Disadvantages:


It does not support joins. But considering that HBase's design intent is to avoid complex joins by storing redundant data, you can in fact put related data in the same table instead of splitting it into multiple tables just to reduce redundancy, so to a large extent this is not a real disadvantage.


Architecturally it only converts SQL into HBase client API and coprocessor calls, and coprocessors are not suited to large-scale data transfer, so when intermediate results are large the performance problem becomes obvious.

One disadvantage is shared by all SQL-on-HBase systems (including Hive on HBase and Impala on HBase): every request to an HBase region server must go through the RegionScanner, an interface whose file-reading path is not optimized for OLAP-style applications. For example, the RegionScanner implementation performs many per-condition comparisons, which is not good for full table scans.


Another problem is the coprocessor itself: because the coprocessor runs in the same JVM as the HBase region server, very complex coprocessor logic with large intermediate results consumes a lot of memory. Moreover, coprocessors do not read data in a streaming fashion, so too much data accumulating on a node leads to out-of-memory problems.


Roadmap:


Join support: although somewhat at odds with the original design intent, everyone else supports it, and not supporting it would look too outdated.


Transaction support, drawing on the approach of https://github.com/yahoo/omid.


Online schema evolution: dynamically changing column types, renames, and so on.


Hadapt/HadoopDB

The architecture is similar to Hive's, with two types of underlying storage engines: HDFS and an RDBMS (PostgreSQL), with one RDBMS instance on each DataNode.


It provides two interfaces, SQL and MR; SQL is also parsed into MR jobs to execute, so the overall execution engine is MR.


Multiple MR tasks are collapsed into single-node SQL plus one MR job (for the data shuffle). Through this kind of horizontal and vertical compression it minimizes the number of MR jobs a SQL statement is parsed into and reduces the intermediate data written to HDFS between jobs. A SQL statement is split into two parts: the part suited to stand-alone SQL on a single node, and the part that is not and needs MR (the data shuffle).


The difference from Hive is that Hive can only operate on data in HDFS, while Hadapt can operate on both HDFS and RDBMS data sources. For the RDBMS source, the data is preloaded into the distributed RDBMS nodes, with a unified catalog managing the data in all RDBMS instances. For example, part of the execution logic of a map task is obtained (via a modified InputFormat) directly from a SQL statement executed on the RDBMS, and MR is then used for the join/group by. And if the data distribution produced when loading into the distributed PostgreSQL nodes already matches the group by/order by, the MR shuffle is skipped entirely.

In essence Hadapt still parses SQL into MR jobs, so it inherits some of Hive's drawbacks (long startup time, inefficient joins). And for join/group by/order by to execute efficiently across RDBMS data sources, the problem of data preloading has to be considered.


When multiple queries are run, later queries can reuse the results of earlier ones to improve performance (somewhat like materialized views in a data warehouse).


Most enterprise solutions today use the Hadoop+MPP approach: batch-process unstructured data with Hadoop (for ETL), then import it through a connector into an MPP database for structured-data queries. But this is only a stopgap; Hadapt argues that its invisible loading is the most reasonable approach, giving the enterprise a single unified analysis platform.


HAWQ


The original GPDB storage was local disk; it has now been changed to HDFS. The RDBMS on each GPDB node keeps its role as an execution engine but no longer acts as the storage engine.

Queries execute through the GPDB parallel execution engine (no longer MR); each query starts by importing data from HDFS into GPDB, and data is exchanged in memory rather than, as with MR, written to disk at the end of each task.


GPDB's cost-based parallel query optimizer and planner is one of its major strengths, something most other products currently lack; it helps choose the most efficient execution order for a SQL statement.


The benefits of using GPDB as the execution engine: standard SQL compatibility and ACID transaction support; JDBC/ODBC support; join-order optimization and index support (query optimizer); support for both row and column storage formats (see the sketch below).
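
A minimal sketch of the row/column storage choice, assuming a hypothetical fact table clicks; the appendonly/orientation options and the DISTRIBUTED BY clause follow Greenplum's SQL dialect, from which HAWQ is derived:

    -- column-oriented, append-only table, hash-distributed across segments
    CREATE TABLE clicks (
      user_id  BIGINT,
      url      TEXT,
      click_ts TIMESTAMP
    )
    WITH (appendonly = true, orientation = column, compresstype = zlib)
    DISTRIBUTED BY (user_id);

    -- a row-oriented table is simply the default orientation
    CREATE TABLE clicks_row (LIKE clicks) DISTRIBUTED BY (user_id);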


GPXF enables HAWQ to read data in any format stored on HDFS and data stored in other file systems and devices.

The underlying HDFS needs to support truncate semantics and a native C interface.


Supports in-database analytics (http://madlib.net/).


Performance-related:


Scott Yara (Greenplum co-founder) openly admits that HAWQ is slower than pure GPDB; the purpose is to make better use of HDFS's scalability and unified storage management.

In performance comparisons with other SQL on Hadoop products, HAWQ shows a significant advantage in group by and join operations, provided the data volume is not especially large. (Could that be because partitioning is done well at data-import time, i.e. load time is traded for group by/join time?)


In short, the common weaknesses in the SQL on Hadoop domain are:


1. Workload management and query optimization: how should a multi-table join be executed? A three-table join, for example, has six possible execution orders; which is the most efficient? The obvious approach is to estimate the cost of each execution order. Traditional databases and data warehouses have very good query optimizers, but how to relate the relevant metrics (disk I/O, network bandwidth, memory) to final query efficiency in a distributed system is a problem that still needs careful study (see the EXPLAIN sketch below).
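
As a small illustration using hypothetical tables a, b, and c, EXPLAIN shows which of the possible join orders the engine actually picked; with three tables there are 3! = 6 orderings to cost:

    -- inspect the join order the optimizer chose for a three-table join
    EXPLAIN
    SELECT a.k, SUM(c.v)
    FROM a
    JOIN b ON a.k = b.k
    JOIN c ON b.k = c.k
    GROUP BY a.k;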


2. Correlated subqueries: no one has implemented them yet. There are many examples of correlated subqueries in the TPC benchmarks, but current SQL on Hadoop products do not support them. People from Impala say their customers' demand for this is not strong, and most correlated subqueries can be rewritten as join operations, but a commercial product like HAWQ does support correlated subqueries (see the sketch below).
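
A small example of the rewrite being discussed, with hypothetical tables customers and orders; the correlated form references the outer row inside the subquery, and the join form is the equivalent that engines of this period can run:

    -- correlated subquery: the inner query refers to the outer row (c.id)
    SELECT c.id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.customer_id = c.id AND o.amount > 1000);

    -- equivalent join rewrite that Hive/Impala of this period can execute
    SELECT DISTINCT c.id, c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 1000;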


In addition to the open-source products discussed above, there are many commercial products in the big data analytics field. They fall into two categories: those aimed at enterprise deployments and sold as licenses or as bundled software and hardware, such as Teradata/Aster Data, HP/Vertica, SAP/HANA, and IBM/BigSQL (Oracle and Microsoft have similar products); and those that offer data analysis as a service on top of large cloud computing infrastructure, such as Google/BigQuery (a typical analytics-as-a-service offering) and Amazon/Redshift.