Recent advances in SQL on Hadoop systems (1)


Transferred from: http://blog.jobbole.com/47892/

Why put SQL on Hadoop? Because SQL is easy to use.
Why base it on Hadoop? Because Hadoop provides a robust and scalable architecture.

The main SQL on Hadoop products currently include:
Hive, Tez/Stinger, Impala, Shark/Spark, Phoenix, Hawq/Greenplum, HadoopDB, Citusdata, and others. This article focuses on the features and latest developments of Hive, Tez/Stinger, Impala, Shark, and the traditional open source data warehouse BrightHouse; the next article will discuss Hawq/Greenplum, Phoenix, HadoopDB, and Citusdata.

The main data sources for Hadoop-based data warehouses in Internet companies are the following:
1. Logs (e.g. Apache/Nginx access logs) collected from server clusters into HDFS by log collection and analysis systems such as Flume/Scribe/Chukwa; a SerDe is then specified when creating the Hive table to turn the unstructured log data into structured data (see the sketch after this list).
2. User and business dimension data (typically stored in Oracle/MySQL) imported into Hive periodically with a tool such as Sqoop, so that the OLTP data has a copy available for OLAP.
3. Data imported from other external DW data sources through ETL tools.
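As a minimal sketch of point 1 (the table name, HDFS path, and regular expression are made up for illustration), raw Nginx access logs landed on HDFS by Flume can be exposed as a structured Hive table with a RegexSerDe:

```sql
-- Hypothetical example: expose raw access logs on HDFS as a structured Hive table.
-- Path, table name, and regex are illustrative only.
CREATE EXTERNAL TABLE access_log (
  remote_addr STRING,
  time_local  STRING,
  status      STRING,
  body_bytes  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \\[([^\\]]*)\\] ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE
LOCATION '/flume/nginx/access_log';
```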

None of the current SQL on Hadoop products is a silver bullet; each is really only suitable for one or a few specific kinds of workload. It is unrealistic at this stage to satisfy almost all enterprise applications the way Oracle/Teradata does, so every SQL on Hadoop product is trying to serve the characteristics of a particular class of applications.
Typical requirements:
1. Interactive queries (ms to ~3 min)
2. Data analyst and reporting queries (~3 min to ~20 min)
3. Data mining, modeling, and large ETL jobs (minutes to hours to days)
4. Machine learning workloads (served by computation frameworks such as MapReduce/MPI/Spark)

Hive
Hive is the most common solution for big data processing and data warehousing in Internet companies; many companies deploy Hadoop clusters not to run hand-written MapReduce programs at all, but to run Hive SQL queries.

Companies with many data scientists and analysts see a large number of queries against the same tables, so it is clearly slow and wasteful of resources for everyone to read that data through Hive every time. When we deploy online database systems, we put Redis or Memcache in front of the DB to cache frequently accessed data; OLAP applications can take a similar approach and keep frequently accessed data in an in-memory cluster for users to query.
Facebook developed Presto for exactly this need: a system that keeps hot data in memory for SQL queries. Its design idea is very similar to Impala and Stinger. A simple Presto query takes only a few hundred milliseconds, and even a very complex query finishes within minutes; it runs in memory and does not write to disk. More than 850 Facebook engineers use it every day, scanning more than 320 TB of data, and it covers about 80% of the ad-hoc query requirements.

Hive's main drawbacks today:
1. Data shuffle is a network bottleneck: the reduce phase cannot start until the map phase finishes, so network bandwidth is not used efficiently.
2. A single SQL statement is typically parsed into multiple MR jobs, and each Hadoop job writes its output directly to HDFS, which hurts performance.
3. Every job has to start its own tasks, which takes a lot of time and makes real-time response impossible.
4. When SQL is translated into MapReduce jobs, the SQL operators executed by map, shuffle, and reduce differ, so there is a need for pipelines like map->MapReduce or MapReduce->reduce; merging jobs this way would reduce the number of HDFS writes and improve performance.

The main improvements currently under way in Hive:

1. Merging the multiple MR jobs that a single Hive SQL statement is parsed into.
Hive produces a lot of jobs of the map->MapReduce pattern, and this chain can be merged into a single MR job (see the sketch below): https://issues.apache.org/jira/browse/HIVE-3952
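For intuition (table and column names are made up), a query like the one below is parsed by older Hive versions into a map-only map-join job followed by a separate MR job for the aggregation; the work in HIVE-3952 merges such chains into a single job:

```sql
-- Illustrative only: a small dimension table map-joined to a large fact table,
-- followed by a GROUP BY. Without the merge this runs as map -> MapReduce (two jobs).
SELECT /*+ MAPJOIN(d) */ d.country, COUNT(*) AS pv
FROM fact_pageview f
JOIN dim_geo d ON f.geo_id = d.geo_id
GROUP BY d.country;
```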

2. Hive query optimizer

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/optimize-joins.html

    • Joins where one side fits in memory
    • Star-schema join improvement: when a large table is joined with multiple small tables on different columns, the query used to be parsed into several map joins plus MR jobs; it can now be combined into a single MR job.

The direction of improvement is that the user should not have to give so many hints; Hive should automatically choose the fastest join method based on table size, row counts, and so on (a small enough table can be loaded into memory for a map join, and map joins can be merged with other MR jobs). The idea is somewhat like a cost-based query optimizer: before the user's SQL is translated into an execution plan, the costs of the candidate plans are estimated and the more efficient one is chosen.
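A minimal sketch of letting Hive pick the join method itself instead of relying on user hints; both settings exist in Hive, and the threshold value here is just an example:

```sql
-- Automatically convert a join into a map join when one side is small enough,
-- without requiring a /*+ MAPJOIN(t) */ hint from the user.
SET hive.auto.convert.join = true;
-- Tables smaller than this many bytes are considered small enough to load into memory.
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT f.page_id, d.country
FROM fact_pageview f
JOIN dim_geo d ON f.geo_id = d.geo_id;
```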

3. ORCFile
ORCFile is a columnar storage format, which is a big advantage for analytic workloads. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
The original RCFile treated each column as an opaque binary blob without any semantics, so it could only apply general-purpose compression such as zlib, LZO, or Snappy.
ORCFile knows the type of each column (int, string, ...), so it can use lightweight, type-specific compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding. The advantages are twofold: a higher compression ratio, and the ability to skip irrelevant data effectively.
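As a sketch (table name and codec choice are illustrative), switching a Hive table to ORC is just a matter of the storage clause; the per-column encodings described above are then applied automatically:

```sql
-- Store the table in ORC; type-specific encodings (RLE, dictionary, ...) are picked
-- per column automatically. The compression codec is just an example.
CREATE TABLE pageview_orc (
  page_id   BIGINT,
  country   STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Typically populated from an existing text-format table:
INSERT OVERWRITE TABLE pageview_orc
SELECT page_id, country, view_time FROM pageview_text;
```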
Three kinds of encoding are currently implemented in the code:

    • Bit encoding, usable for all data types. Like Google's protocol buffers, it uses the high bit of each byte to indicate whether the byte is the last one and the lower 7 bits to encode the data.
    • Run-length encoding, specific to the int type.
    • Dictionary encoding, specific to the string type. The dictionary also helps filter predicate conditions in queries.

For some columns, run-length encoding reduces storage by 3-4 orders of magnitude and memory footprint by 2-3 orders of magnitude; dictionary encoding typically reduces disk space by about 20x and memory space by about 5x. According to Google's PowerDrill experiments, these specialized encodings speed up common aggregation queries by 2-3 orders of magnitude.

Predicate pushdown: the original Hive reads all data into memory and only then checks which rows match the query. With ORCFile, data is read into memory one stripe at a time, and ORCFile's RecordReader uses the stripe metadata (index data, kept resident in memory) to decide whether the stripe can satisfy the query; if it cannot, the whole stripe is skipped, saving IO.
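A minimal sketch of enabling this stripe-level filtering from the Hive side (the query and table are illustrative; the two settings are the standard predicate-pushdown switches):

```sql
-- Push predicates down into the ORC reader so that stripes whose min/max metadata
-- cannot match the filter are skipped without being read.
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

SELECT COUNT(*)
FROM pageview_orc
WHERE view_time >= '2013-09-01 00:00:00';
```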

For ORCFile's compression ratio, usage, and performance, see the Hortonworks blog:

http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

In the future ORCFile will also support lightweight indexes, namely the maximum and minimum values of every group of 10,000 rows in each column.

From the analysis of ORCFile above, I think you can already see the shadow of BrightHouse: keep each column's indexes, statistics, dictionaries, and so on in memory to filter against the query conditions, and skip reading any block that cannot match, saving a lot of IO. See the analysis of BrightHouse below.

4. Security and concurrency features of HiveServer2

http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

HiveServer2 supports concurrent client (JDBC/ODBC) access.
Cloudera is also building Sentry, a security and authorization management layer for the Hadoop ecosystem.
These two features are the main concerns of enterprises adopting Hadoop/Hive.

5. HCatalog, Hadoop's unified metadata management platform
There is no consistency guarantee between the table schema stored in Hive's metadata and the table data stored on HDFS; the administrator has to keep them in sync. Currently a column change in Hive only modifies the Hive metadata, not the actual data. For example, to add a column you simply modify the Hive metadata through the Hive command line (see the sketch below) without changing the file format stored on HDFS; you then also have to modify the programs that import data into HDFS so that they produce the new format, and restart the Hive parsing service, which exhausts the system administrator.
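For example (table and column names made up), the metadata-only change described above is just one statement; the files already on HDFS are untouched, so old rows simply return NULL for the new column until the upstream loaders are updated as well:

```sql
-- Metadata-only change: Hive records the new column, but existing files on HDFS
-- keep their old layout, so old rows read NULL for user_agent.
ALTER TABLE access_log ADD COLUMNS (user_agent STRING);
```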

    • Hadoop currently treats tables as 'schema on read'; with HCatalog you can do EDW-style 'schema on write'.
    • HCatalog provides a REST interface for the metadata service, which makes it easy to share different kinds of data (unstructured/semi-structured/structured) across different platforms (HDFS/HBase/Oracle/MySQL), and to combine Hadoop with an EDW.
    • HCatalog decouples schema and storage format from the user. For example, when writing an MR task today, all rows are handled as text and the programmer has to parse out each column bit by bit. With HCatalog the programmer no longer has to care: just tell it which database and table, and the schema can be obtained by querying HCatalog. It also prevents changes to the storage format from breaking existing programs.

6. Vectorized query execution in Hive

https://issues.apache.org/jira/browse/HIVE-4160

    • Hive currently processes data one row at a time and uses lazy deserialization to parse out the Java object for each column, which obviously has a significant impact on efficiency.
    • Vectorized execution reads and processes many rows at once (for basic comparisons or numeric computation), reducing the excessive function calls of row-at-a-time processing and improving CPU utilization and cache hit rate (enabled as shown below).
    • Vectorized versions of the basic operators are implemented: scan, filter, scalar aggregate, group-by aggregate, and hash join.
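A minimal sketch of turning it on (the setting comes from the HIVE-4160 work; vectorization requires the data to be stored as ORC, and the query is illustrative):

```sql
-- Process rows in batches (vectors) instead of one object at a time.
SET hive.vectorized.execution.enabled = true;

SELECT country, COUNT(*)
FROM pageview_orc
WHERE view_time >= '2013-09-01 00:00:00'
GROUP BY country;
```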

Tez/Stinger

    • The underlying execution engine no longer uses MR, but a more general DAG execution engine built on YARN.
    • MR abstracts everything into two operations, map and reduce, while Tez provides a richer interface on top of them. Map is broken down into input, processor, sort, merge, and output, and reduce into input, shuffle, sort, merge, processor, and output. In an MR program the programmer only writes the processor logic and the rest is fixed by the framework, whereas Tez gives us much more freedom. In this respect it is a bit like Spark, which also exposes richer, more composable operating units to the user.
    • A traditional reduce can only write its output to HDFS, whereas a Tez reduce processor can feed its output directly to the next reduce processor as input.
    • Hot tables are also cached in memory.
    • Tez Service: pre-started containers and container reuse cut the task startup time after each query's execution plan is generated, improving interactivity.
    • Tez itself is just a library on the YARN framework and needs no separate deployment; simply specify mapreduce.framework.name=yarn-tez (see the sketch below).
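A minimal sketch, assuming a Hive session in which the Hadoop job configuration can be overridden with SET (later Hive releases expose hive.execution.engine=tez for the same purpose):

```sql
-- Run the jobs generated by Hive on the Tez DAG engine instead of classic MapReduce.
SET mapreduce.framework.name = yarn-tez;

SELECT country, COUNT(*) FROM pageview_orc GROUP BY country;
```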

http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/

Future directions:
Cost-based optimizer: choose the execution strategy based on statistics, e.g. in what order to join multiple tables most efficiently.
Collect the row/column counts of each intermediate table during execution to decide how many MR jobs to launch.

Impala
Impala can be seen as a hybrid of the Google Dremel architecture and an MPP (Massively Parallel Processing) architecture.

Https://github.com/cloudera/impala

Dremel paper: http://research.google.com/pubs/pub36632.html

Advantages:

    • Two join strategies are currently supported: broadcast join and partitioned join (see the sketch after this list). For large-table joins that run into memory limits, part of the data is dumped to disk, so the join is slower.
    • The Parquet columnar storage format, which can also handle nested data. By nesting the data and extending the SQL query semantics, joins can be avoided in some specific scenarios, removing some performance bottlenecks.
    • Cloudera Manager 4.6 will include slow-query analysis.
    • Runtime code generation http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/
    • Impala can use data on local disks directly, without going through HDFS.
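For the first bullet, Impala also lets the user force one join strategy or the other with a hint when the planner's default choice is wrong; a sketch with made-up tables:

```sql
-- Broadcast join: the small dimension table is sent in full to every node.
SELECT f.page_id, d.country
FROM fact_pageview f JOIN [BROADCAST] dim_geo d ON f.geo_id = d.geo_id;

-- Partitioned (shuffle) join: both inputs are hash-partitioned on the join key,
-- used when neither side is small enough to broadcast.
SELECT a.user_id
FROM fact_pageview a JOIN [SHUFFLE] fact_click b ON a.user_id = b.user_id;
```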

Disadvantages:

    • Impala does not sort the output of a GROUP BY by the grouping columns.
    • UDFs are not yet supported; Impala 1.2 is about to support Hive UDFs (written in Java) as well as Impala native UDFs and UDAs (with an interface similar to PostgreSQL's).
    • Serializers/deserializers like Hive's are not supported, which makes ETL from unstructured to structured data cumbersome.
    • Intra-query fault tolerance is not supported: if a node participating in a query fails, Impala abandons the whole query.
    • Security support is still relatively weak. Data transmitted between impalad processes is not encrypted, and table- or column-level authorization is not supported.
    • Each plan fragment is executed with as much parallelism as possible, but that is not always easy; for example, a hash join has to wait until one of the tables has been fully scanned before it can start.

Despite all these shortcomings, many companies have started trying Impala. Take Baidu as an example: Baidu tried plugging its own storage back ends into Impala, mapping their operations onto plan fragments; a user query is parsed into plan fragments in the usual way and then dispatched directly to the corresponding nodes (HDFS DataNode / HBase RegionServer / MySQL). Some source data and intermediate data are placed in MySQL, and queries that involve that data go straight to MySQL.

Shark/Spark
Since data is kept in memory whenever possible, memory usage is very aggressive. The advantage is that joins are faster; the disadvantages are that memory consumption is too high and, because memory is self-managed, it is not released promptly.
UDFs are supported.
Performance:
For particularly simple select ... where queries, Shark shows no significant improvement, because Hive does not spend much time on them anyway.
But for more complex select ... join ... where ... group by queries, Hive generates more jobs and reads and writes HDFS more times, so it naturally takes longer. Shark performs best when memory is large enough; if memory cannot hold all the data, performance degrades, but it is still much better than Hive.

What SQL on Hadoop products need to learn from traditional data warehouses
Take the open source data warehouse BrightHouse (a MySQL-based data warehouse storage engine) as an example.
VLDB 2008 paper: "Brighthouse: An Analytic Data Warehouse for Ad-hoc Queries"

BrightHouse reuses MySQL's code for SQL parsing and adds its own optimizer, executor, and storage engine.
BrightHouse organizes data storage in three tiers: the Data Pack, the Data Pack Node, and the Knowledge Node.

    • DP (Data Pack): BrightHouse is a column store; each DP stores the values of 64K cells of one column.
    • DPN (Data Pack Node): DPNs and DPs have a one-to-one relationship; each DPN records some statistics of its DP's data (max, min, count, sum).
    • KN (Knowledge Node): more detailed information about the DPs and about relationships between DPs.

The KN is divided into three parts:

    • HISTs (Histograms): statistical histograms of numeric columns, used to quickly determine whether a DP can match the query conditions.
    • CMAPs (Character Maps): bitmaps over text columns, used to quickly look up characters (optimizing LIKE with keywords).
    • Pack-to-Pack: bitmaps of the relationships between the DPs of two columns, generated when an equi-join is executed.

DPNs and KNs amount to statistics over the DPs and take up only about 1% of the total DP storage, so they easily fit in memory. They are designed to quickly determine which DPs are relevant to a query, which are irrelevant, and which are merely suspect. This reduces the amount of data read from disk and improves performance.
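As an illustration (table, column, and numbers are made up), consider the query below over a column split into 64K-row DPs. A DP whose DPN says max(amount) = 90 is irrelevant and is never read; a DP with min(amount) = 150 is fully relevant and can be answered from the DPN's count and sum alone; only a DP whose [min, max] range straddles 100 is suspect and has to be decompressed and scanned:

```sql
-- Only "suspect" packs need to be read from disk; irrelevant packs are skipped and
-- fully relevant packs are answered directly from the DPN statistics.
SELECT COUNT(*), SUM(amount)
FROM sales
WHERE amount > 100;
```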

Performance test: http://www.fuchaoqun.com/tag/brighthouse/
From this performance test you can see:
1. Compression ratio: Infobright compresses much better than MyISAM or tar.gz.
2. Query performance: queries run 3-6x faster than on an indexed MyISAM table.

In short, what all of these systems still lack is:
1. Workload management and query optimization.
How should a multi-table join be executed? For example, a 3-table join has 6 possible join orders; which is the most efficient? Clearly the cost of each execution order has to be estimated. Traditional databases and data warehouses (Oracle/Teradata/PostgreSQL) have very good query optimizers, but how these metrics (disk IO, network bandwidth, memory) relate to final query efficiency in a distributed system is something that still needs serious study.
2. Correlated subqueries, which no one has implemented yet.
There are many examples of correlated subqueries in TPC-H, but current SQL on Hadoop products do not support them. People from Impala say their customers' demand for this is not very strong, and most correlated subqueries can be translated into join operations (see the sketch below). But commercial products such as Hawq/Greenplum do support correlated subqueries.
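A sketch of what this means (schema made up): the correlated form below is rejected by most SQL on Hadoop engines today, but it can usually be rewritten as the aggregation-plus-join form, which they do execute:

```sql
-- Correlated subquery (TPC-H style): typically not supported yet.
SELECT o.order_id
FROM orders o
WHERE o.amount > (SELECT AVG(i.amount)
                  FROM orders i
                  WHERE i.customer_id = o.customer_id);

-- Equivalent rewrite as an aggregation plus a join, which current engines can run.
SELECT o.order_id
FROM orders o
JOIN (SELECT customer_id, AVG(amount) AS avg_amount
      FROM orders
      GROUP BY customer_id) a
  ON o.customer_id = a.customer_id
WHERE o.amount > a.avg_amount;
```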
