Reprinted from: http://sunyi514.github.io/2014/11/15/%E7%9B%98%E7%82%B9sql-on-hadoop%E4%B8%AD%E7%94%A8%E5%88%B0%E7%9A%84%E4%B8%BB%E8%A6%81%E6%8A%80%E6%9C%AF/
Since the advent of Hive, SQL on Hadoop systems have blossomed over several years of development, becoming faster and more feature-complete. This article is not another comparison of "which interactive query engine is strongest"; instead, it tries to establish a unified perspective and look at the technical similarities among the various systems.
Given their breadth of adoption and maturity, Hive and Impala are used as the main examples, and other systems such as Spark SQL, Presto, and Tajo were also examined along the way. Commercial products such as HAWQ, and not-yet-mature open-source solutions such as Apache Drill, are not covered in depth.
System architecture
Runtime framework vs. MPP
In the SQL on Hadoop world there are two broad architectures. One builds the query engine on top of a runtime framework; the typical example is Hive. The other is the MPP architecture modeled on traditional relational databases. The former takes an existing runtime framework and layers SQL on top of it, while the latter builds an integrated query engine from scratch. We sometimes hear the claim that the latter architecture is better than the former, at least in performance. Is that really true?
In general, the most important evaluation criterion for a SQL on Hadoop system is: speed. Almost everything discussed later exists to make queries faster. As Hive became widely adopted, the demand for so-called interactive queries kept growing, because neither BI systems nor ad hoc analysis can work at the pace of offline batch jobs. That is when the big players (such as Facebook) and the professional vendors (such as Cloudera) set out to solve the problem. In the short term it can be worked around at the application layer or with a relational database, but the long-term answer was to build dedicated systems modeled on the MPP databases of the past, which is how Impala, Presto, and others came about. From the point of view of task execution, these engines run tasks in a DAG-like model; there is also Spark, a general-purpose DAG computation framework, but that is someone else's child, and one path is simply to run SQL and Hive on Spark. When Impala arrived it emphasized that "computation is entirely in memory", and its performance indeed crushed the Hive of that era, which had only MR as its computation model. So is the "built on an existing computation framework" approach represented by Hive really hopeless?
Admittedly, the MPP-style model does have many advantages:
- DAG vs. MR: the main advantage is that intermediate results are not written to disk (unless memory runs out), so a query runs through in one pass.
- Pipelined computation: results from an upstream stage are immediately pushed or pulled to the next stage. For example, in a multi-table join, rows produced by joining the first two tables are fed straight into the join with the third table, unlike MR, which must wait for the two-table join to finish before joining with the third table (a minimal pull-based sketch of this idea follows the list).
- Efficient IO: local reads with no unnecessary overhead, making full use of the disks. More on this later.
- Thread-level concurrency: in contrast to MR, where every task starts its own JVM, incurring significant startup latency and resource overhead.
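To make the pipelining point concrete, here is a minimal pull-based (iterator-style) sketch in Java. The interfaces and class names are hypothetical and not taken from any of the systems above; the point is only that rows stream through the whole chain without the intermediate result ever being materialized to disk.

    import java.util.Iterator;
    import java.util.List;

    // Hypothetical pull-based operator chain: each operator asks its child for
    // the next row, so rows stream through the plan without materialization.
    interface Operator {
        Object[] next();                      // returns null when exhausted
    }

    class Scan implements Operator {
        private final Iterator<Object[]> rows;
        Scan(List<Object[]> rows) { this.rows = rows.iterator(); }
        public Object[] next() { return rows.hasNext() ? rows.next() : null; }
    }

    class Filter implements Operator {
        private final Operator child;
        private final java.util.function.Predicate<Object[]> pred;
        Filter(Operator child, java.util.function.Predicate<Object[]> pred) {
            this.child = child;
            this.pred = pred;
        }
        public Object[] next() {
            Object[] row;
            while ((row = child.next()) != null) {
                if (pred.test(row)) return row;   // forward matching rows immediately
            }
            return null;
        }
    }

    public class PipelineDemo {
        public static void main(String[] args) {
            List<Object[]> data = List.of(new Object[]{"a", 1}, new Object[]{"b", 5});
            Operator plan = new Filter(new Scan(data), r -> (int) r[1] > 2);
            for (Object[] row = plan.next(); row != null; row = plan.next()) {
                System.out.println(row[0]);       // prints "b"
            }
        }
    }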
Of course, the MPP model has its drawbacks too. One is limited scalability, a lesson already learned in the relational-database era; the other is poor fault tolerance: in Impala, for example, if any participating process fails during execution, the whole query fails.
However, after continuous development Hive can now also run on DAG frameworks, not only Tez but also Spark. Some of the disadvantages listed above can in fact be addressed inside a general computation framework as well; they are simply not catered for specifically, because the framework has to stay general and has its own design goals (from this angle, Impala is a "dedicated system" while Spark is a "general-purpose system"). In the recent Cloudera benchmark, Impala still leads, but Spark-based SQL is not inferior to Presto, and Tez-based Hive does not look bad either; under concurrency it can even beat Presto. Clearly the MPP model does not hold an absolute advantage. So in my view this architectural difference is not the key to winning, or at least not the only factor; to make queries truly fast, the details at every level have to be done well. Those details are what the rest of this article covers.
Core components
Regardless of which architecture is used, a SQL on Hadoop system generally has a number of common core components. Depending on the designer's choices, the components are assigned to different node roles; physically the nodes are divided into master and workers, and if the master is under too much pressure, some components that would otherwise live on the master can be moved to a secondary master.
- The UI layer provides the interface through which users submit queries. There are generally three kinds: web/GUI, command line, and programmatic APIs.
- The QL layer parses a user-submitted query into an executable plan (for example, MR jobs). This part is covered in detail in later sections.
- The execution layer runs the actual jobs. Typically a master is responsible for query management, such as requesting resources and tracking progress, and for merging local results into the global result, while each node runs a corresponding worker that performs the local computation.
- The IO layer provides the interface to the storage layer. For HDFS, an InputFormat turns files into key/value pairs and a SerDe then maps the key/value pairs to data rows; non-HDFS storage requires special handlers/connectors (a toy sketch of this path follows the list).
- The storage layer is usually HDFS, but NoSQL stores and relational databases can also be queried.
- The system also needs a metadata management service to manage table schemas and related information.
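As a rough illustration of the IO-layer path (file → key/value → row), here is a toy sketch. The interfaces are invented for illustration and deliberately simplified; Hive's real InputFormat/SerDe interfaces are richer than this.

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    // Toy stand-ins for the real interfaces: a "record reader" yields raw
    // records from a file split, and a "SerDe" turns a record into columns.
    interface ToyRecordReader {
        boolean next();       // advance to the next record
        String value();       // the raw record (e.g., one text line)
    }

    interface ToySerDe {
        List<String> deserialize(String raw);   // raw record -> columns
    }

    class LineReader implements ToyRecordReader {
        private final Iterator<String> lines;
        private String current;
        LineReader(List<String> lines) { this.lines = lines.iterator(); }
        public boolean next() {
            if (!lines.hasNext()) return false;
            current = lines.next();
            return true;
        }
        public String value() { return current; }
    }

    class CsvSerDe implements ToySerDe {
        public List<String> deserialize(String raw) {
            return Arrays.asList(raw.split(","));   // naive CSV split, for illustration only
        }
    }

    public class IoLayerDemo {
        public static void main(String[] args) {
            ToyRecordReader reader = new LineReader(List.of("1,alice", "2,bob"));
            ToySerDe serde = new CsvSerDe();
            while (reader.next()) {
                System.out.println(serde.deserialize(reader.value()));  // [1, alice] then [2, bob]
            }
        }
    }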
Execution Plan compilation process
Going from SQL text to an execution plan takes roughly five steps.
- The first step converts the SQL text into an abstract syntax tree (AST). This is usually done with a third-party parser library such as ANTLR.
- The second step performs semantic analysis on the AST: do the tables exist, do the fields exist, is the SQL semantically valid (for example, a non-aggregated field in the SELECT list that does not appear in GROUP BY)?
- The third step generates a logical execution plan, which is a DAG of logical operators. In Hive, for example, a table scan produces a TableScanOperator and an aggregation produces a GroupByOperator. For MPP-style systems the situation is slightly different: the operator types are much the same, but the plan is first built as a single-node version and then expanded into a multi-node version. The multi-node version mainly parallelizes operations such as aggregate, join, and top-N; for instance, an aggregate is split, much as in MR, into local aggregate, shuffle, and global aggregate.
- The fourth step optimizes the logical execution plan; this step is described separately below.
- The fifth step translates the logical plan into a physical plan that can actually run on the cluster. For Hive this is an MR/Tez job; for Impala it is a plan fragment; other MPP-style systems have similar concepts. A unit of computation (or job) in the physical plan consists of the three elements "input, processing, output", whereas the operators in the logical plan are finer-grained: a logical operator usually plays the role of just one of those three elements.
Here are two examples to give an intuitive feel for the relationship between SQL, the logical plan, and the physical plan. Explaining every operator in detail would take too long, so that is not expanded here; a toy Java sketch of a logical plan follows the examples.
Hive on MR:
SELECT COUNT(1) FROM ... WHERE ... = '2009-08-01'
Presto (quoted from the Meituan tech team's article; a SubPlan here is a unit of computation in the physical plan):
SELECT COUNT(*) FROM ... JOIN ... ON ... = ... WHERE ... > 10 GROUP BY ...;
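To make the logical-operator idea concrete, here is a toy sketch of what a logical plan for a simple filtered count might look like. The class names are invented for illustration and are not Hive's actual operator classes.

    import java.util.List;

    // A toy logical plan: a DAG (here a simple chain) of logical operators for
    // something like "SELECT COUNT(1) FROM t WHERE ds = '2009-08-01'".
    abstract class LogicalOp {
        final List<LogicalOp> children;
        LogicalOp(List<LogicalOp> children) { this.children = children; }
        abstract String describe();
        void print(String indent) {
            System.out.println(indent + describe());
            for (LogicalOp c : children) c.print(indent + "  ");
        }
    }

    class TableScanOp extends LogicalOp {
        final String table;
        TableScanOp(String table) { super(List.of()); this.table = table; }
        String describe() { return "TableScan(" + table + ")"; }
    }

    class FilterOp extends LogicalOp {
        final String predicate;
        FilterOp(LogicalOp child, String predicate) { super(List.of(child)); this.predicate = predicate; }
        String describe() { return "Filter(" + predicate + ")"; }
    }

    class GroupByOp extends LogicalOp {
        final String aggregate;
        GroupByOp(LogicalOp child, String aggregate) { super(List.of(child)); this.aggregate = aggregate; }
        String describe() { return "GroupBy(" + aggregate + ")"; }
    }

    public class LogicalPlanDemo {
        public static void main(String[] args) {
            LogicalOp plan = new GroupByOp(
                    new FilterOp(new TableScanOp("t"), "ds = '2009-08-01'"),
                    "count(1)");
            plan.print("");   // GroupBy(count(1)) -> Filter(...) -> TableScan(t)
        }
    }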
Optimizer
Optimizing the execution plan, while not necessarily the hardest part of the whole compilation process, is the most fascinating part and is still evolving. A big part of why the Spark project abandoned Shark in favor of Spark SQL was the optimization strategy: to escape the constraints imposed by Hive, the optimizer was even split out into a separate component, Catalyst (Spark SQL is of course still very new, and its future leaves plenty of room for imagination). In short, there is a lot of room for innovation here: the smarter the optimizer, the more foolproof the system, and the more users are freed to focus on their business problems.
Early Hive had only a handful of simple rule-based optimizations, such as predicate pushdown (perform filters as close to the table scan as possible) and operator merging (consecutive filters are combined with AND into one operator, and consecutive projections can likewise be merged). Slightly more complex rules were added later, such as merging a join and a GROUP BY on the same key into a single MR job, and star-schema joins. The correlation optimizer introduced in Hive 0.12 is a high point of rule-based optimization: it reduces repeated scanning of the same data. Specifically, if two parts of a query use the same data source and their GROUP BY/JOIN keys are the same, then, because the data source and the shuffle key are identical, what previously needed two separate jobs can be handled in a single job.
For example, the following SQL:
    SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM (SELECT l_partkey, l_quantity, l_extendedprice
          FROM lineitem JOIN part ON (p_partkey = l_partkey)
          WHERE p_brand = 'brand#35' AND p_container = 'MED PKG') touter
    JOIN (SELECT l_partkey AS lp, 0.2 * avg(l_quantity) AS lq
          FROM lineitem GROUP BY l_partkey) tinner
    ON (touter.l_partkey = tinner.lp)
    WHERE touter.l_quantity < tinner.lq
This query references the lineitem table twice, and the GROUP BY key and both join keys are all l_partkey. Previously the two subqueries plus the join required three jobs; now a single job suffices.
However, rule-based optimization (RBO) cannot solve every problem. Relational databases have another approach: cost-based optimization (CBO). By collecting statistics about the tables, such as column cardinalities and histograms of the data distribution, the CBO can answer questions the rules cannot, the most important of which is choosing the join order for multi-table joins. The CBO searches the space of possible join orders (with a depth-limited greedy algorithm when there are too many tables), computes the cost of each, and picks the best. All of this has long been standard practice in relational databases.
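As a minimal sketch of the cost-based idea, the toy code below greedily picks a join order by always joining in the table that keeps the estimated intermediate result smallest. The cardinalities and the single-constant selectivity are made up for illustration; a real CBO uses much richer statistics (histograms, selectivities, distinct counts) and a more careful search.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Toy cost-based join ordering: greedily add the table that minimizes the
    // estimated size of the intermediate result. Selectivity is one made-up
    // constant here; a real optimizer derives it from column statistics.
    public class GreedyJoinOrder {
        static final double JOIN_SELECTIVITY = 0.01;   // assumed, for illustration

        public static void main(String[] args) {
            Map<String, Double> cardinality = new LinkedHashMap<>();
            cardinality.put("lineitem", 6_000_000d);
            cardinality.put("orders",   1_500_000d);
            cardinality.put("part",       200_000d);

            List<String> remaining = new ArrayList<>(cardinality.keySet());
            List<String> order = new ArrayList<>();

            // Start from the smallest table, then repeatedly pick the table whose
            // join produces the smallest estimated intermediate result.
            String first = remaining.stream().min((a, b) ->
                    Double.compare(cardinality.get(a), cardinality.get(b))).get();
            remaining.remove(first);
            order.add(first);
            double currentRows = cardinality.get(first);

            while (!remaining.isEmpty()) {
                String best = null;
                double bestRows = Double.MAX_VALUE;
                for (String t : remaining) {
                    double est = currentRows * cardinality.get(t) * JOIN_SELECTIVITY;
                    if (est < bestRows) { bestRows = est; best = t; }
                }
                remaining.remove(best);
                order.add(best);
                currentRows = bestRows;
            }
            System.out.println("join order: " + order + ", estimated rows: " + currentRows);
        }
    }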
Hive has now started a dedicated effort to do exactly this, based on Apache Optiq, while most other systems have not yet done serious CBO work, so there is a lot of room for improvement in this area.
Execution efficiency
Even with an efficient execution plan, if the execution process itself is inefficient, the good plan goes to waste. The focus here is on CPU and IO efficiency during execution.
CPU
If the CPU side of execution is inefficient, the system bottleneck falls on the CPU and the IO capacity goes underused. In one comparison between Impala and Hive, Hive was found to be slower than Impala on some simple queries (TPC-H query 1) mainly because Hive spent the whole run CPU bound, with disk IO utilization at only 20%, while Impala's IO utilization was at least 85%.
The main reasons SQL on Hadoop systems end up CPU bound are the following:
- A large number of virtual function calls: this problem shows up in several places. For example, to evaluate the expression a + 2 * b, an interpreter builds an expression tree and evaluates it by recursively calling the child nodes. Likewise, when operators/tasks execute as a DAG, a node must call its neighboring node to fetch the data it produced. All of this generates an enormous number of calls.
- Type boxing: because the expression interpreter has to handle variables of different data types, primitive values have to be wrapped into Java objects, which also wastes a lot of resources. This is a side effect of the previous problem.
- Branch instructions: modern CPUs pipeline instructions and execute them in parallel, but conditional branches break that parallelism. Such branches appear when the data type (string vs. int) has to be checked per value, or when a column value does not need to be read because a filter on another column already excluded the row (in columnar storage).
- Cache misses: processing data one row at a time leads to a low CPU cache hit rate. (The solutions below are already hinted at here.)
Most systems today have adopted at least one of the following two solutions to these problems.
One approach is dynamic code generation, i.e. not using one-size-fits-all interpreted code. For the expression a + 2 * b, for example, code is generated in the target execution language that works directly on primitive types, instead of going through generic interpreted code. Concretely, JVM-based systems such as Spark SQL and Presto can use reflection/bytecode generation, while the C++ system Impala uses LLVM to generate intermediate code. For branches that merely check the data type, generated code can eliminate the type checks entirely, and loops can be unrolled as well. (The original post shows the interpreted code and the generated code side by side; a sketch of the same contrast follows.)
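The sketch below illustrates the contrast under stated assumptions: a generic expression-tree interpreter full of virtual calls and boxing versus a hand-written stand-in for what generated code would look like for a + 2 * b. Real systems emit bytecode or LLVM IR at runtime; here the "generated" version is simply written out by hand to show why it is cheaper.

    // Interpreted evaluation: every node is a virtual call and values are boxed.
    interface Expr {
        Object eval(Object[] row);
    }

    class ColRef implements Expr {
        private final int idx;
        ColRef(int idx) { this.idx = idx; }
        public Object eval(Object[] row) { return row[idx]; }
    }

    class Const implements Expr {
        private final Object value;
        Const(Object value) { this.value = value; }
        public Object eval(Object[] row) { return value; }
    }

    class Add implements Expr {
        private final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public Object eval(Object[] row) {
            return ((Number) left.eval(row)).longValue() + ((Number) right.eval(row)).longValue();
        }
    }

    class Mul implements Expr {
        private final Expr left, right;
        Mul(Expr left, Expr right) { this.left = left; this.right = right; }
        public Object eval(Object[] row) {
            return ((Number) left.eval(row)).longValue() * ((Number) right.eval(row)).longValue();
        }
    }

    public class CodegenDemo {
        // What generated code for "a + 2 * b" would boil down to:
        // no virtual calls, no boxing, works directly on primitives.
        static long generatedEval(long a, long b) {
            return a + 2 * b;
        }

        public static void main(String[] args) {
            Expr tree = new Add(new ColRef(0), new Mul(new Const(2L), new ColRef(1)));
            Object[] row = {3L, 4L};
            System.out.println(tree.eval(row));          // interpreted: 11
            System.out.println(generatedEval(3L, 4L));   // "generated": 11
        }
    }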
The other approach is vectorization: the basic idea is to abandon row-at-a-time processing and instead process a small batch of data at a time (say 1k rows); the precondition is, of course, a columnar storage format. This way a small, contiguous batch of data fits in the cache, branch instructions are reduced, and SIMD can even be used to speed processing up. The implementation looks roughly like the code below, which adds two columns element by element. Data is represented as arrays, and the result of earlier filters is carried along as a selection vector (selVec), yielding a very tight loop:
    void add(int vecNum, long[] result, long[] col1, int[] col2, int[] selVec) {
        if (selVec == null) {
            for (int i = 0; i < vecNum; i++)
                result[i] = col1[i] + col2[i];
        } else {
            for (int i = 0; i < vecNum; i++) {
                int selIdx = selVec[i];
                result[selIdx] = col1[selIdx] + col2[selIdx];
            }
        }
    }
IO
Because SQL on Hadoop systems store their data in HDFS, IO-layer optimization is largely an HDFS matter, with the major query engines pushing their requirements upstream. Efficient IO means, on one hand, low latency with unnecessary overhead stripped away, and on the other hand high throughput that makes full use of every disk. The relevant features at the moment are:
- Short-circuit local reads: when the data to be read is local, bypass the DataNode (which would require a socket connection) and let the DFS client read the local block replica directly. The relevant HDFS parameters are dfs.client.read.shortcircuit and dfs.domain.socket.path.
- Zero copy: avoids repeatedly copying data between kernel buffers and user buffers; this has been available in HDFS for some time.
- Disk-aware scheduling: by knowing which disk each block lives on, CPU resources can be scheduled so that different CPUs read different disks, avoiding IO contention between and within queries. The relevant HDFS parameter is dfs.datanode.hdfs-blocks-metadata.enabled. (A small configuration sketch follows this list.)
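For reference, the parameters mentioned above are ordinary HDFS settings; a client-side sketch using Hadoop's Configuration API might look like the following (the socket path is an example value, and in practice these are normally set in hdfs-site.xml on both clients and DataNodes).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: enabling short-circuit local reads from the client side.
    // The domain socket path is an example; it must match the DataNode's setting.
    public class ShortCircuitReadDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.getFileStatus(new Path("/")).isDirectory());  // simple sanity check
        }
    }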
Storage format
For analytical workloads, the best storage format is naturally columnar, something already proven in the relational-database era. There are currently two major columnar formats in the Hadoop ecosystem: ORCFile, developed by Hortonworks and Microsoft, and Parquet, developed by Cloudera and Twitter.
ORCFile, as the name implies, is an evolution of RCFile. RCFile, although nominally columnar, is really only "storage by column": it splits the data into row groups and then lays out each row group column by column. It lacks the key features of true columnar storage, features that earlier column-oriented databases (such as Infobright, which I used before) already had. Fortunately ORCFile adds them, including:
- Block filtering and block statistics: each column is further split into units of a fixed number of rows or a fixed size, and for each unit the min/max/sum/count/null-count statistics are precomputed. The min/max values let a unit be skipped entirely during filtering, and all of the statistics can be used directly in aggregations without decoding the data unit at all.
- More efficient encoding: RCFile does not record the type of each column, yet once the data type is known, type-specific encodings can compress the data substantially. Common encodings in columnar storage are RLE (for data with many repeats), dictionary (for strings), bitmap (for numeric columns with low cardinality), and delta (for sorted data, such as access timestamps in a log). A toy RLE sketch follows this list.
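As a minimal sketch of one of these encodings, here is run-length encoding (RLE) over an int column. Real formats combine RLE with bit-packing and apply it per data unit; this is only meant to show why repetitive columns compress so well.

    import java.util.ArrayList;
    import java.util.List;

    // Toy run-length encoding for an int column: store (value, runLength) pairs.
    public class RleDemo {
        record Run(int value, int length) {}

        static List<Run> encode(int[] column) {
            List<Run> runs = new ArrayList<>();
            int i = 0;
            while (i < column.length) {
                int v = column[i];
                int len = 1;
                while (i + len < column.length && column[i + len] == v) len++;
                runs.add(new Run(v, len));
                i += len;
            }
            return runs;
        }

        public static void main(String[] args) {
            int[] col = {7, 7, 7, 7, 3, 3, 9, 9, 9};
            System.out.println(encode(col));   // [Run[value=7, length=4], Run[value=3, length=2], Run[value=9, length=3]]
        }
    }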
The structure of an ORCFile is as follows: the data is divided into row groups, 256MB by default, called stripes. Each stripe carries an index holding the min/max values of each data unit (10,000 rows by default) for filtering, and the data itself is serialized into streams using the encodings described above and then compressed with Snappy or gzip. The stripe footer records the locations of the streams plus further statistics such as sum/count. The file footer and postscript at the end of the file provide global information, such as the number of rows in each stripe, the column data types, and the compression parameters.
Parquet's design is similar to ORC's, but it has two distinguishing features:
- Generality: whereas ORCFile was built specifically for Hive, Parquet is not tied to Impala; it can also be used from other query tools such as Hive and Pig, and it further integrates with serialization formats such as Avro/Thrift/Protocol Buffers.
- Nested storage based on the Dremel ideas: relational design doctrine opposes storing complex, nested structures (it violates first normal form), but today's big-data workloads not only demand it (semi-structured data), they can also store and query it efficiently, and the query languages support it (various UDFs, Hive's lateral view, and so on). Google's Dremel set the example at the implementation level, and Parquet models its nested storage entirely on Dremel.
The difficulty of storing a nested format column by column is that the data must be clearly annotated: which position in the nested structure does a value belong to, and which record does it belong to? Dremel introduces two annotations for this, the repetition level (R) and the definition level (D). The repetition level of a value says at which level of the nesting, relative to the previous value, the repetition occurs. Take a two-level nested array as an example (the original post shows a diagram): e and f sit in the same innermost (level-2) list, so the R value of f is 2; c and d sit in different level-2 lists but in the same level-1 list, so the R value of d is 1; and a value that starts a completely new record (a new nested structure) has an R value of 0.
But R alone is not enough: it explains repetition, but not why a definition level is needed, since on the face of it every field's D value could be read straight off the schema, so why record it per value? The reason is that NULL placeholders are inserted into the stored columns: they stand for fields that "could exist" but have no value because they are repeated or optional, and the D value of such a placeholder has to be computed individually. The D value of a NULL indicates up to which level of the path (not counting fields at the NULL's own level) the record is actually defined rather than imaginary. The Dremel paper walks through a full example: the first NULL in the country column occurs where code = 'en'; language is defined there (code does not count, since it is at the same level as country), so the record is defined down to the second level. The second NULL in the country column occurs where url = 'http://B'; only name is defined there (url does not count, since it is at the same level as the missing language), so the record is defined only down to the first level.
In this way a tree-like nested structure can be stored column by column, and on read a record can be reassembled by driving a state machine while traversing the columns.
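Here is a small worked example of the repetition levels, written as comments under an assumed schema with a twice-repeated integer list (this mirrors the two-level array described above; the concrete records are chosen by me for illustration, not taken from the original post).

    // Assumed schema (Parquet-style):
    //   message Rec {
    //     repeated group level1 {
    //       repeated int32 level2;
    //     }
    //   }
    //
    // Record 1: [[a, b, c], [d]]
    // Record 2: [[e, f]]
    //
    // Flattened level2 column with repetition levels (R):
    //   value  R  meaning
    //   a      0  starts a new record (new top-level structure)
    //   b      2  repeats at level 2 (same inner list as a)
    //   c      2  repeats at level 2 (same inner list as b)
    //   d      1  repeats at level 1 (new inner list, same record)
    //   e      0  starts a new record
    //   f      2  repeats at level 2 (same inner list as e)
    //
    // Because every value here is actually present and both fields on its path
    // are repeated, each value's definition level (D) is 2; NULL placeholders
    // would carry smaller D values, as explained above.
    public class RepetitionLevelExample {
        public static void main(String[] args) {
            int[] repetitionLevels = {0, 2, 2, 1, 0, 2};   // one entry per value a..f
            System.out.println(java.util.Arrays.toString(repetitionLevels));
        }
    }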
Interestingly, although Parquet supports nested formats, Impala has not yet added complex types such as array, map, and struct the way Hive has; this is of course on the roadmap and should arrive before long.
In our recent Impala 2.0 tests we also measured the impact of the storage format in passing: Parquet achieved roughly a 1:5 compression ratio compared with SequenceFile, and query performance differed by 5-10x as well, which shows how much columnar storage lifts a query engine.
Resource control
Run-time resource adjustment
The number of reduce tasks for an MR job has always been an annoyance to estimate by hand, and MR-based Hive estimates it only roughly from the size of the input data, regardless of the actual job logic. The newer frameworks take this into account and add the ability to adjust resource allocation at run time. Tez introduces the VertexManager, which decides how many reduce tasks are needed based on data statistics collected while the job runs. Tajo describes a similar feature, called progressive query optimization, and can adjust not only the number of tasks but even the join order.
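A minimal sketch of the run-time idea, under my own simplifying assumptions: instead of fixing the reducer count up front from the input size, decide it only after the upstream stage has reported how many bytes it actually produced.

    // Sketch: pick the number of reduce tasks from observed upstream output size,
    // rather than from a static estimate of the input size. The threshold value
    // is an arbitrary example, not a setting from Tez or Tajo.
    public class RuntimeReducerEstimate {
        static final long BYTES_PER_REDUCER = 256L * 1024 * 1024;   // assumed target: 256 MB per reducer

        static int chooseReducers(long observedIntermediateBytes, int maxReducers) {
            int n = (int) ((observedIntermediateBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
            return Math.max(1, Math.min(n, maxReducers));
        }

        public static void main(String[] args) {
            // Upstream tasks reported ~3.4 GB of shuffle output at run time.
            long observed = 3_400_000_000L;
            System.out.println(chooseReducers(observed, 999));   // 13
        }
    }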
Resource integration
Now that Hadoop has entered the 2.x era, any SQL on Hadoop system that wants broad adoption has to integrate with YARN. This is good for sensible resource usage, but the extra YARN layer does get in the way of performance: starting the ApplicationMaster and requesting application containers both take noticeable time (especially the former), and if containers are not supplied promptly, latency suffers badly. Tez and Impala address these issues as follows:
- For the ApplicationMaster startup delay, use a long-lived AM instead of one AM per job as in MR. In practice, an AM pool can be assigned to each queue configured in the Fair Scheduler or Capacity Scheduler, keeping a certain number of AMs ready to serve the jobs submitted to that queue.
- For the container supply problem, Tez uses container reuse, a bit like JVM reuse: a container is not released immediately after use but is held for a while in case a suitable task can take it over. This not only reduces the chance of waiting for containers but also lets results from the previous task be cached for the next task to reuse, for example in a map join. Impala is more aggressive: a query starts executing only once all of its containers have been allocated, which also keeps its pipelined execution from blocking.
Other
Up to this point we have gone top-down through the techniques used at each level. SQL on Hadoop is of course a complex topic touching many areas, and time and energy do not allow digging into everything. A few other technically interesting features, for example:
- Multi-source queries: Presto supports reading data from MySQL, Cassandra, and even Kafka, which greatly reduces data-integration time since the data does not have to be loaded into HDFS before it can be queried. Impala and Hive also support querying HBase, and Spark SQL started supporting external data sources in version 1.2. There is similar work in China as well, such as Miaozhen modifying Impala so that it can query Postgres.
- Approximate queries: count distinct (cardinality estimation) has always been a SQL performance killer; if some error is acceptable, approximate algorithms can be used. Impala has implemented an approximate distinct count (NDV), while Presto turned to a collaboration with BlinkDB for this; both use HyperLogLog counting. And count distinct is not the only candidate: other measures, such as medians, can also be approximated. A simplified HyperLogLog sketch follows.
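Here is a minimal, self-contained HyperLogLog sketch to show the principle: hash each value, use the first p bits to pick a register, and keep the maximum leading-zero count seen in the rest; the harmonic mean of the registers estimates the cardinality. It omits the small- and large-range corrections that production implementations (such as the one in Impala) apply.

    // Minimal HyperLogLog sketch (raw estimator only, no range corrections).
    public class HyperLogLogSketch {
        private final int p;            // number of index bits
        private final int m;            // number of registers = 2^p
        private final byte[] registers;

        HyperLogLogSketch(int p) {
            this.p = p;
            this.m = 1 << p;
            this.registers = new byte[m];
        }

        // SplitMix64-style mixer so consecutive longs spread over the hash space.
        private static long hash(long x) {
            long z = x + 0x9E3779B97F4A7C15L;
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return z ^ (z >>> 31);
        }

        void add(long value) {
            long h = hash(value);
            int idx = (int) (h >>> (64 - p));                       // first p bits pick the register
            int rank = Long.numberOfLeadingZeros(h << p) + 1;       // leading-zero count of the rest
            if (rank > registers[idx]) registers[idx] = (byte) rank;
        }

        double estimate() {
            double alpha = 0.7213 / (1 + 1.079 / m);                // bias-correction constant
            double sum = 0;
            for (byte r : registers) sum += Math.pow(2, -r);
            return alpha * m * m / sum;
        }

        public static void main(String[] args) {
            HyperLogLogSketch hll = new HyperLogLogSketch(14);      // 2^14 registers, ~0.8% error
            for (long i = 0; i < 1_000_000; i++) hll.add(i % 100_000);  // 100k distinct values
            System.out.printf("estimated distinct: %.0f%n", hll.estimate());
        }
    }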
Conclusion
Although there are now many systems in this space and they have had several years of development, they are all still improving, for example:
- Expanding support for analytic functions, complex data types, and a fuller SQL syntax.
- Refining techniques that are already in place; columnar storage, for instance, can still add more encoding schemes.
- In areas like the CBO, the open-source world is only just getting started, still far behind Orca, the optimizer in the commercial system HAWQ.
After all, compared with the relatively mature relational databases, the distributed environment has many more problems to solve, and there will be plenty of exciting technical work ahead that helps us find what we want in massive data faster and more conveniently.