A Peek Into the SparkSQL World


This article is published by NetEase Cloud.

Author: Van Hin Xin (This article is for internal sharing only; to reprint, please obtain the author's consent.)

Recently, big-data technologies and traditional database technology have been converging and borrowing heavily from each other. The strength of traditional databases lies in their battle-tested SQL optimizers; their weakness lies in distributed concerns such as high availability, fault tolerance, and scalability. Given time and some reworking, for example by introducing consensus protocols such as Paxos or Raft to strengthen their distributed capabilities, they should earn a place in big-data systems. Conversely, big-data technologies are inherently scalable, available, and fault tolerant, but their SQL optimizers largely draw on the experience of traditional databases, plus some optimization strategies specific to columnar storage.

This article introduces Catalyst, the optimizer of SparkSQL. Its design is largely rooted in traditional databases and is essentially the same as that of most current big-data SQL engines (Impala, Presto, Hive with Calcite, and so on). Reading this article therefore also gives you a basic understanding of how those other SQL engines work.

SQL optimizer execution strategies fall into two broad camps: rule-based optimization (RBO) and cost-based optimization (CBO). Rule-based optimization is an empirical, heuristic approach that relies on optimization rules accumulated by predecessors; it is simple and covers most optimization logic, but it struggles with the core operator: join. As a simple example, should two tables be joined with a BroadcastHashJoin or a SortMergeJoin? SparkSQL currently decides this with a manually set parameter: if a table is smaller than that threshold, BroadcastHashJoin is used (the parameter is illustrated below). This scheme is neither elegant nor flexible. Cost-based optimization addresses exactly this problem: for each join it estimates the cost of every candidate join strategy and picks the plan with the lowest cost.
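For reference, the parameter in question is SparkSQL's broadcast-join threshold. A minimal sketch of how it is set (the config key is real; the values are only examples, and an active SparkSession named `spark` is assumed):

```scala
// SparkSQL's manual knob for choosing BroadcastHashJoin: tables whose estimated
// size is below this threshold (in bytes) are broadcast to every executor.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024) // 10 MB

// Setting it to -1 disables broadcast joins, so the planner falls back to
// e.g. SortMergeJoin regardless of table size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
```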

This article focuses on the rule-based optimization strategy; subsequent articles will cover the cost-based strategy in detail.

Preliminaries: Tree & Rule

Before describing how the SQL optimizer works, we need to introduce two important data structures: Tree and Rule. Whatever SQL optimizer you are familiar with, you have surely come across the SQL syntax tree: the compiler parses a SQL statement into a tree structure. The tree contains a number of node objects, each with a specific data type and zero or more child nodes (in the code, a node is defined as a TreeNode object). Here is a simple example:

As shown, the expression on the left of the arrow, x + (1 + 2), involves three data types (Literal represents a constant, Attribute a variable, Add an operation). Mapped to the tree structure on the right, each of these becomes a node. In addition, a tree has a very important property: it can be transformed by rules, for example:

Define an equivalence transformation rule: the addition of two integer constants can be replaced by a single integer constant. Simple as it is, this rule turns the expression x + (1 + 2) above into x + 3. How does a program find two integer constants? It is just a simple binary-tree traversal: at each node, pattern-match for an Add whose left and right children are both integer-constant Literals; once those three nodes are located, replace them with a single Literal node (a sketch follows).
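A minimal, self-contained Scala sketch of this idea (these are simplified stand-in classes, not Catalyst's real TreeNode hierarchy):

```scala
// Toy expression tree for x + (1 + 2)
sealed trait Expression
case class Literal(value: Int) extends Expression            // constant
case class Attribute(name: String) extends Expression        // variable, e.g. x
case class Add(left: Expression, right: Expression) extends Expression

// Apply the rule bottom-up: wherever both children of an Add are Literals,
// replace the three nodes with a single folded Literal.
def foldConstants(e: Expression): Expression = e match {
  case Add(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b)
      case (fl, fr)                 => Add(fl, fr)
    }
  case other => other
}

val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))   // x + (1 + 2)
println(foldConstants(tree))   // Add(Attribute(x),Literal(3)), i.e. x + 3
```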

The above is the simplest possible illustration of an equivalence rule and of how a rule is applied to a syntax tree. In practice, any SQL optimizer defines a large number of rules (discussed later). The optimizer traverses every node in the syntax tree and pattern-matches the node against all available rules; if a match succeeds, the transformation is applied, and if all rules fail, it moves on to the next node.

Catalyst Workflow
Every optimizer works in the same way. The SQL statement is first parsed into a syntax tree by the Parser module; this tree is called the Unresolved Logical Plan. The Analyzer module then resolves it into a Logical Plan with the help of metadata. Next, various rule-based optimization strategies are applied to produce an Optimized Logical Plan. This optimized plan is still only logical and cannot be executed by Spark directly, so it finally has to be converted into a Physical Plan. To make the whole process easier to follow, the rest of the article walks through a simple example.

Parser

The Parser simply splits the SQL string into tokens and then parses them into a syntax tree according to the grammar. The Parser module is typically implemented with the third-party library ANTLR, as in Hive, Presto, SparkSQL and others. The example used throughout this article involves two tables: the people table stores users' basic information, and the score table stores their various scores. A hypothetical reconstruction of the example statement is sketched below; the AST produced by the Parser appeared as a figure in the original article.
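The example statement itself appears only as an image in the original article, so the query below is a reconstruction assembled from the details mentioned later in the text (a join of people and score on id, the filter age > 10, the constant 100 + 80, and the math_score/english_score columns). Treat it as illustrative, not as the author's exact SQL:

```scala
// Hypothetical reconstruction of the running example; not the author's exact query.
val query = """
  SELECT SUM(v)
  FROM (
    SELECT score.id,
           100 + 80 + score.math_score + score.english_score AS v
    FROM people JOIN score ON people.id = score.id
    WHERE people.age > 10
  ) tmp
"""
```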

Analyzer

The logical plan produced by parsing already has the right skeleton, but the system does not yet know what tokens such as score and sum actually refer to. Basic metadata is needed to resolve them. The most important metadata falls into two parts: the table schema and basic function information. A table's schema includes its basic definition (column names and data types), its data format (Json, Text, ...), its physical location, and so on; the basic function information mainly refers to the class information of functions. Hypothetical definitions of the two example tables are sketched below.
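For concreteness, here are hypothetical definitions of the two example tables; the column names come from the article, while the data types and storage format are assumptions:

```scala
// Assumed schemas for the running example (column names from the article, types guessed).
spark.sql("CREATE TABLE people (id BIGINT, name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE score  (id BIGINT, math_score INT, english_score INT) USING parquet")
```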

The Analyzer traverses the entire syntax tree, binding a data type to every node and resolving every function. For example, people is resolved, based on the catalog metadata, into a table with the three columns age, id and name; people.age is resolved into a variable of type int; and sum is resolved into a specific aggregate function.

SparkSQL's Analyzer defines a variety of resolution rules; readers interested in digging deeper can look at the Analyzer class, where the basic resolution rules are defined.

Optimizer

The Optimizer is the core of Catalyst. As mentioned above, optimization is divided into rule-based and cost-based; SparkSQL 2.1 still does not support cost-based optimization well (it will be described in a later article), so only rule-based strategies are covered here. A rule-based optimization is, again, a traversal of the syntax tree: nodes matching the pattern of a particular rule are transformed into their equivalent form, so rule-based optimization is essentially an equivalence transformation from one tree to another. SQL has many classic optimization rules; the following examples walk through three common ones: predicate pushdown, constant folding, and column pruning.

On the left is the syntax tree produced by the Parser: the two tables are joined first, and the result is then filtered with age > 10. As is well known, join is usually a very expensive operator, and its cost generally depends on the size of the two participating tables, so shrinking those tables significantly reduces the join cost. Predicate pushdown does exactly that: it pushes the filtering before the join, so the filter conditions age > 10 and id != null are pushed down below the join. The data is then filtered as it is scanned, the amount of data participating in the join drops significantly, and the join time inevitably drops with it (a toy sketch of the idea follows).
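A toy Scala sketch of the idea, on a simplified plan tree rather than Catalyst's real predicate-pushdown rule: a Filter sitting directly on top of a Join is moved onto whichever input actually provides the columns the predicate references.

```scala
// Toy logical plan nodes (not Catalyst classes).
sealed trait Plan
case class Relation(name: String, columns: Set[String]) extends Plan
case class Filter(predicate: String, referencedCols: Set[String], child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan

// Push a Filter below a Join when its predicate only references one side's columns.
def pushDown(plan: Plan): Plan = plan match {
  case Filter(pred, cols, Join(l @ Relation(_, leftCols), r)) if cols.subsetOf(leftCols) =>
    Join(Filter(pred, cols, l), r)
  case Filter(pred, cols, Join(l, r @ Relation(_, rightCols))) if cols.subsetOf(rightCols) =>
    Join(l, Filter(pred, cols, r))
  case other => other
}

val people = Relation("people", Set("id", "name", "age"))
val score  = Relation("score",  Set("id", "math_score", "english_score"))

// Filter(age > 10) on top of Join(people, score) ...
val before = Filter("age > 10", Set("age"), Join(people, score))
// ... becomes Join(Filter(age > 10, people), score): people is filtered during the scan.
println(pushDown(before))
```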

Constant folding is very simple: it is just the x + (1 + 2) → x + 3 rule mentioned above. A tiny change, but a significant one: without it, the example would compute 100 + 80 once per record before adding it to math_score and english_score; after optimization the 100 + 80 no longer needs to be computed at run time.

Column pruning is another classic rule. In the example, the people table does not need all of its columns scanned: only the id column is used, so the other columns are pruned after people is scanned and only id is kept. On one hand this significantly reduces network and memory consumption; on the other, for columnar storage formats (such as Parquet) it significantly improves scan efficiency.

Beyond these, Catalyst defines many other optimization rules; readers interested in digging deeper can look at the Optimizer class, which lists the full set of rules.

At this point the logical plan has been optimized fairly well. However, a logical plan still cannot actually be executed: it is only feasible logically, and Spark does not really know how to run it. For example, join is just an abstract concept meaning that two tables are merged on the same id, but the logical plan does not specify how that merge is implemented.

The logical plan therefore has to be converted into a physical plan, turning a logically feasible plan into one that Spark can actually execute. Take the join operator: Spark has developed different algorithms for different scenarios, such as BroadcastHashJoin, ShuffleHashJoin and SortMergeJoin (join can be thought of as an interface, with BroadcastHashJoin as one concrete implementation). Physical planning essentially picks, from these implementations, the one expected to take the least time; that choice involves the cost-based optimization strategy, which a later article will describe in detail. A small illustration of how the chosen physical join shows up in practice follows.
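A small illustration using standard Spark APIs (the table names come from the running example; a SparkSession named `spark` is assumed):

```scala
import org.apache.spark.sql.functions.broadcast

val peopleDf = spark.table("people")
val scoreDf  = spark.table("score")

// Hinting that score is small enough to broadcast typically yields a
// BroadcastHashJoin in the physical plan ...
peopleDf.join(broadcast(scoreDf), "id").explain()

// ... while disabling the broadcast threshold usually falls back to SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
peopleDf.join(scoreDf, "id").explain()
```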

SparkSQL Execution Plan

At this point, a complete Catalyst workflow has been walked through with a simple example, covering the Parser phase, the Analyzer phase, the Optimizer phase and the physical planning phase. Some readers may want to see what this whole process looks like for a specific SQL statement in a Spark environment. There are two ways:

1. Use the queryExecution method to view the logical execution plans and the explain method to view the physical execution plan, for example:
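A minimal sketch, assuming a SparkSession `spark` and the `query` string from the Parser section above:

```scala
val df = spark.sql(query)

println(df.queryExecution.logical)        // parsed (unresolved) logical plan
println(df.queryExecution.analyzed)       // logical plan after the Analyzer
println(df.queryExecution.optimizedPlan)  // logical plan after the Optimizer
df.explain(true)                          // prints the logical plans plus the physical plan
```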

2. Use the Spark Web UI, where the plans for each executed query can be viewed under the SQL tab.

References:

1. Deep Dive into Spark SQL's Catalyst Optimizer

2. A Deep Dive into Spark SQL's Catalyst Optimizer (talk): https://www.youtube.com/watch?v=gdeepbbcz2g&index=4&list=pl-x35fylirwhecvvlibznm1vfwltr5b6b

3. Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

4. Lifetime of a Spark SQL job: http://ks.netease.com/blog?id=3125
