Recently, big data technologies and traditional database technologies have been converging and borrowing heavily from each other. The strength of traditional databases lies in their decades of accumulated SQL optimizer experience; their weakness lies in distributed concerns such as high availability, fault tolerance, and scalability. Given time and some remodeling, for example by adopting consensus protocols such as Paxos or Raft to strengthen their distributed capabilities, they will likely earn a place in big data systems. Big data technologies, by contrast, come with scalability, availability, and fault tolerance built in, but their SQL optimizer experience is largely inherited from traditional databases, supplemented by optimizations specific to columnar storage.

This article introduces Catalyst, the optimizer framework of Spark SQL. Its design largely follows the traditional database approach and is essentially the same as that of most current big data SQL engines (Impala, Presto, Hive with Calcite, and so on), so reading this article also gives a basic understanding of how those other engines work.

SQL optimizers follow two major strategies: rule-based optimization (RBO) and cost-based optimization (CBO). Rule-based optimization is an empirical, heuristic approach that relies on rules distilled from past experience; it is simple and covers most optimization logic, but it struggles with the most important operator, the join. As a simple example, should two tables be joined with BroadcastHashJoin or SortMergeJoin? Spark SQL currently decides this through a manually configured parameter: if a table's size is below the threshold, BroadcastHashJoin is used. This scheme is neither elegant nor flexible. Cost-based optimization addresses exactly this problem: it estimates the cost of each candidate join strategy for the two tables and picks the cheapest plan. This article focuses on rule-based optimization; a follow-up article will cover cost-based optimization in detail. (A figure in the original post highlights, in a red box, the rule-based portion that this article covers.)

Preliminary Knowledge: Tree & Rule

Before describing how the SQL optimizer works, two important concepts need to be introduced: the tree and the rule. Whatever else you may know about SQL optimizers, you have surely heard of the SQL syntax tree, the structure into which the compiler parses a SQL statement. The tree consists of node objects, each with a specific data type and zero or more child nodes (in code, a node is a TreeNode object). A simple example: the expression x + (1 + 2) involves three data types, Literal (a constant), Attribute (a variable such as x), and Add (an operation); mapped onto a tree, each of these becomes a node (a small sketch follows below).

A tree also has a very important property: it can be transformed into an equivalent tree by applying rules. For example, define an equivalence rule: the addition of two integer constants can be replaced by a single integer constant. The rule is trivial, yet it rewrites the expression above, x + (1 + 2), into x + 3. How does a program find the two integer constants to fold?
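To make the tree and the rule concrete, here is a minimal Scala sketch. The case classes are simplified stand-ins invented for this illustration, not Catalyst's actual TreeNode or expression classes, and foldConstants simply anticipates, in code, the traversal idea described in the next paragraph.

```scala
// Simplified stand-ins for illustration only; not Catalyst's real classes.
sealed trait Expression
case class Literal(value: Int) extends Expression                        // a constant
case class Attribute(name: String) extends Expression                    // a variable such as x
case class Add(left: Expression, right: Expression) extends Expression   // an operation

// x + (1 + 2) mapped onto a tree: every data type becomes a node.
val expr: Expression = Add(Attribute("x"), Add(Literal(1), Literal(2)))

// The rule "the addition of two integer constants can be replaced by one
// integer constant", applied by traversing the tree and pattern matching.
def foldConstants(e: Expression): Expression = e match {
  case Add(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b)   // Add(Literal, Literal) -> Literal
      case (fl, fr)                 => Add(fl, fr)
    }
  case other => other
}

// foldConstants(expr) yields Add(Attribute("x"), Literal(3)), i.e. x + 3.
```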
The answer is a simple binary-tree traversal combined with pattern matching: at each node, check whether the current node is an Add whose left and right children are both integer constants; if so, replace those three nodes with a single Literal node. This tiny example illustrates what an equivalence rule is and how a rule is applied to a syntax tree. Any real SQL optimizer defines a large number of rules (discussed later); the optimizer traverses every node of the syntax tree, pattern-matching the node against all available rules, applies the transformation whenever a rule matches, and simply moves on to the next node when none do.

Catalyst Workflow

Every optimizer works in essentially the same way. The SQL statement is first parsed by the Parser module into a syntax tree called the unresolved logical plan. The Analyzer module, with the help of metadata, turns the unresolved logical plan into a resolved logical plan. Various rule-based optimization strategies are then applied to produce an optimized logical plan. The optimized plan is still purely logical and cannot be understood by the Spark runtime as is, so it finally has to be converted into a physical plan. To make the whole process easier to follow, the rest of the article walks through a simple example.

Parser

The Parser simply splits the SQL string into tokens and assembles them into a syntax tree according to the grammar. The parser module is usually implemented with the third-party library ANTLR; Hive, Presto, and Spark SQL all take this approach. The original post shows an example SQL statement over two tables, people (basic user information) and score (the users' scores), together with the AST the parser produces for it.

Analyzer

After parsing, the logical plan has its skeleton, but the system still has no idea what tokens such as score or sum refer to; metadata is needed to give them meaning. The most important metadata falls into two parts: the table schema and basic function information. The table schema covers the table definition (column names and data types), the data format of the table (JSON, text, and so on), the physical location of the table, and so forth; the function information mainly records the class that implements each function. The Analyzer traverses the entire syntax tree and binds data types and functions to every node; for example, based on the table metadata, people resolves to a table containing the three columns age, id, and name. (The sketch below rebuilds the running example and shows how to inspect the parsed and analyzed plans.)
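Since the figures containing the example SQL are not reproduced in this reprint, the following Scala sketch is a hypothetical reconstruction of the running example, based only on the tables, columns, and predicates the article mentions (people(id, name, age), score(id, math_score, english_score), a join on id, a filter on age > 10, and the constant sub-expression 100 + 80); the exact query in the original post may differ.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

// Toy data with the schemas described in the text.
Seq((1, "alice", 25), (2, "bob", 8))
  .toDF("id", "name", "age")
  .createOrReplaceTempView("people")

Seq((1, 90, 85), (2, 70, 60))
  .toDF("id", "math_score", "english_score")
  .createOrReplaceTempView("score")

// A query shaped like the article's example: a join on id, a filter on age,
// the constant sub-expression 100 + 80, and an aggregation on top.
val query = spark.sql(
  """SELECT sum(v)
    |FROM (SELECT score.id,
    |             100 + 80 + score.math_score + score.english_score AS v
    |      FROM people JOIN score ON people.id = score.id
    |      WHERE people.age > 10) tmp""".stripMargin)

// queryExecution.logical is the plan produced by the Parser;
// queryExecution.analyzed is the plan after the Analyzer has bound metadata.
println(query.queryExecution.logical)
println(query.queryExecution.analyzed)
```

The same query object is reused in the later sketches.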
people.age is resolved to a variable of type int, and sum is resolved to a specific aggregate function, as shown in the original post's figure. The Analyzer in Spark SQL defines many resolution rules; readers who want to dig deeper can look at the Analyzer class, which lists the basic resolution rules.

Optimizer

The Optimizer is the core of Catalyst. As mentioned above, optimization splits into rule-based and cost-based strategies; Spark SQL 2.1 still does not support cost-based optimization well (more on that later), so only rule-based strategies are covered here. Rule-based optimization is, once again, a traversal of the syntax tree: nodes that match a rule's pattern are rewritten into an equivalent form, so rule-based optimization essentially converts one tree into another, equivalent tree. SQL has many classic optimization rules; three common ones are illustrated below: predicate pushdown, constant folding, and column pruning.

Predicate pushdown. In the parsed tree (the left side of the original figure), the two tables are joined first and the result is then filtered with age > 10. As is well known, the join operator is usually very expensive, and its cost generally depends on the sizes of the two inputs, so shrinking those inputs directly shrinks the join cost. Predicate pushdown moves the filtering ahead of the join: the conditions age > 10 and id != null are pushed down below the join, so the data is filtered while it is being scanned, the amount of data entering the join drops significantly, and the join inevitably becomes cheaper.

Constant folding. This is simply the x + (1 + 2) -> x + 3 rule mentioned above; a small change, but a meaningful one. Without it, the example would compute 100 + 80 for every row before adding the result to math_score and english_score; after the optimization, the 100 + 80 no longer has to be computed per row.

Column pruning. This is another classic rule: in the example, not all columns of the people table need to be scanned, only the column id, so the other columns are pruned after people is scanned and only id is kept. This noticeably reduces the amount of data moved over the network and held in memory and, for columnar storage formats such as Parquet, significantly improves scan efficiency.

Beyond these, Catalyst defines many other optimization rules; readers who want to go deeper can look at the Optimizer class, of which the original post shows a small excerpt (a sketch of what writing such a rule looks like follows below).
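To give a feel for what such rules look like in code, here is a hedged sketch of a tiny user-defined rule written against the Spark 2.x Catalyst API that the article is based on (in Spark 3.x the Add pattern takes an extra parameter, so the match would differ). FoldIntegerAdd is a made-up name, Spark already ships an equivalent built-in ConstantFolding rule, and spark is the session object from the earlier sketch.

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// A purely illustrative rule: fold the addition of two integer literals
// (e.g. 100 + 80) into a single literal, just like the 1 + 2 => 3 example.
object FoldIntegerAdd extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
      Literal(a + b)
  }
}

// Spark 2.x allows extra optimizer rules to be appended at runtime:
spark.experimental.extraOptimizations = Seq(FoldIntegerAdd)
```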
At this point the logical plan has been optimized quite well; however, it still cannot actually be executed. It is only logical, and Spark does not know how to carry it out. For example, join is an abstract concept meaning that two tables are merged on the same id, but the logical plan does not say how the merge is implemented. The logical plan therefore has to be converted into a physical plan, turning a plan that is merely logically feasible into one that Spark can really execute.

For the join operator, for example, Spark provides different algorithm strategies for different scenarios, such as BroadcastHashJoin, ShuffleHashJoin, and SortMergeJoin (join can be thought of as an interface, with BroadcastHashJoin as one concrete implementation). Producing the physical plan essentially means picking, among these concrete implementations, the one with the smallest cost, which again involves the cost-based optimization strategy that a later article will cover in detail.

SparkSQL Execution Plan

At this point the article has walked through Catalyst's complete workflow on a simple example, covering the Parser, Analyzer, Optimizer, and physical planning phases. Some readers may want to see this whole process for a concrete SQL statement in a Spark environment; there are two ways to do so: 1. use the queryExecution method to view the logical execution plans and the explain method to view the physical execution plan (the original post shows the output); 2. use the Spark WebUI (again shown in the original post).
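As a sketch of option 1, reusing the query and spark objects from the earlier example (the exact printed plans depend on the Spark version):

```scala
// Each Catalyst phase is exposed on queryExecution.
println(query.queryExecution.optimizedPlan)  // after the rule-based Optimizer
println(query.queryExecution.executedPlan)   // the physical plan Spark will run

// explain(true) prints the parsed, analyzed and optimized logical plans
// together with the physical plan in one go.
query.explain(true)

// The BroadcastHashJoin-vs-SortMergeJoin threshold mentioned earlier is a
// plain configuration value (10 MB by default):
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```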
Reference articles:
1. Deep Dive into Spark SQL's Catalyst Optimizer: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
2. A Deep Dive into Spark SQL's Catalyst Optimizer (talk): https://www.youtube.com/watch?v=GDeePbbCz2g&index=4&list=PL-x35fyliRwheCVvliBZNm1VFwltr5b6B
3. Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

This article is reproduced from http://hbasefly.com/2017/03/01/sparksql-catalyst/

SparkSQL from 0 to 1: Understanding Catalyst (reprint)