Catalyst Optimizer
For more information, see ChenZhongPu's Gitbook:
https://www.gitbook.com/book/chenzhongpu/bigdatanotes/details
Spark SQL's Catalyst optimizer is easy to extend, and it supports both rule-based and cost-based optimization.
At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework are libraries specific to relational query processing (such as expressions and logical query plans), together with sets of rules that handle the different phases of query execution: analysis, logical optimization, physical planning, and code generation.
Tree
The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class.
For example, the expression x + (1 + 2) is represented by the tree:
Add(Attribute(x), Add(Literal(1), Literal(2)))
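As a rough sketch, loosely following the example in the Catalyst paper (these are toy case classes, not Spark's actual TreeNode hierarchy), the node types in this tree could be modeled as:

// Toy model of Catalyst tree nodes, for illustration only.
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode                   // a constant value
case class Attribute(name: String) extends TreeNode               // a column reference, e.g. x
case class Add(left: TreeNode, right: TreeNode) extends TreeNode  // sum of two expressions

// The expression x + (1 + 2) as a tree:
val expr: TreeNode = Add(Attribute("x"), Add(Literal(1), Literal(2)))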
Rules
Rules can be used to manipulate trees, that is, to transform one tree into another. Although a rule can run arbitrary code on its input tree (since the tree is just a Scala object), the most common approach is to use pattern matching to find and replace subtrees that have a specific structure.
For example, constant folding across an Add of two literals can be implemented as follows:
tree.transform { case Add(Literal(c1), Literal(c2)) => Literal(c1+c2) }
Applying this transform to the tree for x + (1 + 2) produces a new tree for x + 3.
In a single transform call, the pattern may match more than one place in the tree, so the rule can be applied multiple times.
Finally, rule conditions and bodies can contain arbitrary Scala code, which makes Catalyst easy for beginners to get started with.
In addition, transformations are performed on immutable trees, which makes rules easier to debug and makes it more feasible to parallelize the optimizer.
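Continuing the toy classes from the sketch above, a simplified stand-in for Catalyst's transform could look like this (the transform helper is illustrative, not the real Spark API):

// A simplified transform: rewrite children first, then apply the rule wherever it matches.
def transform(node: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = node match {
    case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
    case other     => other
  }
  rule.applyOrElse(withNewChildren, (n: TreeNode) => n)
}

// The constant-folding rule from the text.
val folded = transform(expr) {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
// expr was x + (1 + 2); folded is Add(Attribute("x"), Literal(3)), i.e. x + 3.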
Using Catalyst in Spark SQL
Analysis
Whether the relation comes from the AST (abstract syntax tree) produced by the SQL parser or from a DataFrame object built through the API, it may contain unresolved attribute references or relations: for example, in SELECT col FROM sales, the type of col, or even whether col is a valid column name, is not known until we look up the sales table.
An attribute is called unresolved if its type is not known or it has not been matched to an input table (or an alias).
Spark SQL uses Catalyst rules together with a Catalog object, which tracks the tables in all data sources, to resolve these attributes.
This part of the code is located in org.apache.spark.sql.catalyst.analysis.Analyzer.scala.
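To see the analysis step at work, here is a minimal sketch assuming a local SparkSession named spark and a temporary view named sales (names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._
Seq((1, "a"), (2, "b")).toDF("col", "other").createOrReplaceTempView("sales")

val df = spark.sql("SELECT col FROM sales")
println(df.queryExecution.logical)   // parsed plan: 'col is still an unresolved attribute
println(df.queryExecution.analyzed)  // analyzed plan: col resolved against the catalog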
Logical Optimizations
The logical optimization phase applies standard rule-based optimizations to the logical plan, including constant folding, predicate pushdown, projection pruning, null propagation, and Boolean expression simplification.
Predicate pushdown: move predicates from the WHERE clause of an outer query block down into lower-level query blocks (such as views), so that data is filtered as early as possible and indexes can potentially be used more effectively.
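For instance, continuing the illustrative sales view from the earlier sketch, the effect of predicate pushdown can be observed by writing a filter on top of a projection and inspecting the optimized plan:

// The filter is written above the projection, but the optimizer pushes it
// below the projection so rows are filtered before they are projected.
val q = spark.sql("SELECT col FROM sales").filter("col > 1")
println(q.queryExecution.optimizedPlan)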
It is also simple to add new rules for specific situations.
For example, the following rule simplifies LIKE expressions:
object LikeSimplification extends Rule[LogicalPlan] {
  val startsWith = "([^_%]+)%".r
  val endsWith = "%([^_%]+)".r
  val contains = "%([^_%]+)%".r
  val equalTo = "([^_%]*)".r

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Like(l, Literal(startsWith(pattern), StringType)) if !pattern.endsWith("\\") =>
      StartsWith(l, Literal(pattern))
    case Like(l, Literal(endsWith(pattern), StringType)) =>
      EndsWith(l, Literal(pattern))
    case Like(l, Literal(contains(pattern), StringType)) if !pattern.endsWith("\\") =>
      Contains(l, Literal(pattern))
    case Like(l, Literal(equalTo(pattern), StringType)) =>
      EqualTo(l, Literal(pattern))
  }
}
The doc comment on this code reads:
> Simplifies LIKE expressions that do not need full regular expressions to evaluate the condition. For example, when the expression is just checking to see if a string starts with a given pattern.
This part of the code is located in org.apache.spark.sql.catalyst.optimizer.Optimizer.scala.
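As a sketch of how a user-defined rule in the same style can be plugged in from application code, using Spark's experimental extraOptimizations hook (the rule itself is a toy, and spark and sales refer to the illustrative setup above):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Toy rule: rewrite the integer literal 0 to 42, just to make its effect visible.
object ZeroToFortyTwo extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Literal(0, IntegerType) => Literal(42, IntegerType)
  }
}

// Append the rule as an extra optimizer batch and check the optimized plan.
spark.experimental.extraOptimizations = Seq(ZeroToFortyTwo)
println(spark.sql("SELECT col + 0 FROM sales").queryExecution.optimizedPlan)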
Physical Planning
In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans, then selects one of them using a cost model.
Currently, the cost model is used only when choosing the join algorithm: if a relation is known to be small, Spark SQL uses a broadcast join, relying on a peer-to-peer broadcast facility. The cost model may be applied to more algorithms in the future.
The physical planner also performs rule-based optimizations.
This part of the code is located in org.apache.spark.sql.execution.SparkStrategies.scala.
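A sketch of join selection, assuming the SparkSession spark from the earlier sketch. The broadcast() hint marks the right side as small enough to ship to every executor, so the planner chooses a broadcast join; without the hint, the choice is based on size statistics (spark.sql.autoBroadcastJoinThreshold):

import org.apache.spark.sql.functions.broadcast

val large = spark.range(100000).toDF("id")
val small = spark.range(100).toDF("id")

// The physical plan printed here should contain a BroadcastHashJoin node.
large.join(broadcast(small), "id").explain()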
Code Generation
The final stage of query optimization is generating Java bytecode at runtime.
Because Spark SQL mostly operates on in-memory datasets, where processing is CPU-bound, code generation can speed up execution.
A code generation engine is hard to build; it is essentially a compiler. Catalyst, however, relies on a Scala feature called quasiquotes that makes it much simpler. Quasiquotes let a program construct abstract syntax trees programmatically and hand them to the Scala compiler at runtime to generate bytecode.
For example, consider the expression (x+y)+1. Without code generation, such an expression has to be interpreted for each row of data by walking the nodes of the tree, which introduces a large number of branches and virtual method calls. With code generation, the expression tree can instead be translated into a Scala AST, as in the following function:
def compile(node: Node): AST = node match {
  case Literal(value) => q"$value"
  case Attribute(name) => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
This part of the code is located in org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.scala.
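As a small standalone illustration of quasiquotes themselves (not Spark's actual code generator; this sketch assumes Scala 2 with the scala-reflect and scala-compiler libraries on the classpath):

import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

val toolbox = currentMirror.mkToolBox()

// Build an AST with quasiquotes, splicing smaller trees into a larger one.
val x: Tree = q"21"
val tree: Tree = q"$x + $x + 1"

// Compile and evaluate the tree at runtime.
println(toolbox.eval(tree))   // 43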