Catalyst Optimizer

For more information, see ChenZhongPu's Gitbook:
https://www.gitbook.com/book/chenzhongpu/bigdatanotes/details

Spark SQL's Catalyst optimizer is easy to extend, and it supports both rule-based and cost-based optimization.

Internally, Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it implements libraries specific to relational query processing (for example, expressions and logical query plans), along with sets of rules for the different phases of query execution: analysis, logical optimization, physical planning, and code generation.

Tree

The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined as subclasses of the TreeNode class in Scala.

For example, the tree for the expression x + (1 + 2) is:

    Add(Attribute(x), Add(Literal(1), Literal(2)))
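
As a rough sketch of what such node classes look like (the names follow the Catalyst paper's example, not Spark's actual source, which lives under org.apache.spark.sql.catalyst):

    // A minimal sketch of Catalyst-style tree nodes.
    abstract class TreeNode {
      def children: Seq[TreeNode]
    }

    // A constant value, e.g. Literal(1)
    case class Literal(value: Int) extends TreeNode {
      def children = Nil
    }

    // A reference to a column of the input row, e.g. Attribute("x")
    case class Attribute(name: String) extends TreeNode {
      def children = Nil
    }

    // The sum of two expressions
    case class Add(left: TreeNode, right: TreeNode) extends TreeNode {
      def children = Seq(left, right)
    }

    // The tree for x + (1 + 2):
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))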

Rules

Trees can be manipulated using rules, which transform one tree into another. Although a rule can execute arbitrary code on its input tree (the tree is just a Scala object), the most common approach is to use a set of pattern-matching functions that find and replace subtrees with a specific structure.

For example, a rule that folds Add operations between constants can be implemented as follows:

    tree.transform {
      case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
    }

Applying this rule to the tree for x + (1 + 2) yields a new tree for x + 3.

Within a single transform call, a rule may match multiple parts of the tree.
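
For instance, reusing the toy Literal/Attribute/Add classes from the sketch above, a hand-rolled bottom-up transform (standing in for Catalyst's TreeNode.transform, which this is not) shows a single pass firing the constant-folding rule at several places in the tree:

    // A toy bottom-up transform over the sketch classes above.
    def transform(node: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
      // Rewrite the children first, then offer the result to the rule.
      val withNewChildren = node match {
        case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
        case other     => other
      }
      rule.applyOrElse(withNewChildren, (n: TreeNode) => n)
    }

    val twice  = Add(Add(Literal(1), Literal(2)), Add(Literal(3), Literal(4)))
    val folded = transform(twice) {
      case Add(Literal(a), Literal(b)) => Literal(a + b)
    }
    // The rule fires three times in one pass: (1+2) -> 3, (3+4) -> 7,
    // and then 3+7 -> 10, so folded == Literal(10).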

Finally, rule conditions and their bodies can contain arbitrary code. This makes Catalyst easy for newcomers to pick up.

In addition, tree transformations are performed on immutable trees, which makes rules easier to debug and makes it possible to parallelize the optimizer.

Using Catalyst in Spark SQL

Analysis

Whether the input is an AST (abstract syntax tree) produced by the SQL parser or a DataFrame object built with the API, the relation may contain unresolved attribute references or relations. For example, in SELECT col FROM sales, the type of col, or even whether col is a valid column name, is unknown until we look up the sales table.

An attribute is called unresolved if we do not know its type or have not matched it to an input table (or an alias).

To resolve these attributes, Spark SQL uses Catalyst rules together with a Catalog object that tracks the tables in all data sources.
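
As a minimal illustration of the idea (the class and catalog names below are invented for this sketch; the real Analyzer is far more involved):

    // Toy stand-ins for unresolved vs. resolved attributes.
    case class UnresolvedAttribute(name: String)
    case class ResolvedAttribute(name: String, dataType: String)

    // A toy catalog: table name -> (column name -> type)
    val catalog: Map[String, Map[String, String]] =
      Map("sales" -> Map("col" -> "INT"))

    def resolve(table: String, attr: UnresolvedAttribute): ResolvedAttribute =
      catalog.get(table).flatMap(_.get(attr.name)) match {
        case Some(tpe) => ResolvedAttribute(attr.name, tpe)
        case None      => sys.error(s"cannot resolve '${attr.name}' in table '$table'")
      }

    // SELECT col FROM sales: 'col' resolves to an INT attribute.
    resolve("sales", UnresolvedAttribute("col"))  // ResolvedAttribute("col", "INT")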

This part of the code is located in org.apache.spark.sql.catalyst.analysis.Analyzer.scala.

Logical Optimizations

The logical optimization phase is entirely rule-based.

Its rules include constant folding, predicate pushdown, projection pruning, null propagation, and Boolean expression simplification.

Predicate pushdown: move predicates from the WHERE clause of an outer query block into lower-level query blocks (such as views), so that data is filtered early and indexes can potentially be used more effectively.
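
For example, here is a toy pushdown rule over invented Project/Filter/Scan operators (not Spark's actual optimizer code) that moves a filter below a projection:

    // Toy logical operators; Spark's real ones live in
    // org.apache.spark.sql.catalyst.plans.logical.
    sealed trait Plan
    case class Scan(table: String) extends Plan
    case class Project(columns: Seq[String], child: Plan) extends Plan
    case class Filter(predicateColumn: String, child: Plan) extends Plan

    // Push a filter through a projection when the projection still
    // produces the column the predicate needs.
    def pushDown(plan: Plan): Plan = plan match {
      case Filter(col, Project(cols, child)) if cols.contains(col) =>
        Project(cols, Filter(col, child))
      case other => other
    }

    // Filter(price) over Project([price, qty], Scan(sales)) becomes
    // Project([price, qty], Filter(price, Scan(sales))), so rows are
    // discarded before the projection runs.
    val before = Filter("price", Project(Seq("price", "qty"), Scan("sales")))
    val after  = pushDown(before)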

Adding new rules for different situations is also straightforward.

For example, the following rule simplifies LIKE expressions:

    object LikeSimplification extends Rule[LogicalPlan] {
      val startsWith = "([^_%]+)%".r
      val endsWith = "%([^_%]+)".r
      val contains = "%([^_%]+)%".r
      val equalTo = "([^_%]*)".r

      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Like(l, Literal(startsWith(pattern), StringType)) if !pattern.endsWith("\\") =>
          StartsWith(l, Literal(pattern))
        case Like(l, Literal(endsWith(pattern), StringType)) =>
          EndsWith(l, Literal(pattern))
        case Like(l, Literal(contains(pattern), StringType)) if !pattern.endsWith("\\") =>
          Contains(l, Literal(pattern))
        case Like(l, Literal(equalTo(pattern), StringType)) =>
          EqualTo(l, Literal(pattern))
      }
    }

The comment on this code in the Spark source reads:

> Simplifies LIKE expressions that do not need full regular expressions to evaluate the condition. For example, when the expression is just checking to see if a string starts with a given pattern.
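
With this rule, a predicate such as name LIKE 'foo%' is rewritten to StartsWith(name, "foo"), so each row is checked with a cheap prefix comparison instead of a full regular-expression match.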

This part of the code is located in org.apache.spark.sql.catalyst.optimizer.Optimizer.scala.

Physical Planning

In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans, then selects one of them using a cost model.

Currently, the cost model is used only to select join algorithms: if a relation is known to be small, Spark SQL uses a broadcast join, taking advantage of a peer-to-peer broadcast facility in Spark. The cost model may be applied to more algorithms in the future.
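
As a practical aside, what counts as "small" is governed by Spark's spark.sql.autoBroadcastJoinThreshold setting. A minimal sketch using the public API (the threshold value below is just the common default):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      // Tables smaller than this (in bytes) are broadcast to every executor
      // instead of being shuffled; -1 disables broadcast joins entirely.
      .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      .getOrCreate()

    // If `dim` is below the threshold, the planner picks a broadcast join;
    // explain() shows BroadcastHashJoin in the chosen physical plan.
    val facts = spark.range(1000000).toDF("id")
    val dim   = spark.range(100).toDF("id")
    facts.join(dim, "id").explain()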

The physical planner also performs rule-based optimizations.

This part of the code is located in org.apache.spark.sql.execution.SparkStrategies.scala.

Code Generation

The final phase of query optimization involves generating Java bytecode at runtime.

Because Spark SQL mostly operates on in-memory datasets, where processing is CPU-bound, code generation can speed up execution.

A code generation engine is hard to build, amounting essentially to a compiler. However, Catalyst relies on a special feature of the Scala language, quasiquotes, to make this much simpler. Quasiquotes allow a program to construct abstract syntax trees programmatically and submit them to the Scala compiler at runtime to generate bytecode.

Consider the expression (x+y)+1. Without code generation, such an expression would have to be interpreted for each row of data by walking the tree of nodes, which introduces a large number of branches and virtual method calls.

    def compile(node: Node): AST = node match {
      case Literal(value)   => q"$value"
      case Attribute(name)  => q"row.get($name)"
      case Add(left, right) => q"${compile(left)} + ${compile(right)}"
    }
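
As a standalone illustration of quasiquotes themselves (this is not Spark's CodeGenerator; it assumes Scala 2 with scala-reflect and scala-compiler on the classpath), an AST for (x+y)+1 can be built and compiled at runtime like this:

    import scala.reflect.runtime.currentMirror
    import scala.reflect.runtime.universe._
    import scala.tools.reflect.ToolBox

    // A quasiquote builds a Scala AST; the toolbox compiles and runs it
    // at runtime, which is how generated code avoids interpretation.
    val tb = currentMirror.mkToolBox()

    // The AST for a function evaluating (x + y) + 1 against a row.
    val ast = q"(row: Map[String, Int]) => (row(${"x"}) + row(${"y"})) + 1"

    val fn = tb.eval(ast).asInstanceOf[Map[String, Int] => Int]
    fn(Map("x" -> 2, "y" -> 3))  // returns 6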

This part of the code is located in org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.scala.
