Spark SQL Catalyst Source Code Analysis: The TreeNode Library


The previous articles introduced Spark SQL Catalyst's SqlParser and Analyzer. I had intended to write about the Optimizer next, but realized I had not yet introduced TreeNode, the core concept of Catalyst. Understanding the TreeNode infrastructure makes it much easier to see how the Optimizer turns an analyzed Logical Plan into an optimized Logical Plan, so this article covers TreeNode first.

First, TreeNode types

The TreeNode library is the core class library of Catalyst: every syntax tree is built out of TreeNodes. TreeNode itself carries the self-recursive type parameter BaseType <: TreeNode[BaseType] and implements the Product trait, which allows it to store heterogeneous elements.
There are three forms of TreeNode: BinaryNode, UnaryNode, and LeafNode.
In Catalyst these nodes all inherit from LogicalPlan, so each such TreeNode can be regarded as a LogicalPlan; the exception is Expression, which inherits directly from TreeNode.
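To make the self-recursive type concrete, here is a minimal sketch of the pattern (hypothetical and simplified, not Spark source; SimpleTreeNode and foreachUp are illustrative names):

  // The F-bounded parameter BaseType <: SimpleTreeNode[BaseType] makes children
  // return the concrete node type, and Product exposes the constructor fields
  // via productIterator (which makeCopy relies on, as shown later).
  abstract class SimpleTreeNode[BaseType <: SimpleTreeNode[BaseType]] extends Product {
    self: BaseType =>

    def children: Seq[BaseType]

    // Post-order traversal: visit the children first, then this node.
    def foreachUp(f: BaseType => Unit): Unit = {
      children.foreach(_.foreachUp(f))
      f(self)
    }
  }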

The main inheritance relationships are shown in the class diagram:

  [Figure: class diagram of the TreeNode inheritance hierarchy, with the BinaryNode, UnaryNode, and LeafNode variants of LogicalPlan deriving from TreeNode]


1, BinaryNode

A binary node, that is, a node with left and right children:

  /**
   * A [[TreeNode]] that has two children, [[left]] and [[right]].
   */
  trait BinaryNode[BaseType <: TreeNode[BaseType]] {
    def left: BaseType
    def right: BaseType
    def children = Seq(left, right)
  }

  abstract class BinaryNode extends LogicalPlan with trees.BinaryNode[LogicalPlan] {
    self: Product =>
  }
The node definition is simple: the left and right children are both of BaseType, and children is Seq(left, right).

The list of classes that inherit BinaryNode can be used as a quick reference when reading the source. The most common binary nodes are Join and Union, sketched below.
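As a hedged sketch (signatures approximated from the Spark 1.0-era source; treat the field lists as indicative rather than exact), the two common binary nodes look like this:

  // Both nodes have exactly two child plans, left and right.
  case class Join(
      left: LogicalPlan,
      right: LogicalPlan,
      joinType: JoinType,
      condition: Option[Expression]) extends BinaryNode

  case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode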


2, UnaryNode

A unary node, that is, a node with only one child:

  /**
   * A [[TreeNode]] with a single [[child]].
   */
  trait UnaryNode[BaseType <: TreeNode[BaseType]] {
    def child: BaseType
    def children = child :: Nil
  }

  abstract class UnaryNode extends LogicalPlan with trees.UnaryNode[LogicalPlan] {
    self: Product =>
  }
The list of classes that inherit UnaryNode can likewise be used as a quick reference. Common unary nodes include Project, Subquery, Filter, and Limit; two of them are sketched below.
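Again as a hedged sketch (approximated 1.0-era signatures), each unary node wraps exactly one child plan:

  case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode

  case class Limit(limit: Expression, child: LogicalPlan) extends UnaryNode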

3, LeafNode

A leaf node, that is, a node with no children:

  /**
   * A [[TreeNode]] with no children.
   */
  trait LeafNode[BaseType <: TreeNode[BaseType]] {
    def children = Nil
  }

  abstract class LeafNode extends LogicalPlan with trees.LeafNode[LogicalPlan] {
    self: Product =>

    // Leaf nodes by definition cannot reference any input attributes.
    override def references = Set.empty
  }
   
The list of classes that inherit LeafNode can also be used as a quick reference. Commonly used leaf nodes include the Command family, some function classes, and UnresolvedRelation; the last one is sketched below.
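A hedged sketch of UnresolvedRelation (fields approximated from the 1.0-era source; they line up with the "UnresolvedRelation None, src, None" plan output shown later) makes it clear why it is a leaf: it only names a table and has no child plan:

  // No child plans, just an as-yet-unresolved reference to a table.
  case class UnresolvedRelation(
      databaseName: Option[String],
      tableName: String,
      alias: Option[String] = None) extends LeafNode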


Second, TreeNode core methods

This section briefly introduces the key properties and methods of the TreeNode class.

currentId

Every TreeNode in a tree has a unique id, generated from an atomic java.util.concurrent.atomic.AtomicLong counter:

  private val currentId = new java.util.concurrent.atomic.AtomicLong
  protected def nextId() = currentId.getAndIncrement()
sameInstance

To judge whether two instances are the same, it suffices to compare the ids of the two TreeNodes:

  def sameInstance(other: TreeNode[_]): Boolean = {
    this.id == other.id
  }
fastEquals

A more commonly used quick comparison. It avoids overriding Object.equals, which would stop the Scala compiler from generating the case-class equals method:

  def fastEquals(other: TreeNode[_]): Boolean = {
    sameInstance(other) || this == other
  }
map, flatMap, and collect all recursively apply a PartialFunction over the nodes of the tree. There are many other methods; space is limited, so they are not described here.
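A brief usage sketch (hypothetical snippet; plan stands for any LogicalPlan value): collect walks the whole tree and gathers the results of a PartialFunction, which is handy for inspecting a plan.

  // Gather every Filter condition anywhere in the plan tree.
  val filterConditions: Seq[Expression] = plan.collect {
    case Filter(condition, _) => condition
  }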

2.1, the core method: transform

The transform method accepts a PartialFunction, which is exactly the rule inside a batch that the earlier Analyzer article referred to. It applies the rule recursively to the node and all of its children, and finally returns a copy of the node (a node distinct from the current one; as described later, reflection is used to build the modified copy). If the rule's PartialFunction does not match a node, that node is returned unchanged.

Take a look at an example:

  object GlobalAggregates extends Rule[LogicalPlan] {
    // apply invokes the transform method of LogicalPlan (a TreeNode) with a PartialFunction.
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case Project(projectList, child) if containsAggregates(projectList) =>
        Aggregate(Nil, projectList, child)
    }

    def containsAggregates(exprs: Seq[Expression]): Boolean = {
      exprs.foreach(_.foreach {
        case agg: AggregateExpression => return true
        case _ =>
      })
      false
    }
  }
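Applying the rule is then just a function call on a plan (hypothetical usage; analyzedPlan stands for any LogicalPlan value):

  // transform walks the tree; every Project whose projectList contains an
  // AggregateExpression is rewritten into an Aggregate node.
  val rewritten: LogicalPlan = GlobalAggregates(analyzedPlan)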
What transform really calls is transformDown, which applies the rule in pre-order: first to the current node, then recursively down to its children. If the rule is successfully applied to the current node, the modified node afterRule is the one that goes on to apply the rule to its children.

The transformDown method:

  /**
   * Returns a copy of this node where `rule` has been recursively applied to it and all of its
   * children (pre-order). When `rule` does not apply to a given node it is left unchanged.
   * @param rule the function used to transform this node's children
   */
  def transformDown(rule: PartialFunction[BaseType, BaseType]): BaseType = {
    val afterRule = rule.applyOrElse(this, identity[BaseType])
    // Check if unchanged and then possibly return old copy to avoid gc churn.
    if (this fastEquals afterRule) {
      transformChildrenDown(rule)           // node unchanged: recurse into this node's children
    } else {
      afterRule.transformChildrenDown(rule) // node changed: recurse into the modified node's children
    }
  }
The most important method is transformChildrenDown: it recursively applies the PartialFunction to the children, and if any child changed it uses the resulting newArgs to generate a new node, calling makeCopy() to build it.

The transformChildrenDown method:

  /**
   * Returns a copy of this node where `rule` has been recursively applied to all the children of
   * this node. When `rule` does not apply to a given node it is left unchanged.
   * @param rule the function used to transform this node's children
   */
  def transformChildrenDown(rule: PartialFunction[BaseType, BaseType]): this.type = {
    var changed = false
    val newArgs = productIterator.map {
      case arg: TreeNode[_] if children contains arg =>
        val newChild = arg.asInstanceOf[BaseType].transformDown(rule)  // recursively apply the rule to the child
        if (!(newChild fastEquals arg)) {
          changed = true
          newChild
        } else {
          arg
        }
      case Some(arg: TreeNode[_]) if children contains arg =>
        val newChild = arg.asInstanceOf[BaseType].transformDown(rule)
        if (!(newChild fastEquals arg)) {
          changed = true
          Some(newChild)
        } else {
          Some(arg)
        }
      case m: Map[_, _] => m
      case args: Traversable[_] => args.map {
        case arg: TreeNode[_] if children contains arg =>
          val newChild = arg.asInstanceOf[BaseType].transformDown(rule)
          if (!(newChild fastEquals arg)) {
            changed = true
            newChild
          } else {
            arg
          }
        case other => other
      }
      case nonChild: AnyRef => nonChild
      case null => null
    }.toArray
    if (changed) makeCopy(newArgs) else this  // if anything changed, reflectively build a new copy from newArgs
  }
The makeCopy method generates a node copy via reflection:
  /**
   * Creates a copy of this type of tree node after a transformation.
   * Must be overridden by child classes that have constructor arguments
   * that are not present in the productIterator.
   * @param newArgs the new product arguments.
   */
  def makeCopy(newArgs: Array[AnyRef]): this.type = attachTree(this, "makeCopy") {
    try {
      val defaultCtor = getClass.getConstructors.head  // reflectively grab the first (default) constructor
      if (otherCopyArgs.isEmpty) {
        defaultCtor.newInstance(newArgs: _*).asInstanceOf[this.type]  // reflectively build a node of the current type
      } else {
        defaultCtor.newInstance((newArgs ++ otherCopyArgs).toArray: _*).asInstanceOf[this.type]  // append any extra constructor args
      }
    } catch {
      case e: java.lang.IllegalArgumentException =>
        throw new TreeNodeException(
          this, s"Failed to copy node. Is otherCopyArgs specified correctly for $nodeName? " +
            s"Exception message: ${e.getMessage}.")
    }
  }
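To see the productIterator-plus-reflection trick in isolation, here is a self-contained, hypothetical demo (plain Scala, not Catalyst code; Num, Add, and MakeCopyDemo are made-up names) that rebuilds a case-class node from new arguments the way makeCopy does:

  // Standalone demo of the makeCopy idea: read a node's fields via productIterator,
  // swap one of them, and rebuild the node through its first constructor via reflection.
  case class Num(value: Int)
  case class Add(left: Num, right: Num)

  object MakeCopyDemo extends App {
    val node = Add(Num(1), Num(2))

    // Build newArgs: replace the child a "rule" would rewrite, keep everything else.
    val newArgs: Array[AnyRef] = node.productIterator.map {
      case Num(1) => Num(10)   // pretend a rule rewrote this child
      case other  => other
    }.map(_.asInstanceOf[AnyRef]).toArray

    // Reflectively invoke the first constructor, just like makeCopy does.
    val ctor = node.getClass.getConstructors.head
    val copy = ctor.newInstance(newArgs: _*).asInstanceOf[Add]

    println(copy)  // prints: Add(Num(10),Num(2))
  }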

Third, TreeNode in practice

With the TreeNode fundamentals covered, let's start from a piece of SQL and trace how the whole Spark SQL tree is transformed:

  SELECT * FROM (SELECT * FROM src) a JOIN (SELECT * FROM src) b ON a.key = b.key

First, let's look at the plans generated in the console:
  sbt/sbt hive/console
  Using /usr/java/default as default JAVA_HOME.
  Note, this will be overridden by -java-home if it is set.
  [info] Loading project definition from /app/hadoop/shengli/spark/project/project
  [info] Loading project definition from /app/hadoop/shengli/spark/project
  [info] Set current project to root (in build file:/app/hadoop/shengli/spark/)
  [info] Starting scala interpreter...
  [info]
  import org.apache.spark.sql.catalyst.analysis._
  import org.apache.spark.sql.catalyst.dsl._
  import org.apache.spark.sql.catalyst.errors._
  import org.apache.spark.sql.catalyst.expressions._
  import org.apache.spark.sql.catalyst.plans.logical._
  import org.apache.spark.sql.catalyst.rules._
  import org.apache.spark.sql.catalyst.types._
  import org.apache.spark.sql.catalyst.util._
  import org.apache.spark.sql.execution
  import org.apache.spark.sql.hive._
  import org.apache.spark.sql.hive.test.TestHive._
  import org.apache.spark.sql.parquet.ParquetTestData

  scala> val query = sql("SELECT * FROM (SELECT * FROM src) a JOIN (SELECT * FROM src) b ON a.key = b.key")
3.1, Unresolved Logical Plan

The first step generates the Unresolved Logical Plan:
  scala> query.queryExecution.logical
  res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
  Project [*]
   Join Inner, Some(('a.key = 'b.key))
    Subquery a
     Project [*]
      UnresolvedRelation None, src, None
    Subquery b
     Project [*]
      UnresolvedRelation None, src, None
Drawn as a tree (my personal understanding), this uses the three node kinds introduced at the start; in the original figure UnaryNodes are green, BinaryNodes red, and LeafNodes blue. Redrawn in text, with each node's kind annotated:
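  Project [*]                                        <- UnaryNode
   └─ Join Inner, ('a.key = 'b.key)                  <- BinaryNode
      ├─ Subquery a                                  <- UnaryNode
      │   └─ Project [*]                             <- UnaryNode
      │       └─ UnresolvedRelation None, src, None  <- LeafNode
      └─ Subquery b                                  <- UnaryNode
          └─ Project [*]                             <- UnaryNode
              └─ UnresolvedRelation None, src, None  <- LeafNode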

3.2, Analyzed Logical Plan

The Analyzer applies its batches of rules to the Unresolved Logical Plan tree: the EliminateAnalysisOperators rule eliminates the Subquery nodes, and the Batch("Resolution") rules resolve attributes and relations, yielding the Analyzed Logical Plan tree.
3.3, Optimized Logical Plan

In my view the Optimizer is the essence of Spark SQL, because the whole of Spark SQL's optimization is done there; a later article will explain the Optimizer. Here the optimization is not obvious, because the SQL itself is not complex:
  scala> query.queryExecution.optimizedPlan
  res3: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
  Project [key#0,value#1,key#2,value#3]
   Join Inner, Some((key#0 = key#2))
    MetastoreRelation default, src, None
    MetastoreRelation default, src, None
The resulting tree is as follows:
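  (Derived from the optimizedPlan output above; node kinds annotated.)

  Project [key#0,value#1,key#2,value#3]        <- UnaryNode
   └─ Join Inner, (key#0 = key#2)              <- BinaryNode
      ├─ MetastoreRelation default, src, None  <- LeafNode
      └─ MetastoreRelation default, src, None  <- LeafNode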

3.4, Executed Plan

The final step produces the physical execution plan. It involves scanning the Hive table (HiveTableScan), a HashJoin operation, and Exchange operators; Exchange in turn involves shuffle and partitioning:
  scala> query.queryExecution.executedPlan
  res4: org.apache.spark.sql.execution.SparkPlan =
  Project [key#0:0,value#1:1,key#2:2,value#3:3]
   HashJoin [key#0], [key#2], BuildRight
    Exchange (HashPartitioning [key#0:0], ...)
     HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None
    Exchange (HashPartitioning [key#2:0], ...)
     HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None
The resulting physical execution tree, redrawn in text:
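  (Derived from the executedPlan output above; physical node kinds annotated.)

  Project [key#0,value#1,key#2,value#3]                      <- UnaryNode
   └─ HashJoin [key#0], [key#2], BuildRight                  <- BinaryNode
      ├─ Exchange (HashPartitioning [key#0])                 <- UnaryNode
      │   └─ HiveTableScan (MetastoreRelation default, src)  <- LeafNode
      └─ Exchange (HashPartitioning [key#2])                 <- UnaryNode
          └─ HiveTableScan (MetastoreRelation default, src)  <- LeafNode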
Fourth, Summary

This article introduced the TreeNode library at the core of Spark SQL's Catalyst framework, drew the class diagram of the TreeNode inheritance relationships, and explained the role TreeNode plays in Catalyst. LogicalPlan in the syntax tree derives from TreeNode, and LogicalPlan takes the three TreeNode forms, namely BinaryNode, UnaryNode, and LeafNode. These nodes are what compose the Catalyst syntax tree of Spark SQL.

TreeNode's transform method is the core method: it accepts a rule, recursively applies the rule to the current node and its children, and finally returns a TreeNode copy. This transformation runs through several core phases of Spark SQL execution, such as the analyze and optimize phases.

Finally, a practical example showed how the execution trees of Spark SQL are generated.

These points reflect my current understanding; where the analysis falls short, corrections are very welcome.
-- EOF --

Original article; when reproducing, please cite: http://blog.csdn.net/oopsoom/article/details/38084079
