The previous articles covered the core execution process of Spark SQL and the SQL parser of the Catalyst framework, which accepts the user's SQL input and parses it into an unresolved logical plan. Recall that the next core component in Spark SQL's execution process is the Analyzer; this article describes what the Analyzer does.
The Analyzer lives under Catalyst's analysis package. Its main responsibility is to resolve the parts of the logical plan that the SQL parser could not resolve.
1. Analyzer structure
The Analyzer uses a Catalog and a FunctionRegistry to convert UnresolvedAttribute and UnresolvedRelation nodes into fully typed Catalyst objects.
The Analyzer holds a FixedPoint object and a Seq[Batch]:
class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)
  extends RuleExecutor[LogicalPlan] with HiveTypeCoercion {

  // TODO: pass this in as a parameter.
  val fixedPoint = FixedPoint(100)

  val batches: Seq[Batch] = Seq(
    Batch("MultiInstanceRelations", Once,
      NewRelationInstances),
    Batch("CaseInsensitiveAttributeReferences", Once,
      (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil): _*),
    Batch("Resolution", fixedPoint,
      ResolveReferences ::
      ResolveRelations ::
      NewRelationInstances ::
      ImplicitGenerate ::
      StarExpansion ::
      ResolveFunctions ::
      GlobalAggregates ::
      typeCoercionRules: _*),
    Batch("AnalysisOperators", fixedPoint,
      EliminateAnalysisOperators)
  )
}
Explanations of some of the objects used in Analyzer:
FixedPoint: an upper limit on the number of iterations.
/** A strategy that runs until fix point or maxIterations times, whichever comes first. */
case class FixedPoint(maxIterations: Int) extends Strategy
Batch: a batch of rules. A Batch is composed of a series of Rules plus a Strategy (the strategy is essentially an alias for how many times to iterate, e.g. Once).
/** A batch of rules. */
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
Rule: a rewrite rule. A Rule is applied to the logical plan to transform unresolved nodes into resolved ones (a toy sketch follows the code below).
abstract class Rule[TreeType <: TreeNode[_]] extends Logging {

  /** Name for this rule, automatically inferred based on class name. */
  val ruleName: String = {
    val className = getClass.getName
    if (className endsWith "$") className.dropRight(1) else className
  }

  def apply(plan: TreeType): TreeType
}
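To make the Rule contract concrete, here is a minimal self-contained sketch (toy types only, not Spark code) of a rule over a tiny expression tree. Note how a single application may not fully simplify the tree, which is why fixed-point iteration exists:

sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

abstract class Rule[T] {
  // Mimics Catalyst's ruleName inference: Scala object class names end in '$'.
  val ruleName: String = {
    val className = getClass.getName
    if (className endsWith "$") className.dropRight(1) else className
  }
  def apply(plan: T): T
}

// A toy rule that folds Add(Lit, Lit) into a single Lit.
object ConstantFold extends Rule[Expr] {
  def apply(e: Expr): Expr = e match {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
    case Add(l, r)           => Add(apply(l), apply(r))
    case other               => other
  }
}

object RuleDemo extends App {
  val once = ConstantFold(Add(Add(Lit(1), Lit(2)), Lit(3)))
  println(s"${ConstantFold.ruleName}: $once") // ConstantFold: Add(Lit(3),Lit(3))
  // A second application would reach Lit(6): one pass is not always enough.
}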
Strategy: indicates the maximum number of executions. If execution reaches fix point before maxIterations, the strategy stops and the rules are no longer applied.
/**
 * An execution strategy for rules that indicates the maximum number of executions. If the
 * execution reaches fix point (i.e. converge) before maxIterations, it will stop.
 */
abstract class Strategy { def maxIterations: Int }
The Analyzer resolves the unresolved logical plan mainly according to the strategies and rules defined in these batches.
Note that the Analyzer class itself does not define an execution method; it inherits one from its parent class RuleExecutor[LogicalPlan]. Analyzer also mixes in HiveTypeCoercion, which implements implicit type conversions modeled on Hive's.
RuleExecutor: the execution environment for rules. It executes the batches, each containing a series of rules, serially.
The execution logic is defined in apply:
As you can see below, it is a while loop: the rules of each batch are applied to the current plan iteratively, until fix point or the maximum number of iterations is reached. A compact runnable analogue follows the code below.
def apply(plan: TreeType): TreeType = {
  var curPlan = plan

  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1
    var lastPlan = curPlan
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy).
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan) {
        case (plan, rule) =>
          // Invoke each rule's apply method to resolve relations, attributes and functions.
          val result = rule(plan)
          if (!result.fastEquals(plan)) {
            logger.trace(
              s"""
                |=== Applying Rule ${rule.ruleName} ===
                |${sideBySide(plan.treeString, result.treeString).mkString("\n")}
              """.stripMargin)
          }
          result // the plan after the rule has been applied
      }
      iteration += 1
      if (iteration > batch.strategy.maxIterations) {
        // Stop if the iteration count exceeds the strategy's maximum.
        logger.info(s"Max iterations ($iteration) reached for batch ${batch.name}")
        continue = false
      }

      if (curPlan.fastEquals(lastPlan)) {
        // Stop if the plan no longer changes between iterations, i.e. fix point is reached.
        logger.trace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")
        continue = false
      }
      lastPlan = curPlan
    }

    if (!batchStartPlan.fastEquals(curPlan)) {
      logger.debug(
        s"""
          |=== Result of Batch ${batch.name} ===
          |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
        """.stripMargin)
    } else {
      logger.trace(s"Batch ${batch.name} has no effect.")
    }
  }

  curPlan // return the resolved logical plan
}
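For comparison, here is a compact runnable analogue (toy types, not Spark code) of the loop above: it applies a batch's rules repeatedly until the plan stops changing or the strategy's iteration budget runs out:

object RuleExecutorDemo extends App {
  sealed trait Expr
  case class Lit(v: Int) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr

  case class Strategy(maxIterations: Int)
  case class Batch(name: String, strategy: Strategy, rules: Seq[Expr => Expr])

  def execute(batches: Seq[Batch], plan: Expr): Expr = {
    var curPlan = plan
    batches.foreach { batch =>
      var iteration = 1
      var lastPlan = curPlan
      var continue = true
      while (continue) {
        curPlan = batch.rules.foldLeft(curPlan)((p, rule) => rule(p))
        iteration += 1
        if (iteration > batch.strategy.maxIterations) continue = false // budget exhausted
        if (curPlan == lastPlan) continue = false                      // fix point reached
        lastPlan = curPlan
      }
    }
    curPlan
  }

  // Folds Lit+Lit pairs bottom-up; deeply nested sums need several passes.
  def fold(e: Expr): Expr = e match {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
    case Add(l, r)           => Add(fold(l), fold(r))
    case other               => other
  }

  val plan = Add(Add(Lit(1), Lit(2)), Lit(3))
  // A FixedPoint-like budget reaches Lit(6); Strategy(1) would stop at Add(Lit(3),Lit(3)).
  println(execute(Seq(Batch("fold", Strategy(100), Seq(fold))), plan))
}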
2. Rules

The rules in Spark SQL 1.0.0 are defined as inner objects in Analyzer.scala. The batches field defines four batches: MultiInstanceRelations, CaseInsensitiveAttributeReferences, Resolution and AnalysisOperators. Each batch groups rules of a related kind and resolves the plan with its own strategy.

2.1 MultiInstanceRelation

If the same relation instance appears multiple times in the logical plan, the NewRelationInstances rule is applied to it (a toy sketch follows the code below).
Batch ("Multiinstancerelations", Once, newrelationinstances)
trait MultiInstanceRelation {
  def newInstance: this.type
}
object NewRelationInstances extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    // Apply a partial function to collect all MultiInstanceRelation nodes in the plan.
    val localRelations = plan collect { case l: MultiInstanceRelation => l }

    // Group identical instances and keep only those that appear more than once.
    val multiAppearance = localRelations
      .groupBy(identity[MultiInstanceRelation])
      .filter { case (_, ls) => ls.size > 1 }
      .map(_._1)
      .toSet

    // Update the plan so that the expression ID of each instance is unique.
    plan transform {
      case l: MultiInstanceRelation if multiAppearance contains l => l.newInstance
    }
  }
}
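A minimal self-contained sketch of the idea (toy types, not Spark code): when one relation instance occurs more than once, as in a self-join, the repeats get fresh instances so their IDs stay distinct:

object MultiInstanceDemo extends App {
  import java.util.concurrent.atomic.AtomicLong
  val nextId = new AtomicLong(0)

  case class Relation(name: String, id: Long)
  def newInstance(r: Relation): Relation = r.copy(id = nextId.incrementAndGet())

  val src  = Relation("src", nextId.incrementAndGet())
  val plan = Seq(src, src) // a toy "self-join": one instance, two occurrences

  // Mirror the groupBy/filter above: find instances appearing more than once.
  val multiAppearance = plan.groupBy(identity).filter { case (_, ls) => ls.size > 1 }.keySet

  // Mirror the transform: give every such occurrence a fresh instance.
  val fixed = plan.map(r => if (multiAppearance(r)) newInstance(r) else r)
  println(fixed) // List(Relation(src,2), Relation(src,3)): IDs now distinct
}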
2.2 LowercaseAttributeReferences

This is also a partial function applied to the current plan. It converts the alias of every matching UnresolvedRelation to lowercase, and likewise the alias of every Subquery. In short, this rule makes attribute names case insensitive by lowercasing all of them.
object LowercaseAttributeReferences extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UnresolvedRelation(databaseName, name, alias) =>
      UnresolvedRelation(databaseName, name, alias.map(_.toLowerCase))
    case Subquery(alias, child) => Subquery(alias.toLowerCase, child)
    case q: LogicalPlan => q transformExpressions {
      case s: Star => s.copy(table = s.table.map(_.toLowerCase))
      case UnresolvedAttribute(name) => UnresolvedAttribute(name.toLowerCase)
      case Alias(c, name) => Alias(c, name.toLowerCase)()
      case GetField(c, name) => GetField(c, name.toLowerCase)
    }
  }
}
2.3 ResolveReferences

This rule converts every UnresolvedAttribute produced by the SQL parser into the corresponding catalyst.expressions.AttributeReference. It does so by calling the logical plan's resolve method, which turns the attribute into a NamedExpression (a toy sketch follows the code below).
object ResolveReferences extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case q: LogicalPlan if q.childrenResolved =>
      logger.trace(s"Attempting to resolve ${q.simpleString}")
      q transformExpressions {
        case u @ UnresolvedAttribute(name) =>
          // Leave unchanged if resolution fails. Hopefully will be resolved next round.
          val result = q.resolve(name).getOrElse(u) // convert to a NamedExpression
          logger.debug(s"Resolving $u to $result")
          result
      }
  }
}
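A minimal self-contained sketch of this step (toy types, not Spark code): look each unresolved name up in the child's output, and leave it unchanged if the lookup fails so a later iteration can retry:

object ResolveDemo extends App {
  sealed trait Expression
  case class UnresolvedAttribute(name: String) extends Expression
  case class AttributeReference(name: String, exprId: Long) extends Expression

  // Toy stand-in for q.resolve(name): search the child's output attributes.
  val childOutput = Seq(AttributeReference("mobile", 2), AttributeReference("sid", 1))
  def resolve(name: String): Option[Expression] = childOutput.find(_.name == name)

  val exprs: Seq[Expression] = Seq(UnresolvedAttribute("mobile"), UnresolvedAttribute("oops"))
  val resolved = exprs.map {
    case u @ UnresolvedAttribute(name) => resolve(name).getOrElse(u) // unchanged on failure
    case e                             => e
  }
  println(resolved) // List(AttributeReference(mobile,2), UnresolvedAttribute(oops))
}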
2.4 ResolveRelations

This one is easy to understand. Recall the SQL parser: in a query such as SELECT * FROM src, the table src becomes an UnresolvedRelation node after parsing. This step calls the Catalog object, which maintains a HashMap from table name to logical plan. Looking the table up in this catalog yields the structure of the current table, so its fields can be resolved; the UnresolvedRelation becomes a table with qualifiers (i.e. the table plus its fields). This also explains why the flowchart in the earlier article draws a Catalog next to the Analyzer: it is the metadata the Analyzer needs to do its work (a toy sketch follows the code below).
object ResolveRelations extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UnresolvedRelation(databaseName, name, alias) =>
      catalog.lookupRelation(databaseName, name, alias)
  }
}
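A minimal self-contained sketch of the Catalog idea (toy types, not Spark code): a map from table name to logical plan that lookupRelation consults; the table name here is just the one from the practice section below:

object CatalogDemo extends App {
  import scala.collection.mutable
  case class LogicalPlan(description: String)

  val tables = mutable.HashMap[String, LogicalPlan]()
  tables("temp_shengli_mobile") = LogicalPlan("SparkLogicalPlan(ExistingRdd [...])")

  def lookupRelation(name: String): LogicalPlan =
    tables.getOrElse(name, sys.error(s"Table not found: $name"))

  // An UnresolvedRelation("temp_shengli_mobile") would be replaced by this plan:
  println(lookupRelation("temp_shengli_mobile"))
}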
2.5 ImplicitGenerate

If the SELECT statement has only one expression, and that expression is a Generator (a Generator maps one input record to N output records), then when a Project node is encountered while traversing the logical plan, it is converted into a Generate node (Generate applies a function to the input stream to produce a new stream). A toy sketch follows the code below.
object ImplicitGenerate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(Seq(Alias(g: Generator, _)), child) =>
      Generate(g, join = false, outer = false, None, child)
  }
}
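A minimal self-contained sketch of what a generator does (toy types, not Spark code): one input row produces zero or more output rows, in the spirit of Hive's explode:

object GenerateDemo extends App {
  case class Row(values: Seq[Any])

  // The generator: unpack a nested sequence into one output row per element.
  def explode(row: Row): Seq[Row] = row.values.head match {
    case xs: Seq[_] => xs.map(v => Row(Seq(v)))
    case _          => Seq(row)
  }

  val input = Seq(Row(Seq(Seq(1, 2, 3))))
  println(input.flatMap(explode)) // List(Row(List(1)), Row(List(2)), Row(List(3)))
}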
2.6 StarExpansion

In the Project operator, a * symbol (as in SELECT *) can be expanded into all the references it stands for: the * in SELECT * becomes the actual field list (a toy sketch follows the code below).
object StarExpansion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Wait until children are resolved.
    case p: LogicalPlan if !p.childrenResolved => p
    // If the projection list contains Stars, expand it.
    case p @ Project(projectList, child) if containsStar(projectList) =>
      Project(
        projectList.flatMap {
          // Expand the star: s.expand(input: Seq[Attribute]) turns the child's
          // output attributes into a Seq[NamedExpression].
          case s: Star => s.expand(child.output)
          case o => o :: Nil
        },
        child)
    case t: ScriptTransformation if containsStar(t.input) =>
      t.copy(
        input = t.input.flatMap {
          case s: Star => s.expand(t.child.output)
          case o => o :: Nil
        })
    // If the aggregate function argument contains Stars, expand it.
    case a: Aggregate if containsStar(a.aggregateExpressions) =>
      a.copy(
        aggregateExpressions = a.aggregateExpressions.flatMap {
          case s: Star => s.expand(a.child.output)
          case o => o :: Nil
        })
  }

  /** Returns true if `exprs` contains a [[Star]]. */
  protected def containsStar(exprs: Seq[Expression]): Boolean =
    exprs.collect { case _: Star => true }.nonEmpty
}
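A minimal self-contained sketch of the expansion itself (toy types, not Spark code): flatMap splices the child's columns into the projection list wherever a Star occurs:

object StarExpansionDemo extends App {
  sealed trait NamedExpr
  case class Attribute(name: String) extends NamedExpr
  case object Star extends NamedExpr

  val childOutput: Seq[NamedExpr] = Seq(Attribute("data_date"), Attribute("sid"), Attribute("mobile"))

  val projectList: Seq[NamedExpr] = Seq(Star)
  val expanded = projectList.flatMap {
    case Star => childOutput // '*' expands to every column of the child's output
    case o    => o :: Nil
  }
  println(expanded) // List(Attribute(data_date), Attribute(sid), Attribute(mobile))
}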
2.7 ResolveFunctions

This is similar to ResolveReferences, but it resolves functions, mostly UDFs. These functions are looked up in the FunctionRegistry (a toy sketch follows the code below).
object ResolveFunctions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressions {
      case u @ UnresolvedFunction(name, children) if u.childrenResolved =>
        // Look the function up to see whether the UDF is registered.
        registry.lookupFunction(name, children)
    }
  }
}
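A minimal self-contained sketch of the registry lookup (toy types, not Spark code): functions are resolved by name from a map once their arguments are resolved:

object RegistryDemo extends App {
  // Toy stand-in for a resolved function: its arguments in, a value out.
  val registry: Map[String, Seq[Int] => Int] = Map(
    "count" -> (args => args.size),
    "sum"   -> (args => args.sum))

  def lookupFunction(name: String): Seq[Int] => Int =
    registry.getOrElse(name, sys.error(s"Unknown function: $name"))

  println(lookupFunction("sum")(Seq(1, 2, 3))) // 6
}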
2.8 GlobalAggregates

If a Project's projection list contains aggregate expressions, the Project is rewritten into an Aggregate. For example, SELECT count(1) FROM src has no GROUP BY, yet its projection contains an aggregate, so it becomes Aggregate(Nil, projectList, child).
object GlobalAggregates extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if containsAggregates(projectList) =>
      Aggregate(Nil, projectList, child)
  }

  def containsAggregates(exprs: Seq[Expression]): Boolean = {
    exprs.foreach(_.foreach {
      case agg: AggregateExpression => return true
      case _ =>
    })
    false
  }
}
2.9 typeCoercionRules

These rules provide Hive-compatible implicit type conversion, such as converting between String and Int without an explicit CAST xxx AS yyy, e.g. StringToIntegralCasts (a toy sketch follows the rule list below).
val typeCoercionRules =
  PropagateTypes ::
  ConvertNaNs ::
  WidenTypes ::
  PromoteStrings ::
  BooleanComparisons ::
  BooleanCasts ::
  StringToIntegralCasts ::
  FunctionArgumentConversion ::
  CastNulls ::
  Nil
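A minimal self-contained sketch of the coercion idea (toy types, not Spark code): when a comparison mixes a string with a double, wrap the string side in a cast, which is exactly where the CAST(pfrom_id#5, DoubleType) in the practice output below comes from:

object CoercionDemo extends App {
  sealed trait Expr { def dataType: String }
  case class Attr(name: String, dataType: String) extends Expr
  case class Lit(value: Any, dataType: String) extends Expr
  case class Cast(child: Expr, dataType: String) extends Expr
  case class Equals(left: Expr, right: Expr) extends Expr { val dataType = "boolean" }

  // Promote a string operand compared against a double, in the spirit of PromoteStrings.
  def coerce(e: Expr): Expr = e match {
    case Equals(l, r) if l.dataType == "string" && r.dataType == "double" =>
      Equals(Cast(l, "double"), r)
    case other => other
  }

  println(coerce(Equals(Attr("pfrom_id", "string"), Lit(0.0, "double"))))
  // Equals(Cast(Attr(pfrom_id,string),double),Lit(0.0,double))
}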
2.10 EliminateAnalysisOperators

This removes operators that only exist for analysis. Only two node types are handled: Subquery and LowerCaseSchema; both are removed from the logical plan (a toy sketch follows the code below).
object EliminateAnalysisOperators extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // When a Subquery is encountered, return its child, i.e. delete the Subquery node itself.
    case Subquery(_, child) => child
    case LowerCaseSchema(child) => child
  }
}
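A minimal self-contained sketch (toy types, not Spark code): the transform returns the Subquery's child, so the wrapper node disappears from the plan:

object EliminateDemo extends App {
  sealed trait Plan
  case class Subquery(alias: String, child: Plan) extends Plan
  case class Table(name: String) extends Plan

  def eliminate(p: Plan): Plan = p match {
    case Subquery(_, child) => eliminate(child)
    case other              => other
  }

  println(eliminate(Subquery("a", Table("temp_shengli_mobile")))) // Table(temp_shengli_mobile)
}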
3. Practice

Here is the example from yesterday's debugging session, which confirms how the rules above are applied to an unresolved logical plan. When the SQL statement is submitted, ResolveReferences is indeed called to resolve mobile into a NamedExpression. You can follow the execution process in the output below: the unresolved logical plan is on the left, and the resolved logical plan is on the right. The Resolution batch executes first; for example, ResolveRelations converts the UnresolvedRelation into a SparkLogicalPlan and finds the table's fields through the Catalog. The AnalysisOperators batch executes next; for example, EliminateAnalysisOperators removes the Subquery. The format may not display well; drag the horizontal scrollbar to the right to see the full result. :)
val exec = sqlContext.sql("select mobile as mb, sid as id, mobile*2 multi2mobile, count(1) times from (select * from temp_shengli_mobile) a where pfrom_id=0.0 group by mobile, sid, mobile*2")

14/07/21 18:23:32 DEBUG SparkILoop$SparkILoopInterpreter: Invoking: public static java.lang.String $line.$eval.$print()
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'pfrom_id to pfrom_id#5
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch Resolution ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L]   Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0)   Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
   Subquery a   Subquery a
!   Project [*]   Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
!    UnresolvedRelation None, temp_shengli_mobile, None   Subquery temp_shengli_mobile
!     SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)
14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch AnalysisOperators ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L]   Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0)   Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
!  Subquery a   Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
!   Project [*]   SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)
!    UnresolvedRelation None, temp_shengli_mobile, None
Summary

This article analyzed, from a source-code point of view, how the Analyzer resolves the unresolved logical plan produced by the SQL parser. The process is: instantiate a SimpleAnalyzer, define some batches, then traverse the batches in the RuleExecutor environment and execute the rules in each batch. Each rule resolves part of the unresolved logical plan; some plans cannot be fully resolved in one pass and need multiple iterations, until the max iteration count or fix point is reached. The most commonly used rules here are ResolveReferences, ResolveRelations, StarExpansion, GlobalAggregates, typeCoercionRules and EliminateAnalysisOperators.
-- EOF -- Original article; when reposting, please cite the source: http://blog.csdn.net/oopsoom/article/details/38025185