Spark Catalyst Source Code Analysis: SqlParser


The core execution process of Spark SQL was analyzed in a previous article on the Spark SQL core execution process, where we looked at the responsibilities of each core component in that pipeline.

This article starts from the entry point: how SQL text is parsed into a logical plan. The central component is SqlParser, a SQL language parser implemented with Scala parser combinators, which wraps the parsing result as Catalyst TreeNodes. A follow-up article will cover the Catalyst framework itself.

First, the SqlParser entry point. SqlParser essentially wraps a number of Parsers from scala.util.parsing.combinator and, combined with the parsing methods on Parser, produces the Catalyst LogicalPlan.

First, look at the flowchart:

A piece of SQL text is parsed by SqlParser into an Unresolved Logical Plan.

In the source code:

```scala
// SQLContext
def sql(sqlText: String): SchemaRDD = new SchemaRDD(this, parseSql(sqlText)) // sql("select name, value from temp_shengli") instantiates a SchemaRDD

protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql) // instantiates SqlParser

class SqlParser extends StandardTokenParsers with PackratParsers {

  def apply(input: String): LogicalPlan = { // called with the SQL statement as input
    // Special-case out set commands since the value fields can be
    // complex to handle without RegexParsers. Also this approach
    // is clearer for the several possible cases of set commands.
    if (input.trim.toLowerCase.startsWith("set")) {
      input.trim.drop(3).split("=", 2).map(_.trim) match {
        case Array("") =>         // "set"
          SetCommand(None, None)
        case Array(key) =>        // "set key"
          SetCommand(Some(key), None)
        case Array(key, value) => // "set key=value"
          SetCommand(Some(key), Some(value))
      }
    } else {
      phrase(query)(new lexical.Scanner(input)) match {
        case Success(r, x) => r
        case x => sys.error(x.toString)
      }
    }
  }
  ...
}
```

1. When we call sql("select name, value from temp_shengli"), it actually creates a new SchemaRDD.

2. When the new SchemaRDD is constructed, the constructor invokes the parseSql method. parseSql instantiates a SqlParser, and calling the parser invokes its apply method.

3. The apply method branches:

3.1 If the SQL text begins with "set", a SetCommand is produced; this is similar to setting parameters in Hive. SetCommand is actually a Catalyst TreeNode LeafNode and also inherits from LogicalPlan. The TreeNode library of Catalyst is not described in detail at this stage; a later article will explain it.
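The set branch above is plain string manipulation: drop the leading "set", then split on the first "=" only. A minimal standalone sketch, with a hypothetical (Option, Option) pair standing in for Spark's SetCommand node:

```scala
// Sketch of SqlParser's "set" branch; the tuple stands in for SetCommand(key, value).
object SetCommandSplit {
  def parseSet(input: String): (Option[String], Option[String]) =
    input.trim.drop(3).split("=", 2).map(_.trim) match {
      case Array("")         => (None, None)             // "set"           -> list all properties
      case Array(key)        => (Some(key), None)        // "set key"       -> query one property
      case Array(key, value) => (Some(key), Some(value)) // "set key=value" -> assign a property
    }
}
```

For example, SetCommandSplit.parseSet("set spark.sql.shuffle.partitions=10") yields (Some("spark.sql.shuffle.partitions"), Some("10")).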

3.2 The key part is the else block, which is the core of SqlParser's SQL parsing:

```scala
phrase(query)(new lexical.Scanner(input)) match {
  case Success(r, x) => r
  case x => sys.error(x.toString)
}
```
The phrase method may be unfamiliar, so before looking at what it does, let's first look at the SqlParser class diagram:

The SqlParser class inherits from Scala's built-in parser combinator library (StandardTokenParsers, with PackratParsers mixed in). So SqlParser gets a tokenizer and the ability to compose combinator parsers (such as p ~> q, described later).

The phrase method:

```scala
/** A parser generator delimiting whole phrases (i.e. programs).
 *
 *  `phrase(p)` succeeds if `p` succeeds and no input is left over after `p`.
 *
 *  @param p the parser that must consume all input for the resulting parser
 *           to succeed.
 *  @return  a parser that has the same result as `p`, but that only succeeds
 *           if `p` consumed all the input.
 */
def phrase[T](p: Parser[T]) = new Parser[T] {
  def apply(in: Input) = lastNoSuccessVar.withValue(None) {
    p(in) match {
      case s @ Success(out, in1) =>
        if (in1.atEnd)
          s
        else
          lastNoSuccessVar.value filterNot { _.next.pos < in1.pos } getOrElse Failure("end of input expected", in1)
      case ns => lastNoSuccessVar.value.getOrElse(ns)
    }
  }
}
```

phrase takes a parser p and returns a parser that succeeds only if p succeeds and consumes the entire input; if any input characters remain after p, it fails.

Note the Success class that appears in the match; the else branch in apply also matches on Success:

```scala
/** The success case of `ParseResult`: contains the result and the remaining input.
 *
 *  @param result the parser's output
 *  @param next   the parser's remaining input
 */
case class Success[+T](result: T, override val next: Input) extends ParseResult[T]
```
From the source, Success encapsulates the current parser's result together with the remaining, not-yet-parsed input.

So the in1.atEnd check above asks whether the successful parse consumed all input: if the input stream has ended, it returns s, the Success object, which contains the output of SqlParser's parsing.
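To see phrase's "no input left over" behavior in isolation, here is a small sketch (my own illustration, not Spark code) using a RegexParsers-based parser; parseAll runs phrase over the given parser. Note that on Scala 2.11+ this needs the scala-parser-combinators library, which was still part of the standard library in the Scala version Spark 1.0.0 used.

```scala
import scala.util.parsing.combinator.RegexParsers

// Demonstrates phrase: the parse succeeds only if the whole input is consumed.
object PhraseDemo extends RegexParsers {
  val word: Parser[String] = "[a-z]+".r

  // parseAll(p, s) is equivalent to running phrase(p) on a scanner over s.
  def parseWhole(s: String): Either[String, String] =
    parseAll(word, s) match {
      case Success(result, _) => Right(result) // Success carries result + remaining input
      case noSuccess          => Left(noSuccess.toString)
    }
}
```

parseWhole("hello") returns Right("hello"), while parseWhole("hello world") fails because "world" is left over after the first word.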


Second, the SqlParser core

In SqlParser, phrase is given two arguments (in curried form):

The first is query, the parsing rule for the SQL grammar, whose result is a LogicalPlan.

The second is a lexical Scanner over the input.

SqlParser's parsing process is thus: the lexical scanner tokenizes the input into SQL keywords and identifiers, and the query rule parses any token stream that conforms to the grammar.


2.1 Lexical keywords. SqlParser defines a Keyword class:

```scala
protected case class Keyword(str: String)
```
In the Spark 1.0.0 version I used, only the following SQL reserved words are supported:
```scala
protected val ALL = Keyword("ALL")
protected val AND = Keyword("AND")
protected val AS = Keyword("AS")
protected val ASC = Keyword("ASC")
protected val APPROXIMATE = Keyword("APPROXIMATE")
protected val AVG = Keyword("AVG")
protected val BY = Keyword("BY")
protected val CACHE = Keyword("CACHE")
protected val CAST = Keyword("CAST")
protected val COUNT = Keyword("COUNT")
protected val DESC = Keyword("DESC")
protected val DISTINCT = Keyword("DISTINCT")
protected val FALSE = Keyword("FALSE")
protected val FIRST = Keyword("FIRST")
protected val FROM = Keyword("FROM")
protected val FULL = Keyword("FULL")
protected val GROUP = Keyword("GROUP")
protected val HAVING = Keyword("HAVING")
protected val IF = Keyword("IF")
protected val IN = Keyword("IN")
protected val INNER = Keyword("INNER")
protected val INSERT = Keyword("INSERT")
protected val INTO = Keyword("INTO")
protected val IS = Keyword("IS")
protected val JOIN = Keyword("JOIN")
protected val LEFT = Keyword("LEFT")
protected val LIMIT = Keyword("LIMIT")
protected val MAX = Keyword("MAX")
protected val MIN = Keyword("MIN")
protected val NOT = Keyword("NOT")
protected val NULL = Keyword("NULL")
protected val ON = Keyword("ON")
protected val OR = Keyword("OR")
protected val OVERWRITE = Keyword("OVERWRITE")
protected val LIKE = Keyword("LIKE")
protected val RLIKE = Keyword("RLIKE")
protected val UPPER = Keyword("UPPER")
protected val LOWER = Keyword("LOWER")
protected val REGEXP = Keyword("REGEXP")
protected val ORDER = Keyword("ORDER")
protected val OUTER = Keyword("OUTER")
protected val RIGHT = Keyword("RIGHT")
protected val SELECT = Keyword("SELECT")
protected val SEMI = Keyword("SEMI")
protected val STRING = Keyword("STRING")
protected val SUM = Keyword("SUM")
protected val TABLE = Keyword("TABLE")
protected val TRUE = Keyword("TRUE")
protected val UNCACHE = Keyword("UNCACHE")
protected val UNION = Keyword("UNION")
protected val WHERE = Keyword("WHERE")
```
From these reserved words, collected via reflection, a SqlLexical is generated:

```scala
override val lexical = new SqlLexical(reservedWords)
```
SqlLexical uses its Scanner parser to read the input and hand the resulting tokens to query.
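As a sketch of what the scanner does, StdLexical (the base class that SqlLexical extends) can be driven directly to see the token stream. The reserved words below are an arbitrary subset chosen for illustration:

```scala
import scala.util.parsing.combinator.lexical.StdLexical

// Tokenize a string with StdLexical, the base class of Spark's SqlLexical.
object LexDemo {
  val lexical = new StdLexical
  lexical.reserved ++= Seq("SELECT", "FROM") // illustrative reserved words

  def tokens(s: String): List[String] = {
    var scanner = new lexical.Scanner(s)
    val buf = scala.collection.mutable.ListBuffer[String]()
    while (!scanner.atEnd) {
      buf += scanner.first.chars // the text of the current token
      scanner = scanner.rest
    }
    buf.toList
  }
}
```

LexDemo.tokens("SELECT name FROM temp_shengli") yields the token texts List("SELECT", "name", "FROM", "temp_shengli"): "SELECT" and "FROM" come back as keyword tokens, the rest as identifiers.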
2.2 The query rule. query is defined as Parser[LogicalPlan] and built from a bunch of strange-looking connectors (in fact all Parser methods): |, ~, ^^, ^^^ and so on, which look confusing at first. Having read through the source, here are the common ones:
| is the alternation combinator. It succeeds if either the left or the right operand parses successfully, similar to a logical OR.

~ is the sequential combinator. It succeeds if the left operand parses successfully and the right operand then parses successfully on the remaining input.

opt: opt(p) is a parser that returns Some(x) if p succeeds with result x, and None if p fails; opt itself always succeeds.

^^^: p ^^^ v succeeds if p succeeds; it discards p's result and returns v instead.

~> succeeds if the left operand parses successfully followed by the right, but the left operand's result is not included. For example:

```scala
protected lazy val limit: Parser[Expression] = LIMIT ~> expression
```

<~ is the reverse: it succeeds if the left operand parses successfully followed by the right, but the right operand's result is not included. For example:

```scala
termExpression <~ IS ~ NOT ~ NULL ^^ { case e => IsNotNull(e) }
```

^^ (often written ^^ { ... }) is the transformation combinator: if the left operand parses successfully, its result is transformed by the function on the right.

rep: rep(x) simply expects any number of repetitions of the parser x passed as its argument.
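These combinators can be exercised with a toy grammar of my own (not Spark's): ~> drops the keyword, ^^ transforms the parse result, | tries alternatives, and ^^^ substitutes a constant.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// A toy grammar exercising ~>, ^^, | and ^^^.
object CombinatorDemo extends JavaTokenParsers {
  case class Limit(n: Int)

  // "LIMIT" ~> wholeNumber keeps only the number; ^^ converts it to a Limit node.
  val limit: Parser[Limit] = "LIMIT" ~> wholeNumber ^^ { n => Limit(n.toInt) }

  // | tries the left alternative first, then the right; ^^^ returns a constant.
  val unionOp: Parser[String] =
    "UNION" ~ "ALL" ^^^ "union-all" |
    "UNION"         ^^^ "union-distinct"

  def parseLimit(s: String): Option[Limit] =
    parseAll(limit, s) match {
      case Success(r, _) => Some(r)
      case _             => None
    }

  def parseUnion(s: String): Option[String] =
    parseAll(unionOp, s) match {
      case Success(r, _) => Some(r)
      case _             => None
    }
}
```

Here parseLimit("LIMIT 50") yields Some(Limit(50)), while parseUnion("UNION ALL") and parseUnion("UNION") yield "union-all" and "union-distinct" respectively, mirroring the union alternatives in query below.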
Next look at the definition of query:
```scala
protected lazy val query: Parser[LogicalPlan] = (
  select * (
    UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |
    UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
  )
  | insert | cache
)
```
Yes, query returns a Parser whose result type is LogicalPlan. Its definition is really a grammar pattern built with the operators above (|, ^^^, ~>, and so on). Given a SQL statement such as select xxx from yyy where ccc = ddd, if the text matches the pattern, Success is returned; otherwise Failure.

The pattern here says that a select may be followed by UNION ALL or UNION DISTINCT clauses. Such a statement is legal; anything else is an error.
select a, b from c union all select e, f from g
The * here is a repetition combinator, so multiple UNION ALL clauses are supported.

It appears that the current Spark 1.0.0 supports only these three top-level modes: select, insert, and cache.

So how exactly is the LogicalPlan built? Let's look at one more detail:
```scala
protected lazy val select: Parser[LogicalPlan] =
  SELECT ~> opt(DISTINCT) ~ projections ~
  opt(from) ~ opt(filter) ~
  opt(grouping) ~
  opt(having) ~
  opt(orderBy) ~
  opt(limit) <~ opt(";") ^^ {
    case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l =>
      val base = r.getOrElse(NoRelation)
      val withFilter = f.map(f => Filter(f, base)).getOrElse(base)
      val withProjection =
        g.map { g =>
          Aggregate(assignAliases(g), assignAliases(p), withFilter)
        }.getOrElse(Project(assignAliases(p), withFilter))
      val withDistinct = d.map(_ => Distinct(withProjection)).getOrElse(withProjection)
      val withHaving = h.map(h => Filter(h, withDistinct)).getOrElse(withDistinct)
      val withOrder = o.map(o => Sort(o, withHaving)).getOrElse(withHaving)
      val withLimit = l.map { l => Limit(l, withOrder) }.getOrElse(withOrder)
      withLimit
  }
```
I call this the select mode. Look at what pattern this SELECT statement supports: SELECT [DISTINCT] projections [FROM ...] [WHERE ...] [GROUP BY ...] [HAVING ...] [ORDER BY ...] [LIMIT ...].

Here is a SQL statement matching the select mode. Note that clauses wrapped in the opt connector are optional; for example, DISTINCT may be written or omitted.
select game_id, user_name from game_log where date <= '2014-07-19' and user_name = 'Shengli' group by game_id having game_id > 1 order by game_id limit 50
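The chain of map/getOrElse calls in the select rule wraps the plan from the inside out. A simplified sketch of that idea, using hypothetical plain-Scala stand-ins for Catalyst's plan nodes:

```scala
// Hypothetical, simplified stand-ins for Catalyst plan nodes.
sealed trait Plan
case class Table(name: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

object SelectBuilder {
  // Each optional clause either wraps the plan in a new node or leaves it alone,
  // mirroring f.map(f => Filter(f, base)).getOrElse(base) in the select rule.
  def build(table: String, where: Option[String], limit: Option[Int]): Plan = {
    val base = Table(table)
    val withFilter = where.map(w => Filter(w, base)).getOrElse(base)
    val withLimit = limit.map(n => Limit(n, withFilter)).getOrElse(withFilter)
    withLimit
  }
}
```

SelectBuilder.build("game_log", Some("game_id > 1"), Some(50)) produces Limit(50, Filter("game_id > 1", Table("game_log"))); omitting a clause simply skips that wrapper.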

What are projections? They are expressions: the type is Seq[Expression], a series of expressions such as game_id, or game_id as gmid. The result is indeed Expression, i.e. a Catalyst TreeNode.
```scala
protected lazy val projections: Parser[Seq[Expression]] = repsep(projection, ",")

protected lazy val projection: Parser[Expression] =
  expression ~ (opt(AS) ~> opt(ident)) ^^ {
    case e ~ None    => e
    case e ~ Some(a) => Alias(e, a)()
  }
```

What about the from pattern? It is actually relations, i.e. a relation: in SQL it can be a table, or a table joined with other tables.
```scala
protected lazy val from: Parser[LogicalPlan] = FROM ~> relations

protected lazy val relation: Parser[LogicalPlan] =
  joinedRelation |
  relationFactor

protected lazy val relationFactor: Parser[LogicalPlan] =
  ident ~ (opt(AS) ~> opt(ident)) ^^ {
    case tableName ~ alias => UnresolvedRelation(None, tableName, alias)
  } |
  "(" ~> query ~ ")" ~ opt(AS) ~ ident ^^ { case s ~ _ ~ _ ~ a => Subquery(a, s) }

protected lazy val joinedRelation: Parser[LogicalPlan] =
  relationFactor ~ opt(joinType) ~ JOIN ~ relationFactor ~ opt(joinConditions) ^^ {
    case r1 ~ jt ~ _ ~ r2 ~ cond =>
      Join(r1, r2, joinType = jt.getOrElse(Inner), cond)
  }
```

These are operations between tables; the Subquery returned here is indeed a LogicalPlan:
```scala
case class Subquery(alias: String, child: LogicalPlan) extends UnaryNode {
  override def output = child.output.map(_.withQualifiers(alias :: Nil))
  override def references = Set.empty
}
```

Scala has a lot of syntactic sugar, which keeps the code concise but may be a bit obscure for beginners.

At this point we know how SqlParser generates a LogicalPlan.


Third, summary. This article analyzed, from the source code, how Spark Catalyst parses SQL into a logical plan. SQL text is the input; a SqlParser is instantiated and its apply method called, which handles two kinds of input: set commands and SQL queries. A set command generates a leaf node, SetCommand. For a SQL statement, the parser's phrase method is invoked: the lexical scanner scans and tokenizes the input, and the query rule we defined, built from Parser combinators, checks whether the SQL conforms to the grammar. If it does, a LogicalPlan syntax tree is generated; otherwise parsing fails with an error. By reading through Spark Catalyst's SqlParser, I came to understand how a SQL grammar standard is implemented and how SQL text is parsed into a logical plan syntax tree.
--eof--

Original article; when reposting, please cite the source: http://blog.csdn.net/oopsoom/article/details/37943507

