Spark SQL Catalyst Source Code Analysis: Physical Plan to RDD Implementation


Following the previous article, Spark SQL Catalyst Source Code Analysis: Physical Plan, this article covers the specifics of how a physical plan is turned into an RDD (toRdd).
We all know that a SQL query is really executed only when its collect() method is called: that is what submits the Spark job and finally computes the RDD.

lazy val toRdd: RDD[Row] = executedPlan.execute()
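As a minimal end-to-end sketch of that entry point, assuming the Spark 1.0.x-era API this series analyses (the Person class and the people table are made up for illustration), the following can be pasted into spark-shell, where sc is already defined:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)                 // hypothetical demo schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                          // implicit RDD[Product] => SchemaRDD conversion

val people = sc.parallelize(Seq(Person("a", 20), Person("b", 30)))
people.registerAsTable("people")                           // old 1.0-style table registration

val query = sqlContext.sql("SELECT name FROM people WHERE age > 25")
// Nothing has run yet: query only wraps a plan.
query.collect()                                            // collect() drives executedPlan.execute() and runs the job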
The SparkPlan operators fall into roughly four categories: the BasicOperator basic type, and the slightly more complex Join, Aggregate and Sort.

1. BasicOperator

1.1 Project

The general idea of Project is: given a series of expressions Seq[NamedExpression] and an input Row, generate a new Row by evaluating each expression (the expression's eval). Project's implementation calls its child.execute() method and then uses mapPartitions to transform each partition.
The function f passed to mapPartitions simply creates a new MutableProjection once per partition and then maps it over every row of that partition to do the conversion.
case class Project(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode {
  override def output = projectList.map(_.toAttribute)

  override def execute() = child.execute().mapPartitions { iter => // the function f, applied to every partition
    @transient val reusableProjection = new MutableProjection(projectList)
    iter.map(reusableProjection)
  }
}
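Continuing the hypothetical spark-shell session above, the physical plan can be inspected before it is run; for a simple column selection it is rooted at a Project node (the exact plan text varies by version, so the output shown is only indicative):

println(sqlContext.sql("SELECT name FROM people").queryExecution.executedPlan)
// Indicative output:
//   Project [name#0]
//    ExistingRdd [name#0,age#1], MapPartitionsRDD[...]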
Looking at the definition of MutableProjection reveals the process of binding references to a schema and then evaluating: it converts one Row into another Row whose columns follow the defined schema.
If the input rows carry a schema, the incoming Seq[Expression] is bound to that schema as well.
case class MutableProjection(expressions: Seq[Expression]) extends (Row => Row) {
  def this(expressions: Seq[Expression], inputSchema: Seq[Attribute]) =
    this(expressions.map(BindReferences.bindReference(_, inputSchema))) // bind the expressions to the input schema

  private[this] val exprArray = expressions.toArray
  private[this] val mutableRow = new GenericMutableRow(exprArray.size) // the reusable new row

  def currentValue: Row = mutableRow

  def apply(input: Row): Row = {
    var i = 0
    while (i < exprArray.length) {
      mutableRow(i) = exprArray(i).eval(input) // compute each output column from the input row
      i += 1
    }
    mutableRow // return the new row
  }
}
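To make the buffer-reuse pattern concrete, here is a toy, Spark-free analogue of the Project/MutableProjection pair that can be pasted into a Scala REPL (ToyRow and ToyMutableProjection are made-up names, not Catalyst classes). It also shows why operators that buffer rows must copy() them first: the same output buffer is handed back for every input row.

type ToyRow = Array[Any]

// Each "expression" is modelled as a plain function from the input row to one output value.
class ToyMutableProjection(expressions: Seq[ToyRow => Any]) extends (ToyRow => ToyRow) {
  private[this] val exprArray = expressions.toArray
  private[this] val mutableRow = new Array[Any](exprArray.length) // one reusable output buffer

  def apply(input: ToyRow): ToyRow = {
    var i = 0
    while (i < exprArray.length) {
      mutableRow(i) = exprArray(i)(input) // "eval" each expression against the input row
      i += 1
    }
    mutableRow // the same buffer is handed back for every input row, as in Catalyst
  }
}

val project = new ToyMutableProjection(Seq(row => row(0), row => row(1).asInstanceOf[Int] + 1))
val rows = Iterator[ToyRow](Array("a", 1), Array("b", 2))
rows.map(project).foreach(r => println(r.mkString("[", ", ", "]"))) // [a, 2] then [b, 3]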
   
1.2 Filter

Filter's implementation evaluates the incoming condition (eval) against each input Row; the result is a Boolean. If the expression evaluates to true, the row is kept in its partition, otherwise it is filtered out.
case class Filter(condition: Expression, child: SparkPlan) extends UnaryNode {
  override def output = child.output

  override def execute() = child.execute().mapPartitions { iter =>
    iter.filter(condition.eval(_).asInstanceOf[Boolean]) // evaluate the condition against each input row
  }
}
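The same mapPartitions-plus-filter pattern, shown on a plain RDD in spark-shell, with an ordinary predicate standing in for condition.eval:

val nums = sc.parallelize(1 to 10)
nums.mapPartitions(iter => iter.filter(_ % 2 == 0)).collect()   // Array(2, 4, 6, 8, 10)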
1.3 Sample

Sample simply calls child.execute(), which returns an RDD, and then invokes that RDD's native sample method.
case class Sample(fraction: Double, withReplacement: Boolean, seed: Long, child: SparkPlan)
  extends UnaryNode {
  override def output = child.output

  // TODO: How to pick seed?
  override def execute() = child.execute().sample(withReplacement, fraction, seed)
}
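Since the operator just defers to the RDD API, the equivalent native call in spark-shell looks like this:

val nums = sc.parallelize(1 to 100)
nums.sample(withReplacement = false, fraction = 0.1, seed = 42L).count()   // roughly 10 elements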
1.4 Union

The Union operation supports multiple subqueries, so the child passed in is a Seq[SparkPlan]. Its execute() implementation calls execute() on every child, i.e. obtains the result RDD of each select subquery, and then merges all of them by calling SparkContext's union method.
case class Union(children: Seq[SparkPlan])(@transient sqlContext: SQLContext) extends SparkPlan {
  // TODO: Attributes output by Union should be distinct for nullability purposes
  override def output = children.head.output

  override def execute() = sqlContext.sparkContext.union(children.map(_.execute())) // union of the subquery result RDDs

  override def otherCopyArgs = sqlContext :: Nil
}
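The underlying SparkContext.union call on plain RDDs, for reference (spark-shell):

val a = sc.parallelize(Seq(1, 2))
val b = sc.parallelize(Seq(3, 4))
sc.union(Seq(a, b)).collect()   // Array(1, 2, 3, 4)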
1.5 Limit

The Limit operation is also available in the RDD's native API, namely take(). However, the implementation of Limit distinguishes two cases. The first is when limit is the final operator, i.e. select xxx from yyy limit zzz: execution goes through executeCollect and the take method is called directly in the driver. The second is when limit is not the final operator, i.e. another query follows the limit: then limit is applied inside each partition first, and finally the results are repartitioned into a single partition where the global limit is computed.
case class Limit(limit: Int, child: SparkPlan)(@transient sqlContext: SQLContext) extends UnaryNode {
  // TODO: Implement a partition local limit, and use a strategy to generate the proper limit plan:
  // partition local limit -> exchange into one partition -> partition local limit again

  override def otherCopyArgs = sqlContext :: Nil

  override def output = child.output

  override def executeCollect() = child.execute().map(_.copy()).take(limit) // call take directly in the driver

  override def execute() = {
    val rdd = child.execute().mapPartitions { iter =>
      val mutablePair = new MutablePair[Boolean, Row]()
      iter.take(limit).map(row => mutablePair.update(false, row)) // each partition first computes its local limit
    }
    val part = new HashPartitioner(1)
    val shuffled = new ShuffledRDD[Boolean, Row, Row, MutablePair[Boolean, Row]](rdd, part) // shuffle to repartition into one partition
    shuffled.setSerializer(new SparkSqlSerializer(new SparkConf(false)))
    shuffled.mapPartitions(_.take(limit).map(_._2)) // take the global limit on the single remaining partition
  }
}
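The two-phase idea can be sketched with plain RDD operations in spark-shell; note that Catalyst itself builds a ShuffledRDD over HashPartitioner(1) with its own serializer rather than calling repartition:

val data = sc.parallelize(1 to 1000, numSlices = 8)
val limited = data
  .mapPartitions(_.take(10))   // phase 1: local limit inside every partition
  .repartition(1)              // shuffle the survivors into a single partition
  .mapPartitions(_.take(10))   // phase 2: global limit on that single partition
limited.count()                // 10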
1.6 TakeOrdered

TakeOrdered is a sorted limit N, generally used for a limit that follows a sort by operator. It can simply be thought of as the TopN operator.
case class TakeOrdered(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
                      (@transient sqlContext: SQLContext) extends UnaryNode {
  override def otherCopyArgs = sqlContext :: Nil
  override def output = child.output

  @transient
  lazy val ordering = new RowOrdering(sortOrder) // the ordering is implemented by RowOrdering

  override def executeCollect() = child.execute().map(_.copy()).takeOrdered(limit)(ordering)

  // TODO: Terminal split should be implemented differently from non-terminal split.
  // TODO: Pick num splits based on |limit|.
  override def execute() = sqlContext.sparkContext.makeRDD(executeCollect(), 1)
}
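The driver-side path boils down to RDD.takeOrdered with a row ordering; a plain-RDD equivalent in spark-shell, using made-up (name, age) pairs ordered by age:

val pairs = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))
pairs.takeOrdered(2)(Ordering.by[(String, Int), Int](_._2))   // Array((b,1), (c,2)), i.e. top 2 by ascending age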
1.7 Sort

Sort is likewise implemented through the RowOrdering class. child.execute() is mapped over each partition, and each partition is sorted according to the RowOrdering order, producing a new ordered collection. Concretely, each partition is copied into an array and sorted with Scala's sorted method, passing in the ordering.
case class Sort(
    sortOrder: Seq[SortOrder],
    global: Boolean,
    child: SparkPlan)
  extends UnaryNode {

  override def requiredChildDistribution =
    if (global) OrderedDistribution(sortOrder) :: Nil else UnspecifiedDistribution :: Nil

  @transient
  lazy val ordering = new RowOrdering(sortOrder) // the sort order

  override def execute() = attachTree(this, "sort") {
    // TODO: Optimize sorting operation?
    child.execute()
      .mapPartitions(
        iterator => iterator.map(_.copy()).toArray.sorted(ordering).iterator, // each partition calls sorted with the ordering rule
        preservesPartitioning = true)
  }

  override def output = child.output
}
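A plain-RDD sketch of the per-partition sort in spark-shell; when global = true the planner additionally has to satisfy the OrderedDistribution requirement, i.e. insert a range-partitioning Exchange before this step:

val nums = sc.parallelize(Seq(3, 1, 2, 6, 5, 4), numSlices = 2)
val partitionSorted = nums.mapPartitions(
  iter => iter.toArray.sorted.iterator,   // sort inside each partition only
  preservesPartitioning = true)
partitionSorted.glom().collect()          // e.g. Array(Array(1, 2, 3), Array(4, 5, 6))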
1.8 ExistingRdd

ExistingRdd wraps an existing RDD of Scala Products (for example case classes) as an RDD[Row]: convertToCatalyst turns Scala values (Option, Seq, Product) into Catalyst row values; productToRowRdd walks each partition, reusing a single GenericMutableRow and filling it from every product's elements; and fromProductRdd combines that row RDD with the schema obtained from ScalaReflection.attributesFor[A].
object ExistingRdd {
  def convertToCatalyst(a: Any): Any = a match {
    case o: Option[_] => o.orNull
    case s: Seq[Any] => s.map(convertToCatalyst)
    case p: Product => new GenericRow(p.productIterator.map(convertToCatalyst).toArray)
    case other => other
  }

  def productToRowRdd[A <: Product](data: RDD[A]): RDD[Row] = {
    data.mapPartitions { iterator =>
      if (iterator.isEmpty) {
        Iterator.empty
      } else {
        val bufferedIterator = iterator.buffered
        val mutableRow = new GenericMutableRow(bufferedIterator.head.productArity) // one reusable row per partition

        bufferedIterator.map { r =>
          var i = 0
          while (i < mutableRow.length) {
            mutableRow(i) = convertToCatalyst(r.productElement(i))
            i += 1
          }
          mutableRow
        }
      }
    }
  }

  def fromProductRdd[A <: Product : TypeTag](productRdd: RDD[A]) = {
    ExistingRdd(ScalaReflection.attributesFor[A], productToRowRdd(productRdd))
  }
}
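This is the path a case-class RDD takes when it becomes a table. Continuing the hypothetical spark-shell session from the first sketch, the createSchemaRDD implicit of this Spark era goes through ExistingRdd.fromProductRdd, deriving Person's attributes via ScalaReflection and reusing one GenericMutableRow per partition while converting the objects:

val morePeople = sc.parallelize(Seq(Person("c", 40), Person("d", 50)))
val schemaRDD = sqlContext.createSchemaRDD(morePeople)   // wraps ExistingRdd.fromProductRdd(morePeople)
schemaRDD.registerAsTable("more_people")
sqlContext.sql("SELECT name FROM more_people WHERE age > 45").collect()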

To be continued :)

Original article; when reposting, please credit: http://blog.csdn.net/oopsoom/article/details/38274621
