Spark SQL Source Code Analysis: In-Memory Columnar Storage, In-Memory Query


/** Spark SQL Source Code Analysis Series */

As described in the previous article, Spark SQL's in-memory columnar storage lays cached data out column by column.

Given that storage structure, how is the data cached in the JVM actually queried? This article walks through how in-memory data is read back.

First, the primer

This example uses the Hive console to query the src table after it has been cached:

select value from src
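As a minimal setup sketch (assuming the sbt hive/console of that era, which imports TestHive._ and ships with the src sample table):

scala> cacheTable("src")                      // mark src for in-memory caching; the columnar buffers are materialized on the first action
scala> sql("select value from src").collect() // subsequent scans read the in-memory columnar buffers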

After caching the src table in memory, we query src again, and we can observe the internal calls through the analyzed query plan.

That is, after parsing, an InMemoryRelation node is formed, and when the physical plan is finally executed, InMemoryColumnarTableScan is invoked.

For example:

scala> val exe = executePlan(sql("select value from src").queryExecution.analyzed)
14/09/26 10:30:26 INFO parse.ParseDriver: Parsing command: select value from src
14/09/26 10:30:26 INFO parse.ParseDriver: Parse Completed
exe: org.apache.spark.sql.hive.test.TestHive.QueryExecution =
== Parsed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Analyzed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Optimized Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

== Physical Plan ==
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)) // the entry point for querying the in-memory table

Code Generation: false
== RDD ==

Second, InMemoryColumnarTableScan

InMemoryColumnarTableScan is a leaf node in Catalyst. It holds the attributes to query and an InMemoryRelation, which encapsulates our cached in-memory columnar data structure.

Executing this leaf node calls its execute() method to query the in-memory data, as the source below shows:
1. On execution, it operates on each partition of the in-memory data structure encapsulated by InMemoryRelation.
2. It collects the requested attributes; in the example above, only the value column of src is requested.
3. For each requested attribute, it looks up the index of the corresponding column in the storage structure.
4. It reads each column's buffer through a ColumnAccessor, extracts the matching data, and wraps it into a Row object to return.

private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation)
  extends LeafNode {

  override def output: Seq[Attribute] = attributes

  override def execute() = {
    relation.cachedColumnBuffers.mapPartitions { iterator =>
      // Find the ordinals of the requested columns. If none are requested, use the first.
      val requestedColumns = if (attributes.isEmpty) {
        Seq(0)
      } else {
        // Use each expression's exprId to find the index of the corresponding column's ByteBuffer
        attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
      }

      iterator
        // Pick the requested columns' ByteBuffers by index and wrap each in a ColumnAccessor
        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
        .flatMap { columnAccessors =>
          val nextRow = new GenericMutableRow(columnAccessors.length) // row width = number of requested columns
          new Iterator[Row] {
            override def next() = {
              var i = 0
              while (i < nextRow.length) {
                // Pull the next value of column i out of its ByteBuffer into the row
                columnAccessors(i).extractTo(nextRow, i)
                i += 1
              }
              nextRow
            }

            override def hasNext = columnAccessors.head.hasNext
          }
        }
    }
  }
}

Resolving the requested columns, for example:

scala> exe.optimizedPlan
res93: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [value#5]
 InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val relation = exe.optimizedPlan(1)
relation: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val request_relation = exe.executedPlan
request_relation: org.apache.spark.sql.execution.SparkPlan =
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, ..., (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))

scala> request_relation.output // the requested columns; we asked for only the value column
res95: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)

scala> relation.output // all columns held by the relation by default
res96: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(key#4, value#5)

scala> val attributes = request_relation.output
attributes: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)


The process is very concise, and the key is step 3: the index of each requested column is found by exprId:

attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))

scala> val attr_index = attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId)) // find the index by exprId
attr_index: Seq[Int] = ArrayBuffer(1) // the requested value column has index 1, so we read its data from the ByteBuffer at index 1

scala> relation.output.foreach(e => println(e.exprId)) // corresponds to [key#4,value#5]
ExprId(4)
ExprId(5)

scala> request_relation.output.foreach(e => println(e.exprId))
ExprId(5)
Third, ColumnAccessor

There is a ColumnAccessor for each data type; the original post shows the class hierarchy as a diagram.
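As a rough textual sketch of that hierarchy (abbreviated from the org.apache.spark.sql.columnar package of that era; treat it as illustrative rather than verbatim):

- ColumnAccessor: the root trait, exposing hasNext and extractTo(row, ordinal)
- BasicColumnAccessor: decodes values of one ColumnType from a ByteBuffer
- NativeColumnAccessor: layers null handling and decompression on top of BasicColumnAccessor
- one concrete accessor per data type: IntColumnAccessor, LongColumnAccessor, DoubleColumnAccessor, StringColumnAccessor, BinaryColumnAccessor, GenericColumnAccessor, and so on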


Finally, a new iterator is returned:

new Iterator[Row] {
  override def next() = {
    var i = 0
    while (i < nextRow.length) { // length = number of requested columns
      // Calls columnType.setField(row, ordinal, extractSingle(buffer)) to parse the buffer
      columnAccessors(i).extractTo(nextRow, i)
      i += 1
    }
    nextRow // return the parsed row
  }

  override def hasNext = columnAccessors.head.hasNext
}
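Per the comment in the loop above, each extractTo call reduces to roughly the following inside the accessor (a simplified sketch based on that comment, not verbatim Spark source):

def extractTo(row: MutableRow, ordinal: Int): Unit = {
  // extractSingle reads the next value of this column's type from the ByteBuffer;
  // columnType.setField then writes it into the row at position `ordinal`
  columnType.setField(row, ordinal, extractSingle(buffer))
}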

Fourth, summary

Querying Spark SQL's in-memory columnar storage is relatively simple; the query logic follows directly from the stored data structure.

That is, at storage time each column is placed into its own ByteBuffer, forming an array of ByteBuffers.

At query time, the exprId of each requested column is used to find its index in that array; a ColumnAccessor then parses the fields out of the buffer, and the values are wrapped into Row objects and returned.
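To make that flow concrete, here is a toy, self-contained sketch of the same idea in plain Scala (not Spark code; the names, such as ColumnarQueryDemo, are made up for illustration):

import java.nio.ByteBuffer

object ColumnarQueryDemo extends App {
  // "Storage": one ByteBuffer per column, collected into an array
  val columnNames = Array("key", "value") // plays the role of relation.output
  val columnData  = Array(Array(1, 2, 3), Array(10, 20, 30))
  val buffers: Array[ByteBuffer] = columnData.map { col =>
    val buf = ByteBuffer.allocate(col.length * 4)
    col.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  // "Query": resolve the requested column to an index (the exprId lookup analogue)
  val requested = "value"
  val idx = columnNames.indexWhere(_ == requested)

  // "Accessor": drain that column's buffer, yielding one value per row
  val accessor = buffers(idx)
  while (accessor.hasRemaining) println(accessor.getInt()) // prints 10, 20, 30
}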

--eof--

Original article; when reposting, please credit the source: http://blog.csdn.net/oopsoom/article/details/39577419
