/** Spark SQL Source Analysis Series Article */
As mentioned earlier, Spark SQL's in-memory columnar storage keeps cached data in a column-oriented layout.
Given that storage structure, how is the cached data inside the JVM actually queried? This article reveals the way in-memory data is queried.
First, a primer
This example uses the Hive console to query the src table after it has been cached:
select value from src
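For reference, the caching step itself looks roughly like this in the Hive console (a minimal sketch; cacheTable is the standard SQLContext API and is in scope in the console, but the exact session shown here is illustrative):

// In the Hive console (sbt/sbt hive/console): cache src into the in-memory columnar store.
cacheTable("src")
// The cache is materialized on the first action; this scan is then served from column buffers.
sql("select value from src").collect()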
Once we have cached the src table in memory and query src again, we can observe the internal calls through the analyzed execution plan.
That is, after parsing, an InMemoryRelation node is formed, and when the physical plan is finally executed, the InMemoryColumnarTableScan operator over that node is invoked.
As follows:
scala> val exe = executePlan(sql("select value from src").queryExecution.analyzed)
14/09/… INFO parse.ParseDriver: Parsing command: select value from src
14/09/… INFO parse.ParseDriver: Parse Completed
exe: org.apache.spark.sql.hive.test.TestHive.QueryExecution =
== Parsed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Analyzed Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Optimized Logical Plan ==
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Physical Plan ==
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)) // the entry point for querying the in-memory table
Code Generation: false
== RDD ==
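The mapping from the cached InMemoryRelation to the InMemoryColumnarTableScan operator is done by a planner strategy. A sketch paraphrasing Spark 1.1's InMemoryScans strategy in SparkStrategies.scala (abbreviated; in the real source it is nested inside SparkPlanner):

// When the planner sees a projection (and optional filters) over an InMemoryRelation,
// it plans an InMemoryColumnarTableScan over the requested columns.
object InMemoryScans extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projectList, filters, mem: InMemoryRelation) =>
      pruneFilterProject(
        projectList,
        filters,
        identity[Seq[Expression]], // no filters are pushed down into the scan
        InMemoryColumnarTableScan(_, mem)) :: Nil
    case _ => Nil
  }
}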
Second, InMemoryColumnarTableScan
InMemoryColumnarTableScan is a leaf node in Catalyst. It holds the attributes to be queried and an InMemoryRelation (which encapsulates our cached in-memory columnar data structure). Executing the leaf node invokes its execute() method, which queries the in-memory data as follows:
1. On execution, InMemoryRelation is asked to operate on each partition of the in-memory data structure it encapsulates.
2. The requested attributes are collected; in the example above, the query requests the value attribute of the src table.
3. From the requested expressions, the indices of the requested columns in the underlying storage structure are looked up.
4. Each buffer is read through a ColumnAccessor, and the resulting field values are wrapped into a Row object and returned.
private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation)
  extends LeafNode {

  override def output: Seq[Attribute] = attributes

  override def execute() = {
    relation.cachedColumnBuffers.mapPartitions { iterator =>
      // Find the ordinals of the requested columns. If none are requested, use the first.
      val requestedColumns = if (attributes.isEmpty) {
        Seq(0)
      } else {
        // Find the index of each requested column in relation.output by matching exprId.
        attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
      }

      iterator
        // Pick each requested column's ByteBuffer by index and wrap it in a ColumnAccessor.
        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
        .flatMap { columnAccessors =>
          val nextRow = new GenericMutableRow(columnAccessors.length) // row length = number of requested columns
          new Iterator[Row] {
            override def next() = {
              var i = 0
              while (i < nextRow.length) {
                // Read the next field from each buffer and set it into the row slot.
                columnAccessors(i).extractTo(nextRow, i)
                i += 1
              }
              nextRow
            }

            override def hasNext = columnAccessors.head.hasNext
          }
        }
    }
  }
}
The columns requested by the query can be inspected as follows:
scala> exe.optimizedPlan
res93: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [value#5]
 InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val relation = exe.optimizedPlan(1)
relation: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)

scala> val request_relation = exe.executedPlan
request_relation: org.apache.spark.sql.execution.SparkPlan =
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))

scala> request_relation.output // the requested columns; we requested only the value column
res95: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)

scala> relation.output // the relation holds all columns by default
res96: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(key#4, value#5)

scala> val attributes = request_relation.output
attributes: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)
The process is concise; the key is the third step, which finds the index of each requested column by its exprId:
attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
// Find the corresponding index according to exprId
scala> val attr_index = attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
attr_index: Seq[Int] = ArrayBuffer(1) // the requested value column has index 1; we read its data from the ByteBuffer at index 1

scala> relation.output.foreach(e => println(e.exprId))
ExprId(4) // ExprId(4) and ExprId(5) correspond to [key#4,value#5]
ExprId(5)

scala> request_relation.output.foreach(e => println(e.exprId))
ExprId(5)
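To see the indexWhere mechanics in isolation, here is a self-contained toy version (the Attr case class is a hypothetical stand-in for Catalyst's Attribute, not a Spark type):

// Hypothetical stand-in for Catalyst attributes, just to illustrate the lookup.
case class Attr(name: String, exprId: Int)

val relationOutput = Seq(Attr("key", 4), Attr("value", 5)) // all cached columns
val requested      = Seq(Attr("value", 5))                 // the query asks only for value

val attrIndex = requested.map(a => relationOutput.indexWhere(_.exprId == a.exprId))
// attrIndex == Seq(1): read only the ByteBuffer at index 1 of each cached batch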
Third, ColumnAccessor
ColumnAccessor has a concrete implementation for each column type; its class diagram is as follows:
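In outline, the hierarchy looks like this (paraphrasing Spark 1.1's ColumnAccessor.scala; the subclass list should be treated as approximate):

// Core interface (abbreviated): one accessor per column, reading that column's ByteBuffer.
private[sql] trait ColumnAccessor {
  def hasNext: Boolean
  def extractTo(row: MutableRow, ordinal: Int)
}
// BasicColumnAccessor[T, JvmType]  -- reads raw fields of ColumnType T from the buffer
//   NativeColumnAccessor[T]        -- adds null handling and compression support
//     BooleanColumnAccessor, ByteColumnAccessor, ShortColumnAccessor, IntColumnAccessor,
//     LongColumnAccessor, FloatColumnAccessor, DoubleColumnAccessor, StringColumnAccessor,
//     TimestampColumnAccessor
//   BinaryColumnAccessor           -- binary columns
//   GenericColumnAccessor          -- fallback for other (generic) types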
Finally, a new Iterator[Row] is returned:
new Iterator[Row] {
  override def next() = {
    var i = 0
    while (i < nextRow.length) { // length = number of requested columns
      // Calls columnType.setField(row, ordinal, extractSingle(buffer)) to decode the buffer.
      columnAccessors(i).extractTo(nextRow, i)
      i += 1
    }
    nextRow // return the decoded row
  }
  override def hasNext = columnAccessors.head.hasNext
}
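Inside the accessor, extractTo is essentially a one-liner. A sketch of the relevant methods, paraphrased from BasicColumnAccessor in Spark 1.1 (signatures abbreviated):

// columnType knows how to decode one value of its type from the ByteBuffer
// and how to write that value into a row slot.
def extractTo(row: MutableRow, ordinal: Int): Unit = {
  columnType.setField(row, ordinal, extractSingle(buffer))
}

def extractSingle(buffer: ByteBuffer): JvmType = columnType.extract(buffer)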
Fourth, summary
Querying Spark SQL's in-memory columnar storage is relatively simple, and the query approach follows directly from the storage layout.
That is, at store time each column is packed into its own ByteBuffer, forming an array of ByteBuffers per batch.
At query time, the index into that array is looked up from the exprId of each requested column; the fields in the buffer are then decoded with a ColumnAccessor and finally wrapped into a Row object, which is returned.
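As a closing illustration, here is a self-contained toy model of this read path using plain java.nio and no Spark APIs (names like ColumnarScanSketch and intColumn are invented for the example):

import java.nio.ByteBuffer

// Toy model of the in-memory columnar read path: a cached batch is an Array[ByteBuffer]
// with one buffer per column, and a scan reads only the requested column indices.
object ColumnarScanSketch extends App {
  def intColumn(values: Int*): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 * values.length)
    values.foreach(v => buf.putInt(v))
    buf.flip() // switch the buffer from writing to reading
    buf
  }

  // Batch layout mirrors the example: index 0 = key, index 1 = value.
  val batch: Array[ByteBuffer] = Array(intColumn(1, 2, 3), intColumn(10, 20, 30))

  // Like attr_index above, the query requests only column index 1 (value).
  val requested = Seq(1)
  val accessors = requested.map(batch(_))

  while (accessors.head.hasRemaining) {
    // "extractTo": pull one field per requested column into the next row.
    val row = accessors.map(_.getInt())
    println(row.mkString("[", ", ", "]")) // prints [10], [20], [30]
  }
}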
--eof--
Original article; please attribute when reproducing:
Reprinted from: the blog of OopsOutOfMemory (Shengli)
Article link: http://blog.csdn.net/oopsoom/article/details/39577419
Note: This article is licensed under the Attribution-NonCommercial-NoDerivs 2.5 China (CC BY-NC-ND 2.5 CN) license. Reposting, forwarding, and commenting are welcome, but please retain the author's attribution and the link to the article. Please contact me to negotiate commercial or other licensed use.